
Understanding Quality of Experience Alerting


By Arish Alreja, Microsoft Corporation

Published May 2011


Copyright
The information contained in this document represents the current view of Microsoft Corporation on the
issues discussed as of the date of publication. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot
guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED
OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or
for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, email address, logo, person, place or event is intended or should be
inferred.

© 2011 Microsoft Corporation. All rights reserved.

Microsoft and Lync are either registered trademarks or trademarks of Microsoft Corporation in the United
States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective
owners.

Microsoft Corporation 2
Contents
Copyright
Overview of QoE
QoE "Currency"
How QoE Alerting Works
Categories in QoE Alerting
Network Locations
Media Infrastructure
Parameters for QoE Alerting
Frequency of Polling (T)
Sliding Time Window (W)
Minimum Call Volume (V)
Threshold of the poor call percentage required to generate an error alert (Error)
Threshold of the poor call percentage required to generate a warning alert (Warn)
Include External calls (External)
Include Wi-Fi Calls (WiFi)
Include VPN Calls (VPN)
Alerting Algorithm Flowchart
Examples of QoE Alert Generation
Deployment Considerations for QoE Alerting
Subnet vs. Location-Based Alerting
Time Window (W) vs. Minimum Call Volume (V)
Media Impeding Factors

Overview of QoE
Voice is a mission-critical workload, so detecting, diagnosing, and addressing voice quality issues in a
deployment is an important part of an enterprise administrator's job. Microsoft's solution for measuring, reporting, and
alerting on voice quality issues is based on two key features:

 • The Monitoring Server, which stores media quality data for voice and video calls
 • The Microsoft® System Center Operations Manager (formerly Microsoft Operations Manager) pack, which
periodically evaluates the media quality data and raises real-time alerts whenever it detects voice quality issues
in an enterprise deployment
At the end of a call, every Microsoft unified communications (UC) endpoint sends a Quality of Experience (QoE) report
containing a rich set of metrics that reflect the perceived voice quality of the call. These reports are
then stored in the Monitoring Server's QoE database. This data can be accessed and analyzed by using tools such as
Microsoft® Lync™ Server 2010 Monitoring Server Reports or QoE alerting. In this document, we discuss how voice quality
reporting and alerting work in Lync Server 2010, as well as the considerations for successfully deploying this technology.

QoE "Currency"
Lync Server introduces the notion of good quality versus poor quality calls. Classifying each call as either a good call or
a poor call helps eliminate some of the complexity involved in analyzing QoE data: administrators no longer have to
analyze all the metrics for a given call and determine for themselves whether it was a good call or a poor call. The call
classification criteria used to make these quality determinations are based on a set of seven core metrics reported by all
UC endpoints for each voice call. Each of these seven call classification metrics has a defined threshold. For example, the
round trip time metric has a defined threshold of 500 milliseconds. If any one of these thresholds is exceeded (for
example, if the round trip time is 805 milliseconds), then the call is classified as a poor quality call. This is the case
even if all the other metrics fall within the acceptable range. These core metrics are known as the "currency" of QoE
alerting.

The following metrics make up the call classification criteria:

Call classification metric     Optimal range       Acceptable range

Jitter                         20 milliseconds     30 milliseconds
Packet Loss                    0.05                0.10
Network MOS Degradation        0.60                1.0
Round Trip Time                200 milliseconds    500 milliseconds
Healer Metric: Concealed       0.03                0.07
Healer Metric: Stretched       1.0                 1.0
Healer Metric: Suppressed      1.0                 1.0
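
The classification rule can be sketched in a few lines of Python. This is only an illustration of the "any one metric over its threshold" rule described above, not the Monitoring Server's implementation. To keep it short, it covers three of the seven metrics, and the metric key names are hypothetical.

```python
# (optimal, acceptable) upper bounds for a subset of the metrics in the table.
# The key names are illustrative and are not Lync Server identifiers.
THRESHOLDS = {
    "jitter_ms": (20, 30),
    "round_trip_ms": (200, 500),
    "concealed_ratio": (0.03, 0.07),
}


def rate_metric(name, value):
    """Return 'optimal', 'acceptable', or 'poor' for one reported metric."""
    optimal, acceptable = THRESHOLDS[name]
    if value <= optimal:
        return "optimal"
    if value <= acceptable:
        return "acceptable"
    return "poor"


def is_poor_call(metrics):
    """A call is poor if any single metric exceeds its acceptable threshold."""
    return any(rate_metric(name, value) == "poor" for name, value in metrics.items())


call = {"jitter_ms": 12, "round_trip_ms": 805, "concealed_ratio": 0.01}
print(is_poor_call(call))  # True: the round trip time exceeds 500 milliseconds
```

Note that the call above is classified as poor even though its jitter and concealed ratio are both in the optimal range.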

The call classification criteria apply to all media quality reporting and alerting. The media quality reports included with
Lync Server Monitoring Server Reports count and report poor quality calls based on these criteria. In addition, the reports
highlight these metrics any time they exceed the acceptable threshold. This makes it easier for administrators
troubleshooting a poor quality call to identify the root cause for that poor call. For example, note in the following figure
how metrics that exceed the acceptable threshold level are highlighted in red. Metrics that exceed the optimal level but
are still in the acceptable range are highlighted in yellow:

How QoE Alerting Works


Detecting and diagnosing the root cause of voice quality issues in a deployment has traditionally been a task that
requires a significant amount of specialized knowledge. To make it possible for administrators without this specialized
knowledge to accomplish these tasks, Lync Server relies on an intelligent alerting algorithm that can analyze the data and
identify areas of concern.

In this document, we provide an overview of how QoE alerting works, as well as examples of how the technology
operates in an actual deployment. That overview is followed by a discussion about how to optimally configure the
alerting algorithm based on deployment-specific considerations in order to achieve accurate alerting and to prevent
noise.

Note. In System Center Operations Manager, "noise" refers to the problem of administrators receiving too many
alerts, including alerts that aren't really important and alerts that are duplicates of previously-issued alerts. One
of the key considerations in using Operations Manager is to figure out a way to suppress noise while still allowing
the truly important alerts to surface.

Categories in QoE Alerting


In a deployment, several kinds of voice sessions might be in progress at any given time. In addition, each of these
sessions might be experiencing different voice quality issues. Because of this, the Lync Server QoE alerting technology
groups poor quality calls into predefined infrastructure categories. Alerts are raised only if a large percentage of the total
calls for a particular instance of a particular category are classified as poor quality calls. In a moment, we'll show you
exactly what that means.

In the following section, we describe the deployment used when we discuss how QoE alerting works in an actual
organization. Our sample deployment, one in which the enterprise network bridges the user sites and regions with wide
area network (WAN) links, is shown in the following figure:

As you can see, the deployment contains two regions:

 • Europe
 • North America
The North America region contains three user sites:

 • New York
 • San Francisco
 • Boston
The Europe region contains two user sites:

 • London
 • Munich
The New York site also includes the following items:

 • Two A/V Conferencing Servers (NY_AVMCU_01 and NY_AVMCU_02)
 • One Mediation Server (NY_Mediation)
 • Three public switched telephone network (PSTN) Gateways (NY_GW_01, NY_GW_02, and NY_GW_03)
The categories that apply to QoE alerting are discussed in the following sections. They fall into two
main groups: Network Locations and Media Infrastructure.

Network Locations

Apart from subnets, all the categories detailed in this section must be created using the Lync Server network
configuration capabilities and are not available by default.

Note. We do not discuss how user sites and regions are defined in Lync Server.

Subnets
This category includes all calls that are made within a subnet, to a subnet, or from a subnet. That means that the caller
endpoint, the callee endpoint, or both endpoints must be in the subnet for a call to be counted among the total calls for
that subnet.

Within User Sites


This category counts all calls that originate and terminate within a user site, which is a network entity composed of a
group of subnets. For example, the New York user site includes all the subnets found in Contoso.com's New York office.
In our sample deployment, this category counts all calls that originate and terminate within the New York user site; that
means that both endpoints involved in the call must belong to the collection of subnets grouped within the New York
user site.

Between Different User Sites


This category counts calls between endpoints that are in different user sites. Each pair of logical user sites represents
a unique instance for this category. For example, suppose London, New York, San Francisco, Munich, and Boston are the
user sites in the deployment. That means this category will have the following pairs of sites:

 • London-New York
 • London-San Francisco
 • London-Munich
 • London-Boston
 • New York-San Francisco
 • New York-Boston
 • New York-Munich
 • San Francisco-Munich
 • San Francisco-Boston
 • Munich-Boston
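
The number of instances in this category grows quickly with the number of user sites: n sites yield n(n-1)/2 unordered pairs. As a quick check of the list above, the pairs for the five sample sites can be enumerated with Python's standard library (the order within a pair is arbitrary here):

```python
from itertools import combinations

sites = ["London", "New York", "San Francisco", "Munich", "Boston"]

# Every unordered pair of distinct user sites is one instance of the category.
site_pairs = [f"{a}-{b}" for a, b in combinations(sites, 2)]

print(len(site_pairs))  # 10
```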

Within a Region
This category counts all calls that originate and terminate within a specific region (such as North America). A region is a
network entity composed of a group of user sites. For example, the North America region includes the New York, Boston,
and San Francisco user sites.

Between Different Regions


This category counts calls made from endpoints that are in different regions. Each pair of regions is considered a unique
instance for this category. For example, if North America and Europe are the only two regions in Contoso.com's
deployment, then there is only one region pair: North America-Europe.
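
To make the site and region categories concrete, here is a minimal sketch of how a single call could be mapped to the instances it is counted under. It assumes that a call is counted at both the user site level and the region level, which is our reading of the definitions above rather than a documented rule. The site-to-region mapping, function name, and category labels are hypothetical, and the subnet category is left out for brevity.

```python
# Hypothetical network configuration for the sample deployment.
SITE_REGION = {
    "New York": "North America",
    "San Francisco": "North America",
    "Boston": "North America",
    "London": "Europe",
    "Munich": "Europe",
}


def location_instances(caller_site, callee_site):
    """Return the (category, instance) pairs that a call is counted under."""
    instances = []
    if caller_site == callee_site:
        instances.append(("Within User Site", caller_site))
    else:
        # Sort so that A-B and B-A map to the same instance.
        pair = "-".join(sorted([caller_site, callee_site]))
        instances.append(("Between User Sites", pair))

    caller_region = SITE_REGION[caller_site]
    callee_region = SITE_REGION[callee_site]
    if caller_region == callee_region:
        instances.append(("Within Region", caller_region))
    else:
        pair = "-".join(sorted([caller_region, callee_region]))
        instances.append(("Between Regions", pair))
    return instances


print(location_instances("New York", "London"))
```

Under these assumptions, a New York to London call is counted for the London-New York site pair and for the Europe-North America region pair, while a New York to Boston call is counted for the Boston-New York site pair and for the North America region.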

Media Infrastructure

All instances of the categories in the Media Infrastructure group are automatically detected and generated by
Monitoring Server for the purposes of QoE alerting.

A/V Conferencing Server
This category applies to all conferencing sessions that involve audio, video, or both. Each individual A/V Conferencing
Server is considered a unique instance for this category.

Note. Conference announcements and the Response Group application’s interactive voice responses (IVRs) are
counted as calls when determining the total calls for this category.

PSTN Infrastructure (Media Bypass): PSTN Gateway


This category applies to all calls that go directly to a PSTN gateway without first passing through a Mediation Server. Each
PSTN gateway is a unique instance for this category.

PSTN Infrastructure (Non-Bypass): Mediation Server Proxy Leg


This category applies to PSTN calls that do pass through the Mediation Server. Calls that pass through the Mediation
Server involve two sessions: one session between the UC endpoint and the Mediation Server and another session
between the Mediation Server and the PSTN gateway. This category is only concerned with the sessions between the UC
endpoint and the Mediation Server. Each Mediation Server is a unique instance of this category.

PSTN Infrastructure (Non-Bypass): Mediation Server–Gateway Leg


This category also applies to PSTN calls that pass through the Mediation Server. As noted earlier, calls that pass through
the Mediation Server involve two sessions: one session between the UC endpoint and the Mediation Server and another
session between the Mediation Server and the PSTN gateway. This category is concerned only with the sessions
between the Mediation Server and the PSTN gateway. Each Mediation Server-PSTN gateway pair represents a unique
instance of this category.

Note. A Mediation Server can be configured to route calls to multiple PSTN gateways. Because of this, a single
Mediation Server can appear in multiple alerts, assuming there are voice quality problems between that server
and several PSTN gateways.
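
The three PSTN categories differ only in which leg of the call they watch. The sketch below shows which instances a PSTN call would be counted under, depending on whether it bypasses the Mediation Server. The function name, category labels, and the "server/gateway" naming of the pair instance are assumptions made for illustration.

```python
def pstn_instances(gateway, mediation_server=None):
    """Return the (category, instance) pairs for one PSTN call.

    With media bypass there is no Mediation Server in the media path, so only
    the gateway is counted. Otherwise the call has two sessions, one per leg.
    """
    if mediation_server is None:
        return [("PSTN Gateway (Media Bypass)", gateway)]
    return [
        ("Mediation Server Proxy Leg", mediation_server),
        ("Mediation Server-Gateway Leg", f"{mediation_server}/{gateway}"),
    ]


print(pstn_instances("NY_GW_01"))
print(pstn_instances("NY_GW_02", mediation_server="NY_Mediation"))
```

Because the gateway leg is keyed by the server and gateway together, NY_Mediation can show up in one instance per gateway it routes to, which is why a single Mediation Server can appear in several alerts.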

Parameters for QoE Alerting


In this section, we describe the parameters available for tuning QoE alerting from within the Operations Manager
console. These parameters provide administrators with considerable flexibility when configuring QoE alerting based on
deployment needs.

Subsequent sections contain a more detailed discussion of deployment considerations and how they relate to your
alerting configuration. You’ll also find guidelines for configuring QoE alerting appropriately for your deployment.

Frequency of Polling (T)

This parameter determines how frequently the QoE alerting algorithm checks for anomalies in the QoE reports sent by
UC endpoints. The default time interval is 15 minutes. Running the algorithm more frequently brings detection closer to
real time but also puts additional load on the Monitoring Server, which must execute the complex
alerting logic more often.

Sliding Time Window (W)


This parameter defines the time window for which data is included in the alerting algorithm's analysis. The default time
window is two hours. That means that, if the algorithm is executed at 12:30 PM, then that algorithm will take into
account all the QoE reports received between 10:30 AM and 12:30 PM. Setting the time window to one hour would
mean that only the reports received in the past hour (that is, between 11:30 AM and 12:30 PM) will be used in the
analysis.
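
The window arithmetic is simple enough to express directly. The following sketch computes the start and end of the sliding window for a given execution time; the function name and the calendar date are arbitrary choices for illustration, and whether the boundaries are inclusive is not specified here.

```python
from datetime import datetime, timedelta


def window_bounds(run_time, window_minutes=120):
    """Return (start, end) of the sliding time window for one execution."""
    return run_time - timedelta(minutes=window_minutes), run_time


# Algorithm executed at 12:30 PM with the default two-hour window.
start, end = window_bounds(datetime(2011, 5, 2, 12, 30))
print(start.time(), end.time())  # 10:30:00 12:30:00

# The same execution with a one-hour window.
start_1h, _ = window_bounds(datetime(2011, 5, 2, 12, 30), window_minutes=60)
print(start_1h.time())  # 11:30:00
```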

Minimum Call Volume (V)

This parameter indicates how many calls must be in any instance of a category before an alert can be triggered. The
default is 50, which means that no alerts will be triggered unless there are at least 50 calls in the instance. If an instance
has only 49 calls, no alert will be issued, even if all 49 are considered poor calls. Note that this value should never be set
below 50.

Threshold of the poor call percentage required to generate an error alert (Error)

This parameter specifies the percentage of total calls that have to be classified as poor quality calls for an error alert to
be raised. By default, this threshold is set to 12%. This means that, if a category has 100 calls and 11 of them are classified
as poor, no error alert will be raised. That's because only 11% of the calls have been classified as poor quality calls.

Threshold of the poor call percentage required to generate a warning alert (Warn)

This parameter specifies the percentage of total calls that have to be classified as poor quality calls for a warning alert to
be raised. By default, this threshold is set to 10%. This means that, if a category has 100 calls and nine of them are
classified as poor, no warning alert will be raised. That's because only 9% of the calls have been classified as poor quality
calls.

Include External calls (External)

This parameter indicates whether calls from external users (calls made over the A/V Edge Server) are included for the
purposes of QoE alerting. By default, this value is set to False, meaning that external calls are not considered by the QoE
alerting algorithm.

Include Wi-Fi Calls (WiFi)

This parameter indicates whether calls made over a wireless connection are included for the purposes of QoE alerting. By
default, this value is set to False, meaning that calls made over a wireless connection are not considered by the QoE
alerting algorithm.

Include VPN Calls (VPN)

This parameter indicates whether calls made through a virtual private network (VPN) connection are included for the
purposes of QoE alerting. By default, this value is set to False, meaning that VPN calls are not considered by the QoE
alerting algorithm.
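
The effect of the three Include parameters can be pictured as a filter applied before any calls are counted. The sketch below is one way to express that filter. The Call record and its field names are hypothetical and do not reflect the QoE database schema.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical call record; the field names are illustrative only."""
    poor: bool
    external: bool = False
    wifi: bool = False
    vpn: bool = False


def eligible_calls(calls, include_external=False, include_wifi=False, include_vpn=False):
    """Drop calls that the External, WiFi, and VPN parameters exclude."""
    kept = []
    for call in calls:
        if call.external and not include_external:
            continue
        if call.wifi and not include_wifi:
            continue
        if call.vpn and not include_vpn:
            continue
        kept.append(call)
    return kept


calls = [Call(poor=False), Call(poor=True, wifi=True), Call(poor=True, vpn=True)]
print(len(eligible_calls(calls)))                     # 1 (defaults: all False)
print(len(eligible_calls(calls, include_wifi=True)))  # 2
```

With the defaults, the two poor calls in this example never reach the alerting algorithm, so they affect neither the total call count nor the poor call percentage.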

Alerting Algorithm Flowchart
The process by which the alerting algorithm decides to issue an error, a warning, or do nothing at all is shown in the
following flowchart. This process will be explained in detail in the next section.

Examples of QoE Alert Generation
Let's now walk through some examples of how the alerting algorithm applies to a particular instance (the New York site's
NY_AVMCU_01) of a particular category (A/V Conferencing Server). This example illustrates how the QoE alerting
algorithm works for any instance of any of the alerting categories previously described. For this discussion, we'll use the
default settings for the alerting configuration:

 • Time window (W) = 120 minutes
 • Polling frequency (T) = 15 minutes
 • Minimum call volume (V) = 50
 • Error threshold (Error) = 12%
 • Warning threshold (Warn) = 10%
 • WiFi = False
 • External = False
 • VPN = False
The following graph shows the call volume for a particular day on NY_AVMCU_01, with the green and red sections of the
chart representing the good quality and the poor quality calls, respectively. The granularity of data points included on
this chart is based on 15-minute time intervals:

[Figure labels: Total Calls in the Time Window (W = 120 minutes); Minimum Call Volume (V = 50); Poor Quality Percentage
Over Calls in the Past W Minutes; Warning Threshold (Warn = 10%); Error Threshold (Error = 12%)]

As we noted, in this example, the alerting algorithm executes with a polling frequency of once every 15 minutes (T = 15
minutes). For illustration purposes, we'll look at the execution and results of the algorithm at 6:00 AM, 10:00 AM, and
4:00 PM.

At 6:00 AM:

 • All the calls on NY_AVMCU_01 between 4:00 AM and 6:00 AM (W = 2 hours) are counted.
 • There is a total of 15 calls, which means that NY_AVMCU_01 is not alert eligible. That's because the call count must
be at least 50 (V = 50).
 • No further analysis takes place for NY_AVMCU_01 until the next polling interval (T = 15 minutes later).
At 10:00 AM:

 • All the calls on NY_AVMCU_01 between 8:00 AM and 10:00 AM (W = 2 hours) are counted.
 • There is a total of 89 calls, which means that NY_AVMCU_01 is alert eligible (V = 50).
 • There is a total of four poor quality calls in the 8:00 AM – 10:00 AM window.
 • The algorithm calculates the poor quality call percentage as 4.49% (four poor calls divided by 89 total
calls).
 • The poor quality call percentage (4.49%) is less than both the Error (12%) and Warn (10%) thresholds.
 • No Operations Manager alerts are generated.
At 4:00 PM:

 • All the calls on NY_AVMCU_01 between 2:00 PM and 4:00 PM (W = 2 hours) are counted.
 • There is a total of 162 calls, which means that NY_AVMCU_01 is alert eligible (V = 50).
 • There is a total of 23 poor quality calls in the 2:00 PM – 4:00 PM window.
 • The algorithm calculates the poor quality call percentage as 14.20% (23 poor calls divided by 162 total
calls).
 • The poor quality call percentage (14.20%) is greater than both the Error (12%) and Warn (10%) thresholds.
 • As a result, an Error level Operations Manager alert is generated.
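
The three walkthroughs above follow the same decision sequence, which can be sketched as a single function. This is an illustration of the steps as described, not the management pack's actual logic. It assumes an alert is raised when the percentage meets or exceeds a threshold (the walkthroughs do not exercise the exact boundary), and the function name and return values are hypothetical.

```python
def evaluate(total_calls, poor_calls, min_volume=50, error_pct=12.0, warn_pct=10.0):
    """Return 'error', 'warning', or 'none' for one instance at one polling time."""
    # Step 1: the instance must have enough calls in the time window.
    if total_calls < min_volume:
        return "none"
    # Step 2: compute the poor call percentage over the window.
    poor_pct = 100.0 * poor_calls / total_calls
    # Step 3: compare against the thresholds, most severe first.
    if poor_pct >= error_pct:
        return "error"
    if poor_pct >= warn_pct:
        return "warning"
    return "none"


print(evaluate(15, 15))   # none: below the minimum call volume, even if all calls were poor
print(evaluate(89, 4))    # none: about 4.49% is below both thresholds
print(evaluate(162, 23))  # error: about 14.20% is above the error threshold
```

The poor call count at 6:00 AM is not given in the example, so the first call uses the worst case to show that the minimum call volume check alone suppresses the alert.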

Deployment Considerations for QoE Alerting


Now let's look at some deployment-specific considerations that are relevant to successfully configuring and deploying
QoE alerting.

Subnet vs. Location-Based Alerting

Subnets represent the default out-of-the-box mode for network location alerting. All other categories require the
provisioning of network data that maps subnets to user sites and user sites to regions. We recommend that network data
be provisioned any time that QoE alerting is deployed due to the following considerations:

 • High alert volume due to granularity. Many organizations have a large number of subnets, which can result in a
large number of alerts being generated from the same underlying network issue. For example, if a network
outage affects five different subnets, alerts can be generated from all five subnets.
 • Scaling in Operations Manager. As noted, many organizations have a large number of subnets. Operations
Manager allows a maximum of 500 instances to be monitored. This means that, if a deployment has more than
500 subnets, not all of those subnets can be monitored for voice quality.
 • Call volume. Depending on the size and granularity of a subnet, it is quite possible that subnets by themselves
will not have enough call volume to be eligible for alerts.
Subnet-based alerting is the most primitive method of monitoring network locations for voice quality issues and is
subject to the limitations just described. Network location-based QoE alerting at the user site and region levels is
recommended because it provides a less noisy and more actionable view of voice quality. When network locations are
configured, it is recommended that you turn off subnet-based alerting.

Time Window (W) vs. Minimum Call Volume (V)

Having a reasonable value for minimum call volume (for example, V = 50 or more) prevents a small number of poor quality
calls from resulting in an alert. With only five calls in the time window, for instance, a single poor call would already
amount to a poor call percentage of 20%. A reasonable minimum call volume prevents this type of noise.

We recommend that you do not set the minimum call volume to less than 50. However, based on the number of calls
you receive, you might want to increase this value.

If an instance being monitored has fewer calls than the minimum call volume at the time it is polled, then any poor quality
calls experienced on that instance will not be considered for alerting purposes. Because of that, QoE alerting is effective
only if as many instances as possible meet the minimum call volume any time they are polled. For example, if
you only get 20 calls per hour, then a time window of two hours will not be very useful: during those two hours you can
expect to get only 40 calls, which is less than the minimum call volume. Meeting the minimum call volume is typically not
a problem in large deployments with hundreds of calls per hour. However, in smaller deployments, there might not be
enough call volume to actively monitor your infrastructure using the default settings.

To get around the call volume issue, the time window (W), which has a default of 120 minutes, can be extended. This
means that, instead of calls from the last two hours being counted when an instance of infrastructure is polled, calls from
the last three hours or perhaps the last four hours will be counted. (The actual value depends on the value you configure
for the time window.) This helps ensure that the minimum number of calls will be counted, making more instances alert
eligible.
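
As a rough sizing aid, the shortest window that can reach the minimum call volume follows directly from the call rate. The sketch below assumes calls arrive at a steady average rate, which real traffic does not, so treat the result as a lower bound. The function name is hypothetical.

```python
import math


def min_window_minutes(calls_per_hour, min_volume=50):
    """Shortest time window (in minutes) expected to contain min_volume calls."""
    return math.ceil(min_volume * 60 / calls_per_hour)


print(min_window_minutes(20))   # 150: two hours is not enough at 20 calls per hour
print(min_window_minutes(100))  # 30
```

At 20 calls per hour, the window has to be at least two and a half hours before an instance can become alert eligible with V = 50.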

Of course, there is a tradeoff to using a longer time window: the poor quality calls caused by an outage or disruption
continue to be counted for up to W minutes after they occurred, so QoE alerts can continue to be issued even though the
problem has already been fixed. For example, suppose network equipment goes down at 4:00 PM, leading to congestion that
results in 80% of all the calls being classified as poor quality calls. The network problem is resolved at 4:30 PM. However,
with a time window of two hours, the poor quality calls that happened between 4:00 PM and 4:30 PM will continue to
contribute to QoE alerts until 6:30 PM.

Limitations of the Time Window Approach


For deployments that have very low call volumes (call volumes that might require time windows of four hours or more),
QoE alerting might not be the ideal solution for monitoring voice quality. In those cases, proactive use of Monitoring
Server Reports is recommended.

Corollary: High Minimum Call Volume (V) Allows for More Granular Alerting Thresholds
In large deployments with aggressive service level agreements for voice quality, you might need alerting thresholds that
are more granular than the default call volume allows. For example, consider this question:

If V = 50, what is the most granular threshold that can be set?

To answer that, assume that exactly 50 calls take place in the time window. The most granular warning or error
percentage threshold we can set is +/- 2%. For example, if we set the threshold to 5%, then the third poor call (out of 50)
will result in an alert. Why? Because two poor calls (out of 50) equals a poor call percentage of 4%; three poor calls
equals a poor call percentage of 6%, which exceeds the 5% threshold. 6% - 4% yields a granularity of 2%.

Compare that to this question:

If V = 500, what is the most granular threshold we can set?

Again, assume that exactly 500 calls take place. In that case, the most granular warning or error percentage threshold we
can set is +/- 0.2%. For example, if we set the threshold to 5%, then the 25th poor quality call that takes place (out of
500) will result in an alert. Why? Because 24 poor calls is a poor call percentage of 4.8%; 25 poor calls results in a poor
call percentage of 5%, which reaches the threshold and triggers an alert. In this example, 5% - 4.8% leaves a granularity
of 0.2%.

That means that a more granular threshold can be monitored without noise as long as we have a larger minimum call
volume configured.
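
Both answers follow from the same arithmetic: with exactly V calls in the window, each additional poor call moves the percentage by 100/V. The sketch below reproduces the two cases. It assumes that an alert fires when the percentage meets or exceeds the threshold, as in the V = 500 case above, and the function names are hypothetical.

```python
import math


def granularity_pct(volume):
    """Change in poor call percentage caused by one more poor call."""
    return 100 / volume


def poor_calls_to_alert(volume, threshold_pct):
    """Smallest number of poor calls whose percentage meets the threshold."""
    return math.ceil(threshold_pct * volume / 100)


print(granularity_pct(50), poor_calls_to_alert(50, 5))    # 2.0 3
print(granularity_pct(500), poor_calls_to_alert(500, 5))  # 0.2 25
```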

Media Impeding Factors


When media is routed over the Internet or over physical layers with variable reliability (such as wireless), maintaining
good voice quality may be beyond the control of an enterprise administrator. In general, we do not expect the following
parameters for QoE alerting to be modified by an enterprise administrator in the typical deployment. However, the
presence of these parameters recognizes the deployment choices that can be made in an enterprise (for example,
supporting Voice over Internet Protocol, or VoIP, over wireless, VPNs, and so on). The following sections provide a brief
summary of the pros and cons of each choice.

Include External Calls (Default Value False)

External calls are subject to media being routed over the Internet, through the A/V Edge Server, and then on to the
enterprise deployment. Voice quality is going to be impacted by the route the media takes over the Internet, a route that
the administrator cannot control. We recommend that this value remain set to False, which will prevent external calls
from being factored into the poor call quality percentage.

Include Wi-Fi (Default Value False)

Wireless networks are "lossier" than wired networks because of the variation in the reliability of the physical layer of a
wireless network. That simply means that a call over a wireless network is likely to suffer more degradation in call quality
than a call over a wired network. (It's for this very reason that many enterprises do not support VoIP over their wireless
networks.)

Because of this, wireless calls are excluded from the QoE alerting algorithm by default. This should be changed only if
you want to support VoIP calls over wireless and receive alerts in case of poor call quality.

Include VPN (Default Value False)


A VPN involves a logical network tunnel from the Internet to the enterprise deployment and is therefore subject to
network conditions on the Internet (the same problems faced by external calls, except that no A/V Edge Server is involved
with VPN calls). Because of this, VPN calls are excluded by default from the QoE alerting algorithm. This should be
changed only if you want to support VoIP calls over VPNs and receive alerts in case of poor call quality.
