This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED
OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or
for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses,
logos, people, places and events depicted herein are fictitious, and no association with any real company,
organization, product, domain name, email address, logo, person, place or event is intended or should be
inferred.
Microsoft and Lync are either registered trademarks or trademarks of Microsoft Corporation in the United
States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective
owners.
Microsoft Corporation 2
Contents
Copyright
Overview of QoE
QoE "Currency"
How QoE Alerting Works
Categories in QoE Alerting
Network Locations
Media Infrastructure
Parameters for QoE Alerting
Frequency of Polling (T)
Sliding Time Window (W)
Minimum Call Volume (V)
Threshold of the poor call percentage required to generate an error alert (Error)
Threshold of the poor call percentage required to generate a warning alert (Warn)
Include External Calls (External)
Include Wi-Fi Calls (WiFi)
Include VPN Calls (VPN)
Alerting Algorithm Flowchart
Examples of QoE Alert Generation
Deployment Considerations for QoE Alerting
Subnet vs. Location-Based Alerting
Time Window (W) vs. Minimum Call Volume (V)
Media Impeding Factors
Overview of QoE
Voice is a mission-critical workload, so detecting, diagnosing, and addressing voice quality issues in a
deployment is an important part of an enterprise administrator's job. Microsoft's solution for measuring, reporting, and
alerting on voice quality issues is based on two key features:
The Monitoring Server, which stores media quality data for voice and video calls
The Microsoft® System Center Operations Manager (formerly Microsoft Operations Manager) pack, which
periodically evaluates the media quality data and raises real-time alerts whenever it detects voice quality issues
in an enterprise deployment
At the end of a call, Quality of Experience (QoE) reports containing a rich set of metrics reflecting the perceived voice
quality experience of the call are reported by all Microsoft unified communications (UC) endpoints. These reports are
then stored in the Monitoring Server's QoE database. This data can be accessed and analyzed by using tools such as
Microsoft® Lync™ Server 2010 Monitoring Server Reports or QoE alerting. In this document, we discuss how voice quality
reporting and alerting works in Lync Server 2010 and also the considerations for successfully deploying this technology.
QoE "Currency"
Lync Server introduces the notion of good quality versus poor quality calls. Classifying each call as either a good call or
a poor call eliminates some of the complexity involved in analyzing QoE data: administrators no longer have to
analyze all the metrics for a given call and determine for themselves whether it was a good call or a poor call. The call
classification criteria used to make these quality determinations are based on a set of seven core metrics reported by all
UC endpoints for each voice call. Each of these seven call classification metrics has a defined threshold. For example, the
round trip time metric has a defined threshold of 500 milliseconds. If any one of these thresholds is exceeded (for
example, if the round trip time is 805 milliseconds), then the call is classified as a poor quality call. This is the case
even if all the other metrics fall within the acceptable range. These core metrics are known as the "currency" of QoE
alerting.
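The classification rule can be sketched in a few lines of Python. This is a minimal illustration, not the Lync Server implementation: only the 500 millisecond round trip threshold comes from this paper, while the metric names and the packet loss threshold are hypothetical placeholders.

```python
# Hypothetical metric names. Only the 500 ms round trip threshold is taken
# from this paper; the packet loss value is a placeholder for illustration.
THRESHOLDS = {
    "round_trip_ms": 500.0,
    "packet_loss_rate": 0.10,  # assumed value, not from the paper
}


def exceeded_metrics(report, thresholds=THRESHOLDS):
    """Return the names of the metrics in `report` that exceed their threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if name in report and report[name] > limit
    ]


def is_poor_call(report, thresholds=THRESHOLDS):
    """A call is poor as soon as any single metric exceeds its threshold."""
    return len(exceeded_metrics(report, thresholds)) > 0


# A round trip of 805 ms exceeds 500 ms, so the call is poor even though
# packet loss is well within its (assumed) limit.
call = {"round_trip_ms": 805.0, "packet_loss_rate": 0.01}
print(is_poor_call(call), exceeded_metrics(call))
```

Returning the list of exceeded metrics, rather than only a yes/no answer, mirrors the way the reports point an administrator at the metric responsible for a poor call.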
The call classification criteria apply to all media quality reporting and alerting. The media quality reports included with
Lync Server Monitoring Server Reports count and report poor quality calls based on these criteria. In addition, the reports
highlight these metrics any time they exceed the acceptable threshold. This makes it easier for administrators
troubleshooting a poor quality call to identify the root cause of that poor call. For example, note in the following figure
how metrics that exceed the acceptable threshold level are highlighted in red. Metrics that exceed the optimal level but
are still in the acceptable range are highlighted in yellow:
In this document, we provide an overview of how QoE alerting works, as well as examples of how the technology
operates in an actual deployment. That overview is followed by a discussion about how to optimally configure the
alerting algorithm based on deployment-specific considerations in order to achieve accurate alerting and to prevent
noise.
Note. In System Center Operations Manager, "noise" refers to the problem of administrators receiving too many
alerts, including alerts that aren't really important and alerts that are duplicates of previously-issued alerts. One
of the key considerations in using Operations Manager is to figure out a way to suppress noise while still allowing
the truly important alerts to surface.
In the following section, we describe the deployment used when we discuss how QoE alerting works in an actual
organization. Our sample deployment, one in which the enterprise network bridges the user sites and regions with wide
area network (WAN) links, is shown in the following figure:
As you can see, the deployment contains two regions:
Europe
North America
The North America region contains three user sites:
New York
San Francisco
Boston
The Europe region contains two user sites:
London
Munich
The New York site also includes the following items:
Network Locations
Apart from subnets, all the categories detailed in this section must be created using the Lync Server network
configuration capabilities and are not available by default.
Note. We do not discuss how user sites and regions are defined in Lync Server.
Subnets
This category includes all calls that are made within a subnet, to a subnet, or from a subnet. That means that the caller
endpoint, the callee endpoint, or both endpoints must be in the subnet for a call to be counted among the total calls for
that subnet.
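The counting rule for a subnet can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module; the call records and IP addresses are hypothetical.

```python
import ipaddress


def calls_for_subnet(calls, subnet_cidr):
    """Return the calls with the caller, the callee, or both inside the subnet."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return [
        (caller, callee)
        for caller, callee in calls
        if ipaddress.ip_address(caller) in subnet
        or ipaddress.ip_address(callee) in subnet
    ]


# Hypothetical (caller IP, callee IP) pairs.
calls = [
    ("10.1.1.10", "10.1.1.20"),  # within the subnet
    ("10.1.1.10", "10.2.5.7"),   # from the subnet
    ("10.3.0.4", "10.1.1.20"),   # to the subnet
    ("10.3.0.4", "10.2.5.7"),    # unrelated to the subnet
]
print(len(calls_for_subnet(calls, "10.1.1.0/24")))  # 3
```

Note that a call between two different subnets is counted once for each of them, so per-subnet totals can add up to more than the number of calls.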
London-New York
London-San Francisco
London-Munich
London-Boston
New York-San Francisco
New York-Boston
New York-Munich
San Francisco-Munich
San Francisco-Boston
Munich-Boston
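The ten pairs listed above are every unordered combination of the five user sites in the sample deployment. As a small sketch, the same list can be generated with `itertools.combinations`; the `SITES` list simply restates the sample deployment.

```python
from itertools import combinations

SITES = ["London", "New York", "San Francisco", "Munich", "Boston"]

# Each unordered pair of distinct sites is one instance to monitor.
site_pairs = ["-".join(pair) for pair in combinations(SITES, 2)]

for pair in site_pairs:
    print(pair)
print(len(site_pairs))  # 10
```

In general, n user sites produce n(n-1)/2 pairs, so the number of pair instances grows quickly as sites are added.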
Within a Region
This category counts all calls that originate and terminate within a specific region (such as North America). A region is a
network entity composed of a group of user sites. For example, the North America region includes the New York, Boston,
and San Francisco user sites.
Media Infrastructure
All instances of the categories under Media Infrastructure are automatically detected and generated by
Monitoring Server for the purposes of QoE alerting.
A/V Conferencing Server
This category applies to all conferencing sessions that involve audio, video, or both. Each individual A/V Conferencing
Server is considered a unique instance for this category.
Note. Conference announcements and the Response Group application’s interactive voice responses (IVRs) are
counted as calls when determining the total calls for this category.
Note. A Mediation Server can be configured to route calls to multiple PSTN gateways. Because of this, a single
Mediation Server can appear in multiple alerts, assuming there are voice quality problems between that server
and several PSTN gateways.
Subsequent sections contain a more detailed discussion of deployment considerations and how they relate to your
alerting configuration. You’ll also find guidelines for configuring QoE alerting appropriately for your deployment.
Frequency of Polling (T)
This parameter determines how frequently the QoE alerting algorithm checks for anomalies in the QoE reports sent by
UC endpoints. The default time interval is 15 minutes. Running the algorithm more frequently provides detection that is
closer to real time but puts an additional load on the Monitoring Server, which must execute the complex
alerting logic more often.
Minimum Call Volume (V)
This parameter indicates how many calls must be in any instance of a category before an alert can be triggered. The
default is 50, which means that no alerts will be triggered unless there are at least 50 calls in the instance. If an instance
has only 49 calls, no alert will be issued, even if all 49 are considered poor calls. Note that this value should never be set
below 50.
Threshold of the poor call percentage required to generate an error alert (Error)
This parameter specifies the percentage of total calls that have to be classified as poor quality calls for an error alert to
be raised. By default, this threshold is set to 12%. This means that, if a category has 100 calls and 11 of them are classified
as poor, no error alert will be raised. That's because only 11% of the calls have been classified as poor quality calls.
Threshold of the poor call percentage required to generate a warning alert (Warn)
This parameter specifies the percentage of total calls that have to be classified as poor quality calls for a warning alert to
be raised. By default, this threshold is set to 10%. This means that, if a category has 100 calls and nine of them are
classified as poor, no warning alert will be raised. That's because only 9% of the calls have been classified as poor quality
calls.
Include External Calls (External)
This parameter indicates whether calls from external users (calls made over the A/V Edge Server) are included for the
purposes of QoE alerting. By default, this value is set to False, meaning that external calls are not considered by the QoE
alerting algorithm.
Include Wi-Fi Calls (WiFi)
This parameter indicates whether calls made over a wireless connection are included for the purposes of QoE alerting. By
default, this value is set to False, meaning that calls made over a wireless connection are not considered by the QoE
alerting algorithm.
Include VPN Calls (VPN)
This parameter indicates whether calls made through a virtual private network (VPN) connection are included for the
purposes of QoE alerting. By default, this value is set to False, meaning that VPN calls are not considered by the QoE
alerting algorithm.
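For reference, the parameters and their defaults can be gathered into one structure. The sketch below is a hypothetical Python container, not an actual Lync Server or Operations Manager object; the field names are invented for illustration, and the default values are the ones stated in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QoEAlertingConfig:
    """Hypothetical container; defaults are the values stated in this paper."""

    polling_minutes: int = 15       # T
    window_minutes: int = 120       # W
    min_call_volume: int = 50       # V
    error_percent: float = 12.0     # Error
    warn_percent: float = 10.0      # Warn
    include_external: bool = False  # External
    include_wifi: bool = False      # WiFi
    include_vpn: bool = False       # VPN

    def __post_init__(self):
        # The paper advises never setting V below 50.
        if self.min_call_volume < 50:
            raise ValueError("min_call_volume should not be below 50")
        # Assumed sanity check: a warning should fire no later than an error.
        if self.warn_percent > self.error_percent:
            raise ValueError("warn_percent should not exceed error_percent")


defaults = QoEAlertingConfig()
print(defaults)
```

Validating the values when the object is created is a design choice of this sketch; it keeps a configuration that contradicts the guidance above from being used silently.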
Alerting Algorithm Flowchart
The process by which the alerting algorithm decides to issue an error, a warning, or do nothing at all is shown in the
following flowchart. This process will be explained in detail in the next section.
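The decision for a single instance at a single poll can be sketched as a short Python function. This is an illustration based on the parameter definitions in this paper, with one stated assumption: a poor call percentage exactly equal to a threshold is treated as raising the alert. The function name and return values are hypothetical.

```python
def evaluate_instance(total_calls, poor_calls,
                      min_volume=50, warn_percent=10.0, error_percent=12.0):
    """Return "none", "warning", or "error" for one instance at one poll."""
    # Step 1: an instance below the minimum call volume is not alert eligible.
    if total_calls == 0 or total_calls < min_volume:
        return "none"
    # Step 2: compute the poor call percentage for the time window.
    poor_percent = 100.0 * poor_calls / total_calls
    # Step 3: compare against Error first, then Warn.
    # Assumption: a percentage equal to a threshold raises the alert.
    if poor_percent >= error_percent:
        return "error"
    if poor_percent >= warn_percent:
        return "warning"
    return "none"


print(evaluate_instance(49, 49))    # none: below the minimum call volume
print(evaluate_instance(100, 9))    # none: 9% is under both thresholds
print(evaluate_instance(100, 11))   # warning: 11% is between Warn and Error
print(evaluate_instance(100, 12))   # error
```

Checking the Error threshold before the Warn threshold ensures that an instance which qualifies for both produces only the more severe alert.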
Examples of QoE Alert Generation
Let's now walk through some examples of how the alerting algorithm applies to a particular instance (the New York site's
AVMCU_NY_01) of a particular category (A/V Conferencing Server). This example illustrates how the QoE alerting
algorithm works for any instance of any of the alerting categories previously described. For this discussion, we'll use the
default settings for the alerting configuration:
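Frequency of Polling (T) = 15 minutes
Time Window (W) = 120 minutes
Minimum Call Volume (V) = 50 calls
Error threshold (Error) = 12%
Warning threshold (Warn) = 10%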
As we noted, in this example, the alerting algorithm executes with a polling frequency of once every 15 minutes (T = 15
minutes). For illustration purposes, we'll look at the execution and results of the algorithm at 6:00 AM, 10:00 AM, and
4:00 PM.
At 6:00 AM:
All the calls on AVMCU_NY_01 between 4:00 AM and 6:00 AM (W = 2 hours) are counted.
There is a total of 15 calls, which means that AVMCU_NY_01 is not alert eligible. That's because the call count must
be at least 50 (V = 50).
No further analysis takes place for AVMCU_NY_01 until the next polling interval (T = 15 minutes later).
At 10:00 AM:
All the calls on AVMCU_NY_01 between 8:00 AM and 10:00 AM (W = 2 hours) are counted.
There is a total of 89 calls, which means that AVMCU_NY_01 is alert eligible (V = 50).
There is a total of four poor quality calls in the 8:00 AM – 10:00 AM window.
The algorithm calculates the poor quality call percentage as 4.49% (four poor calls divided by 89 total
calls).
The poor quality call percentage (4.49%) is less than both the Error (12%) and Warn (10%) thresholds.
No Operations Manager alerts are generated.
At 4:00 PM:
All the calls on AVMCU_NY_01 between 2:00 PM and 4:00 PM (W = 2 hours) are counted.
There is a total of 162 calls, which means that AVMCU_NY_01 is alert eligible (V = 50).
There is a total of 23 poor quality calls in the 2:00 PM – 4:00 PM window.
The algorithm calculates the poor quality call percentage as 14.20% (23 poor calls divided by 162 total
calls).
The poor quality call percentage (14.20%) is greater than both the Error (12%) and Warn (10%) thresholds.
As a result, an Error level Operations Manager alert is generated.
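The counting step in these examples (selecting only the calls that fall inside the time window before each poll) can be sketched as follows. The call records are synthetic and shaped to match the 4:00 PM example; the record layout, the arbitrary date, and the inclusive window boundaries are assumptions of this sketch.

```python
from datetime import datetime, timedelta


def window_counts(calls, poll_time, window=timedelta(minutes=120)):
    """Count (total, poor) calls that ended in the window before poll_time."""
    start = poll_time - window
    in_window = [poor for ended, poor in calls if start <= ended <= poll_time]
    return len(in_window), sum(in_window)


# Synthetic records (end time, is_poor) shaped to match the 4:00 PM example:
# 162 calls between 2:00 PM and 4:00 PM, 23 of them poor. The date is arbitrary.
day = datetime(2011, 1, 10)
calls = [(day.replace(hour=15), i < 23) for i in range(162)]
# Ten older poor calls that fall outside the window and must be ignored.
calls += [(day.replace(hour=11), True) for _ in range(10)]

total, poor = window_counts(calls, day.replace(hour=16))
percent = 100.0 * poor / total
print(total, poor, f"{percent:.1f}%")  # 162 23 14.2%
```

Because the ten poor calls from 11:00 AM are outside the two-hour window, they have no effect on the 4:00 PM result.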
Deployment Considerations for QoE Alerting
Subnet vs. Location-Based Alerting
Subnets represent the default out-of-the-box mode for network location alerting. All other categories require the
provisioning of network data that maps subnets to user sites and user sites to regions. We recommend that network data
be provisioned any time that QoE alerting is deployed due to the following considerations:
High alert volume due to granularity. Many organizations have a large number of subnets, which can result in a
large number of alerts being generated from the same underlying network issue. For example, if a network
outage affects five different subnets, alerts can be generated from all five subnets.
Scaling in Operations Manager. As noted, many organizations have a large number of subnets. Operations
Manager allows a maximum of 500 instances to be monitored. This means that, if a deployment has more than
500 subnets, not all of those subnets can be monitored for voice quality.
Call volume. Depending on the size and granularity of a subnet, it is quite possible that subnets by themselves
will not have enough call volume to be eligible for alerts.
Subnet-based alerting is the most primitive method of monitoring network locations for voice quality issues and is
subject to the limitations just described. Network location-based QoE alerting at the user site and region levels is
recommended because it provides a less noisy and more actionable view of voice quality. When network locations are
configured, we recommend that you turn off subnet-based alerting.
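The call volume consideration can be illustrated with a small sketch that rolls subnet call counts up to user sites. The subnet addresses, the mapping, and the call counts are all hypothetical, and the sketch assumes for simplicity that each call touches only one of the listed subnets, so the counts can be added without double counting.

```python
from collections import Counter

# Hypothetical mapping of subnets to user sites; in a real deployment this
# comes from the provisioned network data.
SUBNET_TO_SITE = {
    "10.1.1.0/24": "New York",
    "10.1.2.0/24": "New York",
    "10.1.3.0/24": "New York",
    "10.2.1.0/24": "Boston",
}

# Hypothetical call counts per subnet for one time window.
calls_per_subnet = {
    "10.1.1.0/24": 30,
    "10.1.2.0/24": 25,
    "10.1.3.0/24": 20,
    "10.2.1.0/24": 15,
}

MIN_CALL_VOLUME = 50  # V

calls_per_site = Counter()
for subnet, count in calls_per_subnet.items():
    calls_per_site[SUBNET_TO_SITE[subnet]] += count

eligible_subnets = [s for s, n in calls_per_subnet.items() if n >= MIN_CALL_VOLUME]
eligible_sites = [s for s, n in calls_per_site.items() if n >= MIN_CALL_VOLUME]
print(eligible_subnets)  # []
print(eligible_sites)    # ['New York']
```

None of the four subnets is alert eligible on its own, but the New York site, with 75 calls in total, is.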
Time Window (W) vs. Minimum Call Volume (V)
A reasonable minimum call volume (for example, V = 50 or more) prevents a small number of poor quality calls from resulting in an
alert. Without it, a single poor call on an instance that handled only five calls would yield a poor call percentage of 20% and generate noise.
We recommend that you do not set the minimum call volume to less than 50. However, based on the number of calls
you receive, you might want to increase this value.
If an instance being monitored has fewer calls than the minimum call volume at the time it is polled, then any poor quality calls
experienced on that instance will not be considered for alerting purposes. Because of that, QoE alerting is effective only
if as many instances as possible meet the minimum call volume any time they are polled. In other words, if
you get only 20 calls per hour, then a time window of two hours will not be very useful: during those two hours you can
expect to get only 40 calls, which is less than the minimum call volume. Meeting the minimum call volume is typically not
a problem in large deployments with hundreds of calls per hour. However, in smaller deployments, there might not be
enough call volume to actively monitor your infrastructure using the default settings.
To get around the call volume issue, the time window (W), which has a default of 120 minutes, can be extended. This
means that, instead of calls from the last two hours being counted when an instance of infrastructure is polled, calls from
the last three hours or perhaps the last four hours will be counted. (The actual value depends on the value you configure
for the time window.) This helps ensure that the minimum number of calls will be counted, making more instances alert
eligible.
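As a rough planning aid, the relationship between call rate, time window, and minimum call volume can be sketched in Python. The function names are hypothetical, and the calculation assumes a steady call rate, which real traffic rarely has.

```python
import math


def expected_calls(calls_per_hour, window_minutes):
    """Expected number of calls in the window at a steady call rate."""
    return calls_per_hour * window_minutes / 60


def min_window_minutes(calls_per_hour, min_volume=50):
    """Shortest window (in whole minutes) expected to reach the minimum volume."""
    return math.ceil(60 * min_volume / calls_per_hour)


print(expected_calls(20, 120))  # 40.0 calls, below V = 50
print(min_window_minutes(20))   # 150 minutes
```

At 20 calls per hour, the default 120-minute window is expected to see only 40 calls, while a window of 150 minutes or more is expected to reach the minimum of 50.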
Of course, there is a tradeoff to using a longer time window: calls affected by an outage or disruption continue to be counted for
W minutes after they occurred, so QoE alerts can continue to be issued even though the problem has already
been fixed. For example, suppose network equipment goes down at 4:00 PM, leading to congestion that results in 80% of
all the calls being classified as poor quality calls. The network problem is resolved at 4:30 PM. However, with a time
window of two hours, the poor quality calls that happened between 4:00 PM and 4:30 PM will continue to contribute to
QoE alerts until 6:30 PM.
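The arithmetic behind this example is simply the end of the outage plus the time window. A minimal sketch follows; the function name is hypothetical and the date is arbitrary.

```python
from datetime import datetime, timedelta


def last_affected_poll(outage_end, window_minutes=120):
    """Latest poll time whose window still contains calls from the outage."""
    return outage_end + timedelta(minutes=window_minutes)


# The date is arbitrary; only the times of day come from the example.
outage_end = datetime(2011, 1, 10, 16, 30)

print(last_affected_poll(outage_end).strftime("%H:%M"))       # 18:30
print(last_affected_poll(outage_end, 240).strftime("%H:%M"))  # 20:30
```

Doubling the window to four hours would let the same outage influence alerts until 8:30 PM, which is the cost of extending W to gain call volume.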
Corollary: High Minimum Call Volume (V) Allows for More Granular Alerting Thresholds
In large deployments with aggressive service level agreements for voice quality, additional modifications for very
granular alerting should be taken into account. For example, consider this question:
To answer that, assume that exactly 50 calls take place in the time window. The most granular warning or error
percentage threshold we can set is +/- 2%. For example, if we set the threshold to 5%, then the third poor call (out of 50)
will result in an alert. Why? Because two poor calls (out of 50) equals a poor call percentage of 4%; three poor calls
equals a poor call percentage of 6%, which exceeds the 5% threshold. 6% - 4% yields a granularity of 2%.
Now assume that exactly 500 calls take place. In that case, the most granular warning or error percentage threshold we
can set is +/- 0.2%. For example, if we set the threshold to 5%, then the 25th poor quality call that takes place (out of
500) will result in an alert. Why? Because 24 poor calls is a poor call percentage of 4.8%; 25 poor calls results in a poor
call percentage of 5%, which reaches the threshold and triggers an alert. In this example, 5% - 4.8% leaves a granularity
of 0.2%.
That means that a more granular threshold can be monitored without noise as long as we have a larger minimum call
volume configured.
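The granularity calculation generalizes to any call volume: one additional poor call moves the poor call percentage by 100 divided by the number of calls. The sketch below reproduces the two cases above; the function names are hypothetical, and it assumes that a percentage equal to the threshold raises the alert, as in the 500-call example.

```python
import math


def granularity_percent(total_calls):
    """Change in the poor call percentage caused by one additional poor call."""
    return 100.0 / total_calls


def first_alerting_poor_call(total_calls, threshold_percent):
    """Smallest number of poor calls whose percentage reaches the threshold.

    Assumption: a percentage equal to the threshold raises the alert.
    """
    return math.ceil(threshold_percent * total_calls / 100)


for total in (50, 500):
    print(total, granularity_percent(total), first_alerting_poor_call(total, 5))
```

For 50 calls this prints a granularity of 2.0 and an alert on the third poor call; for 500 calls it prints a granularity of 0.2 and an alert on the 25th poor call.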
Media Impeding Factors
External calls are subject to media being routed over the Internet, through the A/V Edge Server, and then on to the
enterprise deployment. Voice quality is going to be impacted by the route the media takes over the Internet, a route that
the administrator cannot control. We recommend that the External parameter remain set to False, which will prevent external calls
from being factored into the poor call quality percentage.
Wireless networks are "lossier" than wired networks because of the variation in the reliability of the physical layer of a
wireless network. That simply means that a call over a wireless network is likely to suffer more degradation in call quality
than a call over a wired network. (It's for this very reason that many enterprises do not support VoIP over their wireless
networks.)
Because of this, wireless calls are excluded from the QoE alerting algorithm by default. This should be changed only if
you want to support VoIP calls over wireless and receive alerts in case of poor call quality.
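The effect of the External, WiFi, and VPN settings can be pictured as a filter applied before calls are counted. The sketch below is an illustration only: the `Call` record and its field names are hypothetical, not the QoE database schema.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical call record; the field names are illustrative."""
    poor: bool
    external: bool = False
    wifi: bool = False
    vpn: bool = False


def calls_for_alerting(calls, include_external=False,
                       include_wifi=False, include_vpn=False):
    """Drop the calls that the External, WiFi, and VPN settings exclude."""
    return [
        c for c in calls
        if (include_external or not c.external)
        and (include_wifi or not c.wifi)
        and (include_vpn or not c.vpn)
    ]


calls = [
    Call(poor=False),
    Call(poor=True, wifi=True),
    Call(poor=True, external=True),
    Call(poor=False, vpn=True),
]
print(len(calls_for_alerting(calls)))                     # 1
print(len(calls_for_alerting(calls, include_wifi=True)))  # 2
```

With the default settings, only the wired internal call is counted, so the two poor calls made over Wi-Fi and from outside the enterprise cannot raise the poor call percentage.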