A Rohde & Schwarz Company

Voice Quality with ITU-T P.863 ‘POLQA’


Application Note
July 2012

SwissQual AG
Allmendweg 8 CH-4528 Zuchwil Switzerland
t +41 32 686 65 65 f +41 32 686 65 66 e info@swissqual.com
www.swissqual.com

Part Number: 12-070-200912-4


SwissQual has made every effort to ensure that any instructions contained in the document are adequate and free
of errors and omissions. SwissQual will, if necessary, explain issues which may not be covered by the documents.
SwissQual’s liability for any errors in the documents is limited to the correction of errors and the aforementioned advisory
services.

Copyright 2000 - 2012 SwissQual AG. All rights reserved.

No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system, or translated
into any human or computer language without the prior written permission of SwissQual AG.

Confidential materials.

All information in this document is regarded as commercially valuable, protected and privileged intellectual property, and is
provided under the terms of existing Non-Disclosure Agreements or as commercial-in-confidence material.

When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo trademark
somewhere in your text.

SwissQual®, Seven.Five®, SQuad®, QualiPoc®, NetQual®, VQuad®, Diversity® as well as the following logos are
registered trademarks of SwissQual AG.

Diversity Explorer™, Diversity Ranger™, Diversity Unattended™, NiNA+™, NiNA™, NQAgent™, NQComm™, NQDI™,
NQTM™, NQView™, NQWeb™, QPControl™, QPView™, QualiPoc Freerider™, QualiPoc iQ™, QualiPoc Mobile™,
QualiPoc Static™, QualiWatch-M™, QualiWatch-S™, SystemInspector™, TestManager™, VMon™, VQuad-HD™ are
trademarks of SwissQual AG.

SwissQual acknowledges the following trademarks for company names and products:

Adobe®, Adobe Acrobat®, and Adobe Postscript® are trademarks of Adobe Systems Incorporated.

Apple is a trademark of Apple Computer, Inc.

DIMENSION®, LATITUDE®, and OPTIPLEX® are registered trademarks of Dell Inc.

ELEKTROBIT® is a registered trademark of Elektrobit Group Plc.

Google® is a registered trademark of Google Inc.

Intel®, Intel Itanium®, Intel Pentium®, and Intel Xeon™ are trademarks or registered trademarks of Intel Corporation.

INTERNET EXPLORER®, SMARTPHONE®, TABLET® are registered trademarks of Microsoft Corporation.

Java™ is a U.S. trademark of Sun Microsystems, Inc.

Linux® is a registered trademark of Linus Torvalds.

Microsoft®, Microsoft Windows®, Microsoft Windows NT®, and Windows Vista® are either registered trademarks or
trademarks of Microsoft Corporation in the United States and/or other countries.

NOKIA® is a registered trademark of Nokia Corporation.

Oracle® is a registered US trademark of Oracle Corporation, Redwood City, California.

SAMSUNG® is a registered trademark of Samsung Corporation.

SIERRA WIRELESS® is a registered trademark of Sierra Wireless, Inc.

TRIMBLE® is a registered trademark of Trimble Navigation Limited.

U-BLOX® is a registered trademark of u-blox Holding AG.

UNIX® is a registered trademark of The Open Group.


Voice Quality with ITU-T P.863 ‘POLQA’ – Application Note
© 2000 - 2012 SwissQual AG

Contents

Voice Quality with ITU-T P.863 ‘POLQA’

1 Introduction – What is P.863 ‘POLQA’?
  More complex telecommunication networks and handsets
  Demand for wideband audio transmission
  Narrowband vs. super-wideband tests

2 Technical Details of POLQA
  POLQA as a model of a subjective listening test
  POLQA in narrow-band and super-wideband mode
  POLQA internal processing steps
    Time Alignment
    Psycho-acoustic model
  POLQA prediction performance and typical results

3 Narrow-band Voice Quality measurements with P.863 ‘POLQA' in Diversity
  Idea of the narrowband test
  Speech reference signals for narrow-band tests
  What are the differences to the previous ITU P.862 ‘PESQ’?
  Test definition and result presentation

4 Wideband Voice Quality measurements with P.863 ‘POLQA' in Diversity
  Idea of the wideband test
  Wideband speech reference signals
  What are the differences to narrowband?
  Where wideband quality can be assessed
  Wideband analysis in Diversity

5 Real field measurements
  Results in GSM / UMTS networks compared to P.862 ‘PESQ’
  Results in real field networks compared to P.862 ‘PESQ’
  Sample dependency of P.863 ‘POLQA’ scores in real field measurements
  Results in real field networks in super-wideband mode of P.863 ‘POLQA’

6 Conclusion

Figures
Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks
Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model
Figure 3: Basic scheme of the main components of P.863 ‘POLQA’
Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences
Figure 6: Example of an aligned pair of reference and degraded signal
Figure 7: Block-scheme of POLQA as in ITU-T P.863
Figure 8: Application of masking slopes to the Bark spectrum
Figure 9: Consideration of fully and partially masked spectral parts
Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Figure 11: Insertion and capturing in a speech test setup
Figure 12: IRS in send and receive direction as specified in ITU-T P.48
Figure 13: P.863 ‘POLQA’ narrowband main result representation in NQDI
Figure 14: P.863 ‘POLQA’ narrowband detail result representation in NQDI
Figure 15: P.863 ‘POLQA’ test selection in NQDI
Figure 16: P.863 ‘POLQA’ statistical report in MS EXCEL
Figure 17: P.863 ‘POLQA’ wideband main result representation in NQDI
Figure 18: P.863 ‘POLQA’ wideband audio bandwidth representation in NQDI
Figure 19: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI
Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions
Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’
Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’
Figure 23: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode and NB mode
Figure 24: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode in wideband networks

Tables
Table 1: Improvement in performance of P.863 ‘POLQA’ to P.862 ‘PESQ’
Table 2: Typical predicted MOS-LQ values for common transmission techniques
Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques
Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high qualitative UMTS/GSM setups
Table 5: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in common real field setups
Table 6: Comparison of different speech samples in common real field setups
Table 7: Comparison of the NB and SWB mode of P.863 ‘POLQA’ in common real field setups

1 Introduction – What is P.863 ‘POLQA’?


SwissQual has driven the development of objective perceptual quality prediction algorithms since the company
was founded in 2000. Its voice quality predictor SQuad was developed immediately afterwards, specifically to
meet the requirements of mobile and Voice-over-IP scenarios, and from the beginning it overcame
disadvantages of ITU-T P.862 ‘PESQ’ in these application areas. SQuad still forms the backbone of
SwissQual’s entire voice quality suite. To keep up with advances in network and processing technologies,
SQuad has been continuously maintained and improved to deliver precise quality scores to the customer.
In 2005, ITU-T started a project to standardize a new objective voice quality model. This project, called
P.OLQA, was intended to extend the scope of the existing ITU-T P.862 ‘PESQ’ and to overcome its known
disadvantages and problems.
The P.OLQA project concluded in 2010 with a competition between six candidate models, including the
latest SQuad algorithm. In a detailed analysis based on more than 45’000 speech files, the SQuad algorithm
was selected as one of the winning models that passed the challenging thresholds set by ITU-T.
Together with the two other selected models, from Opticom and TNO, SQuad was integrated into a joint
model, ‘POLQA’, which combines the strengths of the three underlying algorithms and forms the new ITU-T
P.863 ‘POLQA’, approved in January 2011.
SwissQual is one of the most active drivers of the development of objective measures in international
standardization bodies. SwissQual leads the corresponding working group at ITU-T and has initiated and
established several standards in recent years, such as ITU-T P.563 (a no-reference voice quality
measure), ITU-T J.341 (a full-reference measure for HDTV), ETSI TR 102 506 (a method for an ‘Estimation
of Quality per Call’) and now ITU-T P.863 ‘POLQA’.
POLQA is becoming a central part of SwissQual’s voice quality analysis suite and will be the
recommended voice quality predictor for both narrowband and wideband speech.
The existing and widely deployed SQuad algorithm remains a part of Diversity and can still be used if
desired. SQuad can also still be combined with the previous ITU-T P.862 ‘PESQ’. This allows customers to
continue their ongoing measurement campaigns and to plan a transition to ITU-T P.863 ‘POLQA’ on their
own schedule.

More complex telecommunication networks and handsets


Telecommunication networks are increasingly equipped with highly non-linear components, and long-distance
calls usually pass through several such components and even through different networks.
Today, speech quality is no longer determined by the speech codec or by lost frames alone. Other
components also interact: they perform automatic signal level adjustment, smart loss concealment and similar
strategies to increase intelligibility in critical situations. Unfortunately, such components are rarely used
just once in a connection; usually there are several of them, and they can interfere with each other.
In addition to that, we also saw some progress in the standardization of speech codecs, with recently
standardized coding schemes now being integrated in the networks. These new schemes, such as EVRC
and EVRC-B used in CDMA networks as well as wideband coding schemes such as AMR-WB and
EVRC-WB were considered from the very beginning of the development of the latest SQuad version and are
now covered by POLQA as well. In addition to traditional schemes for voice source coding, audio
compression methods (e.g. MP3, AAC, …) are increasingly being used in telecommunication services as
well.
Besides more complex handsets for traditional telephony applications, new packet-based transmission
technologies and multi-stage connections are becoming widespread. This leads to many new distortion
types that did not appear in traditional, quasi circuit-switched networks. We will see an increasing amount of
time-warping and asynchronous re-sampling effects as well as non-exact replacements of missed packets or
speech frames. All of this considerably changes the physical signal without necessarily affecting its
perceived quality. PESQ has a clear shortcoming in rating these types of signal distortions correctly;
POLQA resolves it.

Demand for wideband audio transmission


The telecom industry is now beginning the evolution from narrowband telephony to wideband speech
transmission. The wideband codecs are ready and approved by the standardization bodies, the
handsets are no longer restricted in processing power and the core networks are being upgraded.
Narrowband speech has been the normal experience for telephone users and has been accepted for
decades. As the mobile telephone becomes an increasingly multimedia-based device, however, the traditional
telephone “sound” seems less and less acceptable. Consumer expectations are changing while, at
the same time, increased processing power allows wider audio bandwidths. The standardization bodies
provide the corresponding coding schemes and the core networks are being upgraded. The first step,
wideband transmission up to 7000Hz, is already being overtaken by the emergence of so-called
super-wideband transmission technology, which opens the band up to 14000Hz, and by the family of
audio codecs including MP3 and AAC, which even allow transmission beyond the upper frequency limit of
human hearing.
SwissQual upgraded the entire audio processing chain in its products for wideband audio transmission
in 2009, and ‘wideband’ test applications using the SQuad algorithm have been available in SwissQual
products since early 2010.
SwissQual’s Diversity platform was the first measurement platform for mobile network testing and
benchmarking to integrate a real-time test of wideband speech. SwissQual has therefore been able to gain
the most experience with wideband testing in the field and can use this know-how for a smooth and
transparent integration of P.863 ‘POLQA’ into its products.

Narrowband vs. super-wideband tests


P.863 ‘POLQA’ offers the same two operational modes supported by SQuad: one for traditional narrowband
telephony and one for super-wideband (which translates to no audio bandwidth limitation in practice).
Choosing the operational mode is less a matter of how the channel is set up. The question is rather to which
kind of reference the received signal shall be compared and what is inserted into the channel.
For narrowband tests, the reference signal is a perfect signal in telephony bandwidth (in principle 300 –
3400Hz). This signal is inserted into the channel as well. The resulting MOS prediction gives a quality that is
relative to the narrowband signal. This means that a perfectly transmitted signal will be scored close to
MOS = 5 (practically 4.5). A potential wideband or super-wideband capability of the channel is not tested,
since the input signal is just narrowband.
This narrowband test case has been well defined for years; it also allows compatibility with existing measures
such as P.862 ‘PESQ’ and SQuad in narrowband mode. The entire quality scale is used for these narrowband
conditions.
In super-wideband mode, the reference signal is a perfect signal in almost full bandwidth (50 –
14’000Hz), and this full-bandwidth signal is inserted into the channel. The resulting MOS prediction gives a
quality that is relative to the super-wideband signal. This means that only a perfectly transmitted
super-wideband signal will be scored close to MOS = 5 (practically 4.75). If the actual channel is ‘just’
narrowband, the channel limits the audio bandwidth of the super-wideband signal and the score
relative to the super-wideband reference is correspondingly lower.
This super-wideband mode should be used in all scenarios where super-wideband or wideband systems are
to be compared against each other or qualified against a narrowband network.

2 Technical Details of POLQA

POLQA as a model of a subjective listening test


ITU-T P.863 ‘POLQA’ is a so-called full-reference model. The quality estimation is based on the comparison
of the transmitted signal with the high quality reference signal.

[Diagram: the high quality speech signal passes through the transmission channel into the full-reference
measurement, which also receives a copy of the high quality speech signal.]

Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks


The basic approach is the same as that of other measures such as SQuad: the received and potentially
degraded signal is compared with an undistorted reference signal. This allows a very detailed
analysis of any kind of difference between the two signals. To take human perception into account, a
model of the listening device (e.g. a handset or a headphone) is applied first, so that the analysis uses
the signal as it would be heard through such a device.

[Block diagram: the distorted signal and the reference signal each pass through a model of the device
(e.g. handset) and a psycho-acoustic model (frequency and intensity warping, masking). A distance / similarity
measure between the two results feeds a cognitive model, which outputs MOS-LQO.]

Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model


The more important step, however, is the application of a psycho-acoustic model that transforms the signal
into an internal sound representation, taking frequency warping, intensity warping and masking
effects into account. In this internal representation, the differences between the degraded and reference
speech are calculated. These correspond to the differences that would be perceptible in a direct subjective
comparison. Since speech perception and recognition involve more than just listening to sound stimuli, a
cognitive model is the last step of the quality prediction. Here individual distortions are weighted according
to speech perception. For example, in the case of the human voice, a listener is more tolerant of certain
distortion types as long as they can be considered ‘natural’, even if they differ significantly from the
reference signal.
POLQA predicts the voice quality as it is perceived in an ITU-T P.800 subjective listening-only test (LOT).
These are the most widely used listening tests in telecommunications. A listener scores the quality of a
presented voice sample on the so-called Absolute Category Rating (ACR) scale from 1 (bad) to 5 (excellent).
The listener does not compare the signal directly to a reference but to an internal reference,
i.e. his or her expectation of ‘how it should sound if it were perfect’.
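
The score that POLQA predicts corresponds to a Mean Opinion Score, i.e. the average of the ACR votes that a
listener panel gives to one sample. The following minimal sketch shows that averaging step only; the votes
and the function name are hypothetical and serve purely as an illustration.

```python
from statistics import mean

# ACR category labels as used in ITU-T P.800 listening-only tests.
ACR_LABELS = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}


def mean_opinion_score(ratings):
    """Return the arithmetic mean of ACR ratings (each an integer 1..5)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if r not in ACR_LABELS:
            raise ValueError(f"invalid ACR rating: {r!r}")
    return mean(ratings)


# Hypothetical votes from eight listeners for one speech sample.
votes = [4, 4, 3, 5, 4, 3, 4, 4]
mos = mean_opinion_score(votes)
print(f"MOS = {mos:.2f}")
```

An objective model such as POLQA replaces the listener panel: it estimates this average directly from the
reference and the degraded signal.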

POLQA in narrow-band and super-wideband mode


Please note that POLQA has two different operational modes, one for narrowband and one for super-
wideband. The structure of the model is the same; the main difference is: ‘What is the reference?’
In narrowband mode the reference signal is a narrowband (telephony band) signal. This simply means that
all degradations are rated in relation to this reference. The assumed listening situation is – corresponding to
narrowband telephony – an ordinarily shaped handset on one ear.
The band limitation of the telephony band itself is not considered a degradation, since the listener
expects exactly that limited signal; POLQA therefore compares against a telephony-band signal as well. In
addition, POLQA in narrowband mode does not take distortions into account that lie outside the spectral range
of a telephone handset; frequencies below 200Hz are usually not transmitted at all.
As a consequence, a signal that is ‘only’ band-limited but otherwise undistorted receives a high POLQA score
in this context. A perfect narrowband signal will receive a POLQA score of 4.5. The narrowband mode of POLQA
and its maximum value of 4.5 make it backwards compatible with ITU-T P.862 ‘PESQ’ and with SQuad
with respect to the use of the scale and the targeted telephony test scenario.
Additionally, POLQA supports super-wideband signals in a super-wideband operational mode. Here
the reference signal that POLQA compares against is super-wideband. In this mode, POLQA scores
like a human listener wearing headphones and expecting HiFi quality. Consequently, to obtain a high
POLQA score the signal needs to be clean and almost unlimited in bandwidth. An ideal
super-wideband signal receives a POLQA score of 4.75.
It is very important to understand that a narrowband telephony signal scored in super-wideband mode will
get significantly lower scores than it would get in narrowband mode. This is logical, since due to the
comparison to a super-wideband reference, all missing spectral components are considered as distortions.
That is the same as listening to telephony speech through a headphone: the listener expects HiFi quality, but
only telephony bandwidth is received.
ITU-T P.863 ‘POLQA’ is the first model recommended by ITU-T for super-wideband speech. In the context of
speech, super-wideband can even be considered ‘unlimited’, since speech has no relevant
components above 14’000Hz.
An interesting intermediate case is traditional wideband, which is limited to 7’000Hz. However, wideband is
a term that has been used in different ways. The first trials with extended audio bandwidths started in
the 1980s and opened the band up to 7000Hz (using a sampling frequency of 16kHz). The lower
band limit was sometimes 50Hz, sometimes 100Hz. An early coding standard was based on an ADPCM
scheme (ITU-T G.722) and remained untouched for many years. Now wideband speech transmission is
coming to mobile networks, enabled by AMR-WB and EVRC-WB. The upper limit is still
7’000Hz; the lower end, however, is limited only by the electro-acoustic components in the mobile phones.
These traditional wideband scenarios will be scored by POLQA in its super-wideband mode as well. This
means that the wideband signal is compared against a super-wideband reference. Consequently, the parts
above 7’000Hz are missing in the signal, leading to a measured degradation. However, the parts above
7’000Hz only contribute to a lesser extent to the perceived speech quality. Therefore, an ideal signal just
limited at 7’000Hz will be scored by POLQA at around 4.5, making it backwards compatible with ITU-T
P.862.2 ‘PESQ-WB’.
It is important to note that a traditional wideband channel must be scored in POLQA super-wideband mode.
It is not in line with ITU-T P.863 to use a 7’000Hz reference signal.
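
The mode rules described in this section can be summarized in a small lookup. The sketch below is only an
illustration: the class and function names are hypothetical and not part of any SwissQual or ITU-T interface.
The band edges and ideal scores are the values quoted in this application note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolqaMode:
    name: str
    reference_band_hz: tuple  # (lower, upper) band edge of the reference signal
    ideal_score: float        # score of an ideal signal, as quoted in this note


# Values taken from the text of this application note.
NARROWBAND = PolqaMode("narrowband", (300, 3400), 4.5)
SUPER_WIDEBAND = PolqaMode("super-wideband", (50, 14000), 4.75)


def mode_for_test(test_type):
    """Return the operational mode for a given kind of test.

    Hypothetical helper: narrowband telephony tests use the narrowband
    mode; traditional wideband (7000 Hz) and super-wideband tests are
    both scored against the super-wideband reference.
    """
    if test_type == "narrowband":
        return NARROWBAND
    if test_type in ("wideband", "super-wideband"):
        return SUPER_WIDEBAND
    raise ValueError(f"unknown test type: {test_type!r}")


for kind in ("narrowband", "wideband", "super-wideband"):
    mode = mode_for_test(kind)
    print(kind, "->", mode.name, mode.reference_band_hz)
```

Note that there is deliberately no separate wideband entry: a 7’000Hz channel is scored in super-wideband
mode, as stated above.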

POLQA internal processing steps


The following block scheme provides a brief overview of POLQA. It distinguishes between the time
alignment part and the psycho-acoustic (‘perceptual’) and cognitive models.
[Block diagram: the reference input and the degraded output each pass through idealization and the perceptual
model, giving an internal representation of the ideal and an internal representation of the output. After
space / time alignment, the difference in internal representation is passed to the cognitive model, which
yields the quality.]

Figure 3: Basic scheme of the main components of P.863 ‘POLQA’

Time Alignment
Why does POLQA perform time alignment?
POLQA and other objective measures with the same basic structure compare the (spectral) short-term
characteristics of the reference signal and the degraded signal frame by frame. The alignment marks
corresponding sections in both signals; only then can the correct frames be compared with each other.
What makes it challenging?
Aligning two signals is simple if the delay between them is constant and the transmission is linear: only
a fixed offset has to be compensated. Asynchronous devices (clock drift) are more complicated, since they lead
to a steadily increasing or decreasing ‘delay’. Here the compensation is not constant, but it at least
changes linearly over time. Even more challenging are processing components that transmit individual
parts of the signal with different delays. These can stretch or compress speech pauses, but also
speech parts. The stretching or compressing can either preserve the
pitch or simply ‘warp’ the entire signal part.
In all these cases, each individual short frame of the degraded signal (usually 32ms in length) has to be
assigned to a corresponding frame in the reference signal.
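
The simplest case named above, a constant delay, can be illustrated with a brute-force cross-correlation
search, together with the splitting of a signal into short frames. This is a minimal sketch and not the
alignment used in P.863: the function names, the 50 % frame overlap and the 8 kHz sampling rate are
assumptions made for the example.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=32, overlap=0.5):
    """Split a signal into overlapping frames of frame_ms milliseconds.

    The 50 % overlap is an assumption for this sketch.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    hop = max(1, int(frame_len * (1.0 - overlap)))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def estimate_constant_delay(reference, degraded, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means the degraded signal is delayed relative to
    the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, ref_sample in enumerate(reference):
            m = n + lag
            if 0 <= m < len(degraded):
                score += ref_sample * degraded[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: a short pulse pattern delayed by 5 samples.
reference = [0.0] * 10 + [1.0, -0.5, 0.8, -0.3] + [0.0] * 20
degraded = [0.0] * 5 + reference[:-5]
delay = estimate_constant_delay(reference, degraded, max_lag=10)
print("estimated delay in samples:", delay)

# 1024 samples at 8 kHz, split into 32 ms frames (256 samples each).
frames = split_into_frames([0.0] * 1024, sample_rate_hz=8000)
print("number of frames:", len(frames))
```

A single global lag like this breaks down as soon as the delay drifts or jumps within the file, which is why
a piecewise assignment of signal parts is needed.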
How can it be done in a robust and fast way?
First, POLQA identifies signal parts where the delay can be assumed to be constant and flags them as
‘landmarks’. These parts can be of different lengths; in the simplest case a single part covers the entire
signal (if the delay is constant over the entire file).

[Diagram: signal parts of the REFERENCE and the PROCESSED signal linked by correspondences, each with a
confidence value.]

Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
In a second step, the areas between these landmarks are analyzed. To do so, the signal is successively
subdivided into a series of smaller parts. Each part is assigned a corresponding part in the other
signal.
Each assigned signal part is given a value that rates the confidence of the assignment. In less confident
areas a wider signal range is analyzed, whereas the correspondences of parts with high
confidence are considered fixed.
This approach allows a very efficient and robust search, since the search range becomes more and
more restricted as more landmarks are set. The result is a kind of matrix of corresponding signal parts and
associated search ranges.

Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences
A Viterbi-like algorithm then calculates the most likely ‘path’ through this matrix and fixes the corresponding
signal parts.
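
The idea of selecting the most likely path through such a matrix can be sketched with a small dynamic
programming routine. This is a simplified stand-in and not the algorithm specified in P.863: the candidate
structure, the scoring rule (confidence minus a penalty for delay jumps) and all names are assumptions made
for the illustration.

```python
def most_likely_delay_path(candidates, jump_penalty=0.01):
    """Pick one delay per segment, maximizing total confidence minus
    a penalty for delay changes between neighboring segments.

    candidates: list (one entry per segment) of lists of
                (delay_ms, confidence) tuples.
    """
    # best[i][k] = best total score of any path ending in candidate k
    # of segment i; back[i][k] = index of its predecessor candidate.
    best = [[conf for _, conf in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for delay, conf in candidates[i]:
            scores = [
                best[i - 1][j] - jump_penalty * abs(delay - prev_delay)
                for j, (prev_delay, _) in enumerate(candidates[i - 1])
            ]
            j_best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[j_best] + conf)
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the best final candidate.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k][0])
        k = back[i][k]
    return path[::-1]


# Hypothetical candidates: the middle segment has a locally more
# confident match at 250 ms that does not fit its neighbors.
candidates = [
    [(100, 0.9)],
    [(100, 0.6), (250, 0.7)],
    [(100, 0.8), (110, 0.5)],
]
path = most_likely_delay_path(candidates)
print("selected delays in ms:", path)
```

In this toy case the isolated 250 ms candidate is rejected even though its local confidence is higher,
because the path as a whole is scored rather than each segment on its own.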
The end result of the time alignment step is a correspondence table with the start and end times of each
signal part and of its counterpart in the reference. Parts of the degraded signal with no counterpart in
the reference (i.e. inserted or added parts), as well as parts of the reference signal that are missing in the
degraded signal, are marked as well. The following signal graph illustrates a practical example. The upper
graph shows the (complete) reference signal, the lower graph shows the received and degraded signal.

Figure 6: Example of an aligned pair of reference and degraded signal

The green areas denote signal parts assigned with high confidence; the blue ones are those with lower
confidence. The red area indicates a part of the reference signal that was lost during transmission and
is no longer present in the degraded signal. Unassigned silent parts (white) are not used for direct
comparison but for an analysis of how annoying the noise floor in these parts is.

Psycho-acoustic model
Like other models that follow the same basic approach, the POLQA psycho-acoustic model starts
with a global level alignment followed by a frame-wise spectral analysis of overlapping frames. As is usual in
these models, a short-term level scaling is applied as well, and a cosine-based window followed by
an FFT converts the audio signal from the time domain to the spectral domain.
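
The windowing and transform step can be sketched as follows. The example assumes a Hann window as the
cosine-based window and, to stay dependency-free, uses a direct DFT in place of an FFT (both give the same
spectrum); the function names, the 8 kHz sampling rate and the test tone are assumptions for illustration.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann (raised-cosine) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a direct DFT.

    A direct DFT is O(n^2); a real implementation would use an FFT,
    which yields the same values.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# A 1 kHz tone sampled at 8 kHz, one 32 ms frame (256 samples).
sample_rate_hz = 8000
frame = [math.sin(2.0 * math.pi * 1000 * i / sample_rate_hz)
         for i in range(256)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate_hz / len(frame)
print("spectral peak at", peak_hz, "Hz")
```

Each frame of the reference and of the degraded signal is transformed in this way before the perceptual
processing steps are applied.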
The block scheme of the POLQA psycho-acoustic model is shown in the figure below.

[Block diagram with two parallel paths, one for the reference input and one for the degraded output.
Blocks in the reference path: scaling, idealization, windowed FFT, frequency warping to pitch scale,
frequency response compensation, masking, intensity warping to loudness scale, partial local and global
scaling, noise suppression.
Blocks in the degraded path: scaling towards playback level, windowed FFT, frequency warping to pitch scale,
masking, intensity warping to loudness scale, partial local and global scaling, noise suppression.
Side branch: estimation of frequency response, noise and reverb, giving the indicators FRQ, NOI and RVB.
Joint blocks: perceptual subtraction, asymmetry processing, Lp time integration.
Resulting indicators: FRQ (spectral shaping, band limitation), NOI (stationary and switched noises),
disturbance indicators ‘Da’ and ‘D’ (disturbances in speech), RVB (room reverberations).
Cognitive model:
- Combination of individual indicators
- Training on subjective reference scores
- Mapping into MOS scale
Output: predicted listening quality MOS-LQO.]

Figure 7: Block-scheme of POLQA as in ITU-T P.863


The basic approach of the psycho-acoustic model, which means the use of critical bands and the loudness
compression, looks similar to well-known state-of-the-art models.

However, there are three parts that make P.863 ‘POLQA’ different from established standards such as
P.862 ‘PESQ’:
- Removal or reduction of individual distortion types and their separate consideration
- Idealization of the reference signal
- Sharpened loudness spectra

Removing / Reduction of individual distortion types


It is too simplistic to assume that every difference in the signal, after transformation to an internal
psycho-acoustic representation, will be weighted correctly by this purely physiological view. Some kinds of
distortions are not well covered by the established psycho-acoustic models. Examples are
certain linear distortions, such as so-called frequency responses (leading to a colored or
shaped spectral distribution), echoes and reverberations, as well as strong additive background noises.
The reason for these shortcomings becomes clearer when we look at how psycho-acoustic models were
designed and evaluated. These models mainly describe the spectral integration into critical bands (due to the
so-called frequency-to-place transformation on the basilar membrane in the human ear), the sensitivity in
different spectral areas, the non-linear perception of intensity as well as spectral and temporal masking
effects. These models were largely developed and validated with signals like sine waves and narrow-band
noises. They do not include any assumptions about speech recognition. For example, an echo, which creates a
distortion through a slightly delayed repetition of the same speech, cannot be distinguished from a pure noise of
the same intensity using this approach.
Therefore, it makes sense to detect and quantify those distortion types prior to the application of the
psycho-acoustic model. The long-term frequency response is calculated and compensated in the signals. An
indicator ‘FRQ’ is calculated separately and considered in the final MOS prediction. The same applies to
background noises. They are measured and largely removed from the signal. The amount of noise is later
considered through a ‘NOI’ indicator. In a similar way, echoes and reverberations are quantified and used to
correct the final predicted MOS.
By applying these corrections, the signal is now much closer to a signal to which the psycho-acoustic model
can be applied. It has been freed of spectral shaping and strong noises.
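The idea of estimating a long-term frequency response, compensating it and keeping a separate indicator can be sketched as below. This is an illustration under stated assumptions, not the P.863 procedure: the per-band power frames are taken as given, the degraded signal is the one being equalized, and the scalar returned by `frq_indicator` is a hypothetical stand-in for the real ‘FRQ’ indicator.

```python
import math


def long_term_response_db(ref_frames, deg_frames, floor=1e-12):
    """Per-band ratio of long-term average powers in dB (degraded relative to reference)."""
    n_bands = len(ref_frames[0])
    response = []
    for b in range(n_bands):
        ref_avg = sum(frame[b] for frame in ref_frames) / len(ref_frames)
        deg_avg = sum(frame[b] for frame in deg_frames) / len(deg_frames)
        response.append(10.0 * math.log10((deg_avg + floor) / (ref_avg + floor)))
    return response


def compensate(deg_frames, response_db):
    """Undo the estimated response in every degraded frame (assumed direction)."""
    gains = [10.0 ** (-r / 10.0) for r in response_db]
    return [[p * g for p, g in zip(frame, gains)] for frame in deg_frames]


def frq_indicator(response_db):
    """Hypothetical scalar: r.m.s. deviation of the response from its mean, in dB."""
    mean = sum(response_db) / len(response_db)
    return math.sqrt(sum((r - mean) ** 2 for r in response_db) / len(response_db))


# Made-up data: two frames with four bands each; the channel tilts the spectrum.
ref = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
tilt = [1.0, 0.5, 0.25, 0.125]
deg = [[p * t for p, t in zip(frame, tilt)] for frame in ref]

resp = long_term_response_db(ref, deg)
flat = compensate(deg, resp)
```

After compensation, `flat` is numerically equal to `ref` again, while `frq_indicator(resp)` stays non-zero, so the spectral shaping is not lost but carried separately.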

Idealization of the reference signal


A truly new element compared to existing and established methods is the so-called idealization of the
reference signal. The
idea is to remove slight distortions such as noises and to align the spectral shape and timbre towards an
ideal. This makes sense, since a listener in a scoring situation such as a P.800 ACR test does not compare the
degraded signal to the input signal (the reference actually used), but rather to a conception of how that talker
should sound. This step is modeled for the first time in P.863 ‘POLQA’.
Common objective models compare the signal to be scored with a reference and weight all (perceptually
relevant) differences as distortion. As a consequence, if the signal for scoring is identical to the reference
(totally transparent transmission or just a copy of the reference signal), no differences will be found and the
predicted MOS is at the maximum (e.g. 4.5 for narrow-band).
POLQA is different. The internal representation of the ideal reference signal is not equal to the internal
representation of the signal used as reference. This means that a non-optimal reference, e.g. one with a low
noise floor, will have that noise removed. If POLQA gets this (noisy) reference signal for scoring, it compares
that signal with the internally calculated ‘ideal’ and may provide a score lower than the maximum. POLQA is
not a model that rates differences between two signals; it rates an absolute quality and uses the idealized
reference as an upper bound. Absolute quality is the difference to an imagined or expected ideal, just like in
subjective absolute category rating tests.
This may be confusing for a technical user, but POLQA is simply a consistent model of the subjective listening
test. In a listening test, too, listeners will never give a very high score to a signal that is a bit noisy.

Sharpened loudness spectra


The usual approach for transforming a signal to an internal representation is:
1. Application of a time-to-frequency transformation by estimation of a short-term spectral power
density. This can be done by a filter-bank or a Fourier transformation. POLQA uses an FFT approach
after applying a Hann window. The window length is 32 ms with 50% overlap.
2. Subdivision of the spectrum into bands. This is usually motivated by the – in principle – logarithmic
perception of frequency. This may be done in a simplified manner using third-octave bands or – as is
more common and used in POLQA too – using critical bands according to Zwicker. At the end of this
aggregation, the hearing spectrum is sub-divided into 24 bands with increasing bandwidth towards
higher frequencies. For each band a power or intensity value is computed. This frequency scale
using critical bands is known as the ‘Bark’ scale.
3. Transformation of the intensity to a perceived loudness scale. Basically, the intensity is compressed at
higher sound intensities, similarly to a decibel scale. In addition, the varying sensitivity at different
frequencies is taken into account. Intensities below the hearing threshold are discarded as well. This
loudness scale is known as the ‘Sone’ scale.
Of course, all objective models following this approach will apply the standard range of signal compensations
in addition to the plain psycho-acoustic transformation. These compensations include further and individual
short-term level scaling, spectral compensation and weighting functions. However, the psycho-acoustic
models almost always follow the approach outlined above.
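The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the P.863 implementation: the Hz-to-Bark conversion uses the well-known Zwicker/Terhardt approximation, while the band assignment by integer Bark value, the power-law exponent of 0.23 and the fixed threshold are assumptions chosen for the example. The small radix-2 FFT requires a frame length that is a power of two, which holds for 32 ms frames at 8 kHz or 16 kHz.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def hz_to_bark(f):
    """Zwicker/Terhardt approximation of the critical-band rate."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)


def frames(signal, fs, length_ms=32, overlap=0.5):
    """Step 1a: split the signal into overlapping frames (32 ms, 50% overlap)."""
    n = int(fs * length_ms / 1000)
    hop = int(n * (1.0 - overlap))
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, hop)]


def bark_power(frame, fs, n_bands=24):
    """Steps 1b and 2: windowed FFT, then sum the bin powers into critical bands."""
    n = len(frame)
    spectrum = fft([s * w for s, w in zip(frame, hann(n))])
    bands = [0.0] * n_bands
    for k in range(n // 2 + 1):
        z = hz_to_bark(k * fs / n)
        bands[min(int(z), n_bands - 1)] += abs(spectrum[k]) ** 2
    return bands


def loudness(bands, threshold=1e-6, exponent=0.23):
    """Step 3: discard values below a threshold, compress the rest (assumed power law)."""
    return [(p ** exponent) if p > threshold else 0.0 for p in bands]


fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(fs)]  # 1 s, 1 kHz
frame_list = frames(tone, fs)
sone = loudness(bark_power(frame_list[0], fs))
peak_band = sone.index(max(sone))
```

For the 1 kHz test tone the energy ends up in the band with index 8, which matches a critical-band rate of about 8.5 Bark at 1 kHz.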
Common models such as P.862 ‘PESQ’ apply the spectral masking thresholds directly to this internal
representation. The result is a so-called smeared spectrum. In principle this is modeling the self-masking
effects of the signal. That means that quieter parts are masked by louder parts at neighboring frequencies.
This effect is widely used and described for audio coding as in e.g. MP3, where masked parts of the signal
are considered redundant and are not transmitted. Furthermore, the quantization noise is shaped such as to
be masked by the signal and thus not to add perceptible distortions.
In POLQA this approach was revised, since we are less interested in the self-masking effects of the signal
but rather in the perception of remaining (or unmasked) differences between two signals. The chosen
approach can be imagined – compared to a ‘smeared loudness spectrum’ – as a ‘sharpened loudness
spectrum’.
In a first step the masking slopes are calculated (Figure 8):
[Two panels showing loudness (Sone) over frequency (Bark)]

Figure 8: Application of masking slopes to the Bark spectrum


The second step consists of analyzing which parts of the spectrum are masked by the masking slopes of other
parts, either fully or partially (Figure 9):

[Two panels showing loudness (Sone) over frequency (Bark); spectral parts are labeled as unmasked,
partially masked and masked]

Figure 9: Consideration of fully and partially masked spectral parts


In a third step the fully masked spectral parts are removed and the partially masked parts are reduced in
their loudness (Figure 10):

[Two panels showing loudness (Sone) over frequency (Bark); spectral parts are labeled as unmasked,
partially masked and masked]

Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Finally, we get a loudness spectrum that represents the individual spectral parts as they contribute to
perception. This means that fully masked parts are taken out, while partially masked parts are attenuated.
These modified spectra of the reference and the degraded signal are then compared and differences are
considered as ‘perceptible’ differences. The big advantage of the ‘sharpened’ approach is that the high
spectral resolution is preserved in the analysis, as required e.g. for a valid
qualitative assessment of the reproduction of fine spectral structures in upper bands by compression
algorithms.
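The three steps (masking slopes, classification, removal or attenuation) can be illustrated with a toy loudness spectrum. This is a sketch under assumptions: the geometric slope factors, the `partial_margin` and the linear attenuation rule are hypothetical choices made for readability and are not the rules defined in P.863.

```python
def masking_threshold(spectrum, slope_up=0.6, slope_down=0.3):
    """Strongest masking contribution each band receives from all other bands.

    A masker decays geometrically with band distance; the decay towards higher
    bands is assumed to be slower (upward spread of masking).
    """
    n = len(spectrum)
    thresholds = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            slope = slope_up if i > j else slope_down
            thresholds[i] = max(thresholds[i], spectrum[j] * slope ** abs(i - j))
    return thresholds


def sharpen(spectrum, partial_margin=2.0):
    """Remove fully masked bands and attenuate partially masked ones."""
    result = []
    for value, thr in zip(spectrum, masking_threshold(spectrum)):
        if value <= thr:
            result.append(0.0)  # fully masked: removed
        elif value < partial_margin * thr:
            # partially masked: reduced, continuous at both ends of the range
            result.append(value * (value - thr) / ((partial_margin - 1.0) * thr))
        else:
            result.append(value)  # unmasked: kept unchanged
    return result


toy_spectrum = [0.1, 4.0, 1.0, 2.0, 0.2, 0.0]  # made-up loudness values per band
sharpened = sharpen(toy_spectrum)
```

In this example the dominant band (index 1) is kept unchanged, its direct neighbors are removed, and the band at index 3 is only attenuated. Unlike a smeared spectrum, the bands that survive keep their position, which is the point made above about preserving spectral resolution.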

Aggregation and Cognitive Effects


The steps above are performed for a short-term spectral comparison across all frames in the speech signal.
At the end of this ‘main loop’ across all frames, we have a quality indicator for each frame window. This
quality indicator is based on the differences of the short-term representations of the reference and degraded
signal. It is dimensionless, but may be imagined as a kind of signal similarity over time.
These individual quality indicators represent the quality at a certain point in the audio signal. They are
aggregated over all frames. This aggregation is not a plain averaging; it contains non-linear
weightings and slopes in time. These aggregated quality indicators are then weighted and corrected using
the previously calculated descriptors for spectral shaping (FRQ), additive noises (NOI) and echoes and
reverberations (RVB).
Finally, the aggregated overall quality indicator is mapped to the MOS scale. In the narrowband operational
mode the indicator is mapped to a range from 1.0 to 4.5, in case of super-wideband to a range of 1.0 to 4.75.
The upper bound represents the typical maximal MOS obtained in subjective listening tests.
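A non-linear aggregation and the final range mapping can be sketched as follows. The Lp norm is one common form of non-linear aggregation (the block scheme in Figure 7 names an ‘Lp time integration’); the exponential mapping function and its `steepness` parameter are purely hypothetical. Only the output ranges, 1.0 to 4.5 for narrowband and 1.0 to 4.75 for super-wideband, are taken from the text.

```python
import math


def lp_aggregate(values, p):
    """Lp norm over frames; p > 1 gives short bursts more weight than a plain mean."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


MOS_RANGE = {"NB": (1.0, 4.5), "SWB": (1.0, 4.75)}


def map_to_mos(disturbance, mode="NB", steepness=1.0):
    """Hypothetical monotone mapping: no disturbance gives the upper bound of the range."""
    low, high = MOS_RANGE[mode]
    return low + (high - low) * math.exp(-steepness * disturbance)


burst = [0.0, 0.0, 0.0, 4.0]          # made-up per-frame disturbance values
plain_mean = lp_aggregate(burst, 1)   # 1.0
emphasized = lp_aggregate(burst, 2)   # 2.0
```

The same single disturbed frame weighs twice as much with p = 2 as with a plain average, which is why a short but severe degradation can dominate the overall score.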

POLQA prediction performance and typical results


The main trigger for the development of ITU-T P.863 was that the existing P.862 ‘PESQ’ did not cover
today’s quality variations in telecommunication networks. The decreasing accuracy of the prediction
performance of P.862 ‘PESQ’ required an evolved voice quality prediction model to cope with
- New types of speech codecs and codecs not yet used in telecommunications, e.g. audio codecs
- Enhanced frame loss concealment techniques
- Voice Quality Enhancement (VQE) systems, non-linear processing for increasing intelligibility
- Re-sampling, time-warping

In addition, the P.863 development was to extend the scope of P.862, mainly by
- Extension to super-wideband (50 to 14’000 Hz)
- Qualitative prediction of intermediate bandwidth, changes in audio bandwidth, bandwidth extension
- Acoustical ‘interfaces’, echoes, reverberations
- Sound presentation level

Due to the wide scope of P.863, the development and evaluation required a huge amount of test data. Test
data here means speech samples covering this variety of degradations, scored by human listeners in defined
subjective experiments. In the end, a total of 62 subjectively scored data
sets containing more than 45’000 voice samples were used for the evaluation of P.863 ‘POLQA’.
These data sets¹ were used for calculating the prediction performance by means of residual square errors or
correlation coefficients. The residual square error or – as in previous times – Pearson’s correlation coefficient
is the indicator for the accuracy of the objective measure; it is given by the remaining prediction error to the
‘true’ scores obtained in the subjective tests.
These values give an overview of the performance in general. However, the numbers actually reached depend
on the construction of the data set and the kind of conditions it contains. There are always test
conditions that a model can predict ‘easily’ and accurately (e.g. noises, waveform codecs and
so on) and others where the deviation is higher (usually combinations of distortions). The occurrence of such
conditions in a data set has a strong influence on these figures. This is not only due to the objective
prediction method but is also caused by uncertainties of the listeners in the auditory tests.
For the P.863 ‘POLQA’ evaluation ITU-T has chosen a statistical approach that is based on an r.m.s.e.
calculation, but takes the uncertainty of the subjectively derived MOS values into account. Based on these
figures, the performance evaluation of P.863 ‘POLQA’ compared to P.862.1 and P.862.2 ‘PESQ’ was done.
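An r.m.s.e. that takes the uncertainty of the subjective scores into account can be sketched as follows. The formulation below, in which prediction errors inside the 95% confidence interval of a subjective score count as zero, is a commonly used one; whether the published figures use exactly this normalization is an assumption here, and the function name, the `dof` parameter and the sample values are illustrative only.

```python
import math


def rmse_star(subjective, objective, ci95, dof=1):
    """r.m.s.e. that ignores errors inside the confidence interval of each subjective score.

    subjective, objective: MOS values per condition; ci95: half-width of the 95%
    confidence interval per condition; dof: degrees of freedom removed by any
    mapping applied beforehand (assumed to be 1 if no mapping is used).
    """
    if not (len(subjective) == len(objective) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    squared = 0.0
    for mos_s, mos_o, ci in zip(subjective, objective, ci95):
        error = max(0.0, abs(mos_s - mos_o) - ci)
        squared += error * error
    return math.sqrt(squared / (len(subjective) - dof))


# Made-up values: only the second condition lies outside its confidence interval.
example = rmse_star([4.0, 3.0, 2.0], [4.1, 3.5, 2.0], [0.2, 0.2, 0.2])
```

With this measure a model is not penalized for deviations that are smaller than the uncertainty of the listening test itself.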
Table 1: Improvement in performance of P.863 ‘POLQA’ over P.862 ‘PESQ’

rmse*                        P.862.1 'PESQ'       P.863 'POLQA' in NB mode     Improvement by
Classical narrowband exp.    0.157                0.123                        22%
Advanced narrowband exp.     0.227                0.154                        32%

rmse*                        P.862.2 'PESQ-WB'    P.863 'POLQA' in SWB mode    Improvement by
Wideband experiments         0.345                0.150                        57%
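The ‘Improvement by’ column of Table 1 can be reproduced if it is read as the relative reduction of rmse* (an assumption, but one that matches all three rows):

```python
def improvement_percent(old_rmse, new_rmse):
    """Relative reduction of the prediction error in percent."""
    return 100.0 * (old_rmse - new_rmse) / old_rmse


table_1 = {
    "Classical narrowband": (0.157, 0.123),
    "Advanced narrowband": (0.227, 0.154),
    "Wideband": (0.345, 0.150),
}

for name, (pesq, polqa) in table_1.items():
    print(f"{name}: {improvement_percent(pesq, polqa):.0f}%")
```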

¹ A data set, also often called experiment or database, is a set of speech files processed or transmitted under different
real field or simulated conditions and scored subjectively. A data set usually consists of about 200 individual speech
samples. The prediction accuracy is calculated by comparison of the MOS scores given by the listeners and the
prediction by the objective measure, e.g. P.863 ‘POLQA’.

The so-called classical set of narrowband experiments covers 22 data sets already used in ITU-T for
standardization efforts from the mid-1990s until about 2003. They contain common codec and noise
distortions, mobile channels of the 2nd and 3rd generation as well as VoIP as it was state of the art at the
turn of the millennium. Even though these databases cover distortions that were already used during the
development of P.862 ‘PESQ’, the new method P.863 ‘POLQA’ shows even higher prediction accuracy here.
The advanced set of narrowband experiments is more focused on the latest coding technologies, frame loss
concealment strategies, noise reduction and of course 3rd and 4th generation mobile as well as the newest
VoIP implementations. This set is based on 15 data sets. The improvement reached with the new method
P.863 ‘POLQA’ is evident. This set covers a wide range of test conditions of the latest technologies for
which P.863 was designed.
Finally, there was a set of common wideband data as well. It covers 7 different data sets. Here the
improvement over P.862.2 ‘PESQ-WB’ is the largest (57% in Table 1).

3 Narrow-band Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the narrowband test


The idea of a narrowband test is a test situation in which ‘a listener’ listens to a speech signal using a
conventionally shaped telephone handset. That means he or she is restricted to the telephony bandwidth
and expects such a band-limited signal as well. As a consequence, a perfect-sounding but band-limited
signal will get a high score, since it exactly matches that listener’s expectation of excellent quality in this
context. Even if a channel has a wider audio bandwidth, the listener would not experience this, since
the transfer characteristic of the handset limits the bandwidth.
Therefore, a narrowband speech test, independent of whether it is performed with P.863 ‘POLQA’, with
P.862 ‘PESQ’ or with SQuad in narrowband mode, always models a conventional narrowband telephony
situation.
This test approach has been very commonly used for years or even decades and is perfectly suited
for the characterization of narrowband networks and systems. However, no qualification relative to wideband or
super-wideband systems is possible.
Of course, neither P.863 ‘POLQA’ nor SQuad ‘listens’ to the speech signal using a handset. The handset
transfer characteristic is modeled in the algorithms themselves. That means only spectral parts that are
perceptible via such a handset will be analyzed by the quality predictor. Both P.863 ‘POLQA’ and SQuad analyze
signals as they are recorded at an electrical interface to the network. This can either be an ISDN or PSTN line or
a headphone connector of a mobile device. The specification of such a receiving handset is called IRS
receive (IRS: Intermediate Reference System) and is described in ITU-T P.48. The IRS receive characteristic
can be imagined as a weak telephone band-pass with a slight preference for higher frequencies towards
3 kHz.
Reference circuit with the assumption of a channel gain of 0 dB and an Overall Loudness Rating of 10 dB

[Diagram. Upper row (levels): 89 dB(A) SPL, electrical interface at -26 dB ovl, network / real channel,
electrical interface at -26 dB ovl, 79 dB(A) SPL.
Lower row (test setup): reference speech signal, model of microphone, electrical interface, network / real
channel, electrical interface, model of handset, psycho-acoustic model. A copy of the reference speech signal
also passes a model of the handset into the psycho-acoustic model. The handset models and the
psycho-acoustic model form the MOS predictor (i.e. POLQA).]

Figure 11: Insertion and capturing in a speech test setup


(SPL: sound pressure level, OVL relative to overload point)

Similar to this, the sending direction is modeled in this narrow-band setup as well. The source speech signal
is inserted into an electrical interface, either a PSTN or ISDN line or the microphone input of a mobile
device. In reality, at this point the signal has already passed the microphone and some voice processing
components. To emulate this part of the signal path, a model of a typical narrowband microphone is applied. This
is called IRS send, since it models the device in sending direction. It can also be imagined as a weak
telephony band-pass but with a quite strong pre-emphasis up to 3 kHz. This makes the speech sound a bit
‘sharp’ but with higher intelligibility in background noise situations.²
Figure 11 schematizes the idea behind a narrow-band test. The modeled sending device allows a direct
electric coupling to the channel under test and guarantees reproducible results independent of the
microphone actually used.
The frequency responses of the two filters modeling the device are given in Figure 12. It is clearly visible
that there is a bandwidth limitation to the telephony band, although a slightly wider band than just
300 to 3400 Hz can pass.

[Two plots of a / dB (-30 to 10) over f / Hz (0 to 4000): IRS send direction (ITU-T P.48) and IRS rcv
direction (ITU-T P.48)]

Figure 12: IRS in send and receive direction as specified in ITU-T P.48

While for ISDN and PSTN interfaces defined level and impedance requirements are given and fulfilled by the
interface devices, for mobile phones only the headset connector is available as a proprietary interface.
SwissQual’s connector interface for mobile phones is adjusted for this type of interface. It applies the correct
level, adjusts the frequency response and matches the impedance of each individual phone type, and thus
enables a quasi-standard electrical network termination point even for mobile handsets.

Speech reference signals for narrow-band tests


ITU-T P.863 ‘POLQA’ is a so-called full-reference model. Its basic approach is the same as that of
comparable measures such as SQuad or P.862 ‘PESQ’: it compares the received and
potentially degraded signal with an undistorted reference signal. This allows a very detailed and fine analysis
of any kind of difference between the two signals. To take human perception into account, a model of
the listening device (in the narrow-band case a handset) is first applied within the model itself. This way,
exactly the signal that would be heard using such a device is compared. In a narrow-band test case the signal
is compared to an optimal narrow-band reference.
ITU-T P.800 and P.862.3 give constraints and requirements for the speech samples to be used. These
mainly concern the temporal structure and signal level of the speech signal. SwissQual’s measurement systems
provide a set of speech samples in different languages. All speech samples follow the same rules of composition
and pre-processing; they are all composed of a meaningful female and a male sentence. The reference
speech file is 6 s in length and contains more than 3.5 s of active speech. There is a speech pause between
the two sentences as required by P.862.3 and P.863. The signal is adjusted to a speech r.m.s. level of
-26 dB rel. OVL³, which corresponds to an analogue level of -20 dBm at a 600 Ohm four-wire interface.
In addition, P.862.3 defines the insertion and capturing process for the speech signal. These definitions
describe the insertion point and the expected spectral characteristics. In narrow-band channels the signals
are inserted after a pre-filtering according to the IRSsend characteristic as shown and explained above.
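The level figures quoted above can be checked with a short calculation. This is a sketch under assumptions: the level is computed as a plain r.m.s. value over all samples (a real speech level meter would consider only active speech), and the overload point is taken as 32768 for 16-bit samples. The function names are illustrative.

```python
import math

OVERLOAD = 32768.0  # assumed full scale of 16-bit PCM


def rms_level_db_ovl(samples):
    """Plain r.m.s. level of the samples in dB relative to the overload point."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / OVERLOAD)


def scale_to_level(samples, target_db=-26.0):
    """Apply one constant gain so that the r.m.s. level equals target_db."""
    gain = 10.0 ** ((target_db - rms_level_db_ovl(samples)) / 20.0)
    return [s * gain for s in samples]


def dbm_to_volts(dbm, impedance_ohm=600.0):
    """r.m.s. voltage of a power level in dBm at the given impedance."""
    power_watt = 1e-3 * 10.0 ** (dbm / 10.0)
    return math.sqrt(power_watt * impedance_ohm)


tone = [10000.0 * math.sin(2.0 * math.pi * i / 16.0) for i in range(1600)]
adjusted = scale_to_level(tone)
target_rms = OVERLOAD * 10.0 ** (-26.0 / 20.0)  # about 1642 in 16-bit sample units
line_voltage = dbm_to_volts(-20.0)              # about 0.0775 V at 600 Ohm
```

A level of -26 dB rel. OVL thus corresponds to an r.m.s. value of roughly 1642 in 16-bit sample units, and -20 dBm at 600 Ohm to roughly 77.5 mV.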

² This characteristic is taken from older carbon microphones: the pre-emphasis was intended to compensate for the
low-pass characteristic of the inductively loaded analogue lines of that time.
³ The value of -26 dB relates to an overload point of 32767/-32768 as used with 16-bit resolution in the digital
signal domain.

What are the differences to the previous ITU P.862 ‘PESQ’?


Actually the differences are quite small for common applications in cellular networks. A customer may only
see a slightly changing MOS-LQ value for error-free or high quality transmission using EFR or AMR with
higher bitrates. Instead of a typical value in the range of 4.0 for EFR, they may now obtain values in the
range of 3.9. This is mainly due to the fact that the actual bandwidth limitation in the narrowband channel is
also considered by P.863 ‘POLQA’, i.e. limitations relative to the IRSrcv frequency response. In case the
actually used bandwidth is slightly narrower, the MOS will be lower by a very narrow margin as well.
An improvement will be seen for EVRC type codecs as used in CDMA. The new P.863 ‘POLQA’ shows an
even better comparability to EFR/AMR codecs.⁴
Furthermore, POLQA is trained for scoring complex channels including more than just a codec, e.g. noise
reduction, variable gain and filtering as well as strong time warping.
The following table shows the main differences in scores between SQuad version ‘08’, P.862.1 ‘PESQ’ and
P.863 ‘POLQA’.
The results are based on typical speech samples and are an average across six speech samples (i.e.
American English as used in SwissQual Diversity). Except for the ‘transparent transmission’ conditions, all
samples were pre-filtered by IRSsend.
Table 2: Typical predicted MOS-LQ values for common transmission techniques for SQuad ‘08’, as well as
P.862.1 ‘PESQ’ and P.863 ‘POLQA’.

Condition                                               P.862.1         SQuad-LQ ‘08’   P.863
                                                        (narrowband)    (narrowband)    (narrowband)

Linear distortions
Transparent transmission ~40 – ~3800 Hz                 4.50            4.50            4.50
Transparent transmission ~180 – ~3500 Hz (G.712)        4.40            4.50            4.30
Transparent transmission ~200 – ~3500 Hz (IRSsend)      4.50            4.50            4.40
Transparent transmission 300 – 3400 Hz (box block)      4.10            4.30            3.60
IRSsend + G.711 (A-Law standard PCM)                    4.40            4.40            4.30

Codec conditions
IRSsend + EFR / AMR 12.2 kbps                           4.15            4.15            4.20
IRSsend + EFR (real loss-free connection)               4.10            4.15            4.10
IRSsend + QCELP 13 kbps                                 3.90            4.00            4.00
IRSsend + EVRC 9.5 kbps                                 3.75            3.90            3.90
IRSsend + EVRC-B 9.3 kbps                               3.75            4.00            3.90
IRSsend + AMR 7.95 kbps                                 3.90            4.00            3.95
IRSsend + AMR 6.70 kbps                                 3.75            3.90            3.85
AMR 4.75 kbps                                           3.40            3.70            3.65

⁴ ITU-T and 3GPP do not recommend the use of the P.862 family for EVRC-type codecs.

The codecs are used as reference software implementations. In addition, one EFR condition is shown as it
behaves in a real loss-free channel, using a commercial Nokia handset as access device to the network. The
channel was terminated by an ISDN card device running G.711 A-Law.
First, P.863 ‘POLQA’ gives a very slightly more pessimistic prediction than SQuad08.
However, for practical use cases this absolute difference is negligible. Compared to P.862.1, the higher rates
of AMR match very well, whereas the lower rates are scored higher by P.863. In addition, the EVRC type
codecs are scored higher and more realistically by P.863 and especially SQuad08 compared to P.862.1.
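The statements above can be checked directly against the codec conditions of Table 2. The sketch below only re-types the table values and computes score differences; the dictionary layout and helper name are illustrative.

```python
# (P.862.1, SQuad-LQ '08', P.863) in narrowband mode, codec conditions of Table 2
codec_scores = {
    "EFR / AMR 12.2 kbps": (4.15, 4.15, 4.20),
    "EFR (real loss-free connection)": (4.10, 4.15, 4.10),
    "QCELP 13 kbps": (3.90, 4.00, 4.00),
    "EVRC 9.5 kbps": (3.75, 3.90, 3.90),
    "EVRC-B 9.3 kbps": (3.75, 4.00, 3.90),
    "AMR 7.95 kbps": (3.90, 4.00, 3.95),
    "AMR 6.70 kbps": (3.75, 3.90, 3.85),
    "AMR 4.75 kbps": (3.40, 3.70, 3.65),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Differences of P.863 against the two other methods, per condition.
polqa_minus_pesq = {name: round(s[2] - s[0], 2) for name, s in codec_scores.items()}
polqa_minus_squad = {name: round(s[2] - s[1], 2) for name, s in codec_scores.items()}

mean_vs_squad = mean(polqa_minus_squad.values())
mean_vs_pesq = mean(polqa_minus_pesq.values())
```

Averaged over these eight conditions, P.863 lies about 0.03 MOS below SQuad08 and about 0.1 MOS above P.862.1, with the largest gain over P.862.1 at the lowest AMR rate.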
P.863 ‘POLQA’ considers linear distortions and bandwidth limitations in its score. For super-wideband mode
this is obvious: there, a signal is always compared to a super-wideband reference (50 to 14000 Hz). It is
important to note that P.863 ‘POLQA’ in narrow-band mode considers a ‘full narrow-band’ signal (~50 to
3800 Hz) as reference. To this signal an IRSrcv filter is applied in P.863 ‘POLQA’ itself. That means
limitations narrowing this bandwidth will lead to a predicted distortion. With P.863 ‘POLQA’ the actual channel
filters and band-pass characteristics in the microphone and loudspeaker path of the mobile phone used are
taken into account more than was the case for P.862 ‘PESQ’.⁵
SwissQual’s SQuad08 also considers linear distortion in narrow-band mode; however, it is less sensitive than
P.863 ‘POLQA’ and is supposed to be less dependent on the phone actually used and its internal filtering.

SwissQual’s speech quality suite offers two methods for predicting listening quality: The known SQuad08
and the new ITU-T P.863 ‘POLQA’. Both models may be combined with ITU-T P.862 ‘PESQ’ as an option.
The entire framework as known from SQuad including the voice samples, the insertion and capturing
procedure and – of course – all of the additional signal analysis results are used and available for
P.863 ‘POLQA’ in the same way.

Test definition and result presentation


The definition of tests, the timing and the selection of speech files are exactly the same as for ‘Speech’ tests
using SQuad. The only difference is the naming: speech tests with P.863 are called ‘Speech POLQA’.
P.863 ‘POLQA’ is embedded in the same framework that is used for SQuad. For P.863 ‘POLQA’ tests as
well, all additional information such as levels, noise analysis, delay variations and frequency response are
calculated and are available. Consequently, the obtained results are presented in the same format in
SwissQual’s NQDI. This underlines the close relationship between the speech quality measures. Only the
MOS prediction is either made by SQuad or by P.863 ‘POLQA’; the value measured by both is ‘Listening
Quality’ as indicated by the type of test.

Figure 13: P.863 ‘POLQA’ narrowband main result representation in NQDI


To differentiate P.863 ‘POLQA’ tests from SQuad and P.862 ‘PESQ’, the method actually used is given in
parentheses after the label ‘Listening Quality’. For immediate visual feedback, the POLQA logo is shown
right below the predicted MOS score.

⁵ Since P.863 ‘POLQA’ measures the actual spectral loss of the speech signal, the actual impact of band limitations
depends on the spectral power distribution of the speech sample. That means that samples are more or less
affected by this filtering depending on their spectral characteristics, e.g. losing more or fewer high-frequency parts.

In addition to the global values for the entire speech sample, graphs illustrate the quality profile over the
sample duration, the signal envelopes as well as the signal gain.

Figure 14: P.863 ‘POLQA’ narrowband detail result representation in NQDI


P.863 ‘POLQA’ is treated as a separate method for listening quality measurements in NQDI. The test
selection tab sheet in NQDI can be used to select individual P.863 ‘POLQA’ tests.

Figure 15: P.863 ‘POLQA’ test selection in NQDI


For reporting, the group of ‘Voice’ reports in NQDI includes an ‘LQ narrowband statistic’ report. It reports not
only the P.863 results but also the results of all other algorithms such as SQuad and P.862 ‘PESQ’ in the
same table. The results for each algorithm are given in a separate column.

Figure 16: P.863 ‘POLQA’ statistical report in MS EXCEL

4 Wideband Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the wideband test


The idea of a wideband or – in correct terms – a super-wideband test is a test situation in which ‘a listener’
listens to a speech signal using HiFi headphones. This means that he or she is not restricted to any
bandwidth. The headphone is able to transmit the entire perceptible audio bandwidth.
As a consequence, a perfect-sounding signal that is not band-limited will get a high score, since it exactly
matches the listener’s expectation of excellent quality in such a setup. On the one hand, the headphone
equipment itself sets a high expectation; on the other hand, the listener ‘knows’ the unlimited speech signal,
since it is presented in this experimental context.
The modeling is similar to the narrowband telephony case as shown in Figure 11 but is adapted to the
changed setup. This means that the MOS predictor, e.g. P.863 ‘POLQA’, will not ‘listen’ through a telephony
handset, but rather models a headphone as listening device. It is modeled in a simplified manner as a flat
filter from 50 to 14’000 Hz.⁶
Similarly, a wideband or super-wideband device no longer follows the IRS send filter characteristic in the
microphone path. It is also close to a flat filter with a band limitation at a higher frequency.
As a consequence, the channel or system under test receives a super-wideband input signal. If
there is a narrowband channel or device, this channel or device will restrict the bandwidth. At the other
end the predictor ‘listens with a headset’ and compares the received signal to the unlimited reference signal.
This leads to recognition of the missing spectral parts, and this ‘missing information’ consequently leads to a
drop in quality. It can be imagined as listening to HiFi signals through a headphone and suddenly being presented
with a narrow-band signal. A human listener will also perceive this as lower in quality.
The transmitted bandwidth becomes much more important for a super-wideband test. Restrictions in audio
bandwidth are always compared relative to a super-wideband reference signal.
Note: The use of the super-wideband test scenario is NOT restricted to wideband or super-wideband
channels or devices. The scenario just defines the reference that is super-wideband in this case. Test
scenarios in super-wideband will be the common test case in the near future. They are required not only for a
valid evaluation of wideband systems but also for the correct ranking of systems or networks with different
bandwidths down to narrowband.
A super-wideband test scenario implies some technical requirements. Within SwissQual’s product lines the
whole audio processing chain, from the handset’s audio connector across the analogue circuits in Diversity to
the digital signal processing, has been designed from the beginning for higher sampling frequencies and audio
bandwidths.
With Diversity Release 10.2.0, SwissQual launched a super-wideband test application for the first
time, together with SQuad. Now, in Release 10.6.5, the wideband test application has been completed by the
integration of the new ITU-T Recommendation P.863 ‘POLQA’.

Wideband speech reference signals


As already mentioned, ITU-T P.863 ‘POLQA’ is a so-called full-reference model. It compares the received
and potentially degraded signal with an undistorted reference signal. In this scenario, that reference is
practically unlimited in bandwidth.

6 For narrow-band mode P.863 ‘POLQA’ applies an IRS receive filter that emulates a narrow-band handset
(see Figure 12: IRS in send and receive direction as specified in ITU-T P.48).

This is the difference from the narrow-band case: the recorded signal is compared to a
super-wideband reference. In the same way, the recorded signal is not post-filtered, so that no band
limitation is introduced at the receiving side; this models a receiving HiFi headphone.
That means, in case of a ‘full-band’ audio channel (e.g. a VoIP connection using full audio bandwidth or an
application using MP3 with sufficient bitrate, as in video or audio streaming), the recorded signal matches
the reference in its bandwidth. In case of a common wideband or even a narrow-band channel or device, the
bandwidth becomes limited during transmission. When this signal is recorded and compared to the full
reference, the spectral loss is weighted as degradation.
Of course, the exploration of a wideband channel also requires the insertion of a signal with sufficient
bandwidth. To actually feed wideband signals into the channel, new voice samples were recorded. They have
no perceptible bandwidth limitation and are stored at 32kHz sampling frequency in a separate
reference folder ‘Speech-Wideband’ or ‘Speech-Wideband POLQA’ respectively. As usual, the samples are
constructed out of a male and a female spoken sentence and have a constant length of 6s. Thus,
continuity with the narrowband tests is fully preserved.
For the time being, SwissQual provides samples in:
German (German pronunciation)
German (Swiss pronunciation)
British English
Italian
Dutch
Each language sample is provided without any pre-filtering (except for a 50 – 14’000Hz band-pass) and
named e.g. GE_fm_wide.wav. As specified for wideband devices, the microphone path is considered as flat in
the transmission band. This means that no IRS send filter, as required for narrowband, is applied. The signal
remains ‘flat’, without any further band limitation and without any pre-emphasis as in the IRS.
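
The stated sample properties (32kHz sampling frequency, constant length of 6s) can be checked before a file is used as a reference. The following is a minimal sketch using only Python's standard `wave` module. The function name, the tolerance and the idea of a pre-check are assumptions made for illustration; they are not part of any SwissQual tool.

```python
import wave

# Properties of a super-wideband reference sample as stated in the text.
EXPECTED_RATE_HZ = 32000
EXPECTED_DURATION_S = 6.0


def check_reference_sample(path, tolerance_s=0.05):
    """Return (ok, sample_rate_hz, duration_s) for a WAV file.

    The tolerance on the duration is a hypothetical choice for this sketch.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    ok = (rate == EXPECTED_RATE_HZ
          and abs(duration - EXPECTED_DURATION_S) <= tolerance_s)
    return ok, rate, duration
```

A call such as `check_reference_sample("GE_fm_wide.wav")` would then return whether the file matches the expected format, together with the measured sampling rate and duration.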

What are the differences to narrowband?


In traditional telephony scenarios, the expectation is set to a perfect but narrowband voice signal. A signal
that is close or identical to such a signal is scored with a high quality value (usually a MOS-LQ of around 4.5
on a five-point scale; see footnote 7). Additional degradations will decrease the quality value down to a minimum of 1.0.
Within a wideband scenario, the expectation of excellent quality is a perfect wideband speech signal. Since
the same scale is used here, such a perfect wideband signal is scored with 4.5 too. Obviously, a narrowband
signal in the same context will not fulfill the expectation of high quality due to its band limitation.
Consequently, it will be scored lower in this context.
Roughly speaking, this is the main difference. There are other effects, such as a different perception of noise,
since noise components in the higher frequency ranges are masked less, or not at all, by voice.
But the main difference will be the lower scores for narrowband signals.
Most important for customers will be typical values to be obtained with the wideband application compared to
narrowband measurements.
The following table shows typical values obtained in the two test scenarios for the same type of conditions by
averaging the predicted scores of five different speech samples.

7 For more detailed information, please refer to:
‘White Paper – About MOS and Quality Measurements’, published by SwissQual AG in 2011.

Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques in a wideband and a narrowband context

Condition                                                         P.863 in           P.863 in
                                                                  super-wideband     narrowband
                                                                  (50-14000 Hz)      (300-3400 Hz)
Transparent transmission 50 – 14000Hz or wider                    4.75               -
Transparent transmission 50 – 7000Hz (‘common’ wideband)          4.3                -
AMR-WB 12.65 kbps (50 – 7000Hz)                                   3.8                -
Transparent transmission 50 – 3800Hz (‘Full Narrowband’)          3.6                4.5
Transparent transmission ~250 – 3500Hz (‘IRSsend’)                3.5                4.4
Transparent transmission 300 – 3400Hz (‘telephony box block’)     3.0                3.6
IRSsend + G.712 + G.711 (A-Law standard PCM channel)              3.5                4.3
IRSsend + EFR / AMR 12.2 kbps                                     3.2                4.2
IRSsend + EVRC 9.5 kbps                                           3.0                3.9
IRSsend + EVRC-B 9.3 kbps                                         3.0                3.9
IRSsend + AMR 7.95 kbps                                           2.9                3.9

It can be seen that the rank-order of the systems remains largely independent of the test scenario. The upper
range of the wideband scale is used only for the high-quality wideband voice samples. The common
narrowband scenarios are compressed to the lower 60% of the scale and thus show a smaller gradient as
well.
In case of optimizing and benchmarking pure narrowband networks and applications, the common
narrowband test application can be used without any problems. The individual systems are more clearly
discriminated due to the wider scale range used.
For optimizing wideband applications and networks and especially for benchmarking of wideband networks
against narrowband ones, a wideband test application is required.
Firstly, the degradations in wideband mode can only be assessed in a wideband test application and
secondly, a wideband signal can only ‘show’ its better quality against narrowband in wideband mode.
Note: Narrowband MOS-LQ values and wideband MOS-LQ values must never be mixed or directly
compared. They are referring to different interpretations of the MOS scale.
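
One way to enforce this rule in post-processing is to store every score together with the mode it was obtained in and to refuse any comparison across modes. The sketch below illustrates the idea; the class `MosScore`, the function `mos_difference` and the mode labels "NB" and "SWB" are hypothetical names chosen for this example and do not correspond to any SwissQual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MosScore:
    """A predicted MOS value tagged with the test scenario it refers to."""
    value: float
    mode: str  # "NB" (narrowband) or "SWB" (super-wideband); labels are assumptions

    def __post_init__(self):
        if self.mode not in ("NB", "SWB"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not 1.0 <= self.value <= 5.0:
            raise ValueError("MOS values lie on a five-point scale from 1.0 to 5.0")


def mos_difference(a, b):
    """Return a.value - b.value, but only if both scores share one mode."""
    if a.mode != b.mode:
        raise ValueError("narrowband and super-wideband MOS values must not be compared")
    return a.value - b.value
```

With the narrowband values from Table 3, `mos_difference(MosScore(4.2, "NB"), MosScore(3.9, "NB"))` yields the difference between AMR 12.2 kbps and EVRC, while mixing an "NB" and an "SWB" score raises an error instead of returning a misleading number.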

Where wideband quality can be assessed


Although wideband is a normal use case in everyday communication such as TV and FM radio, it is still not
popular in telecommunications.
It was used for commercial video conferencing systems, but it was Internet Telephony that enabled
wideband telephony for normal users for the first time. Today, common VoIP clients support a wide range of
wideband codecs and use them if a sufficient bit-rate is available for the service.
Now the next step in wideband telephony is the evolution of cellular networks and handsets. The networks
and user devices are being equipped with AMR-WB, allowing an audio bandwidth up to 7000Hz while still
remaining in the typical bit-rate range used for GSM and UMTS.
Typical applications where wideband is already in use are mobile connections in GSM / UMTS. Here the first
operators have enabled AMR-WB in TrFO mode. Usually, in GSM the AMR-WB bitrate is restricted to
12.65 kbps while in UMTS AMR-WB bitrates up to 23.85kbps are used. Another application that can be
tested in real field applications with Diversity today is VoIP connections. Here even super-wideband transmission
is possible. There are different codecs in use, with both standardized and proprietary solutions. Both
were considered in the huge training set for SQuad and P.863 ‘POLQA’.
The main focus of Diversity’s wideband test solution is of course the evaluation and benchmarking of
wideband channels in cellular networks.
An additional application area for wideband voice testing in Diversity is video streaming. In video streaming
audio codecs are usually used; these don’t have any bandwidth restriction, except in very low bitrate
conditions. Consequently, Speech Wideband as a test case is also applied to video streaming starting with
Release 10.2 of Diversity and completed in Release 11.0 with the full support of ITU-T P.863 ‘POLQA’.

Wideband analysis in Diversity


The super-wideband test application forms a stand-alone test ‘Speech Wideband POLQA’, whilst the tests
‘Speech’ and ‘Speech POLQA’ remain at narrowband.
While ‘Speech Wideband’ runs the SQuad algorithm, ‘Speech Wideband POLQA’ enables P.863 as the
voice quality estimator. The same is true for ‘Speech’ and ‘Speech POLQA’.
All these test types, ‘Speech’, ‘Speech Wideband’, ‘Speech POLQA’ and ‘Speech Wideband POLQA’ can be
selected as separate tests with SwissQual’s Test Manager and in the post-processing tools NQDI and
NQView. The presentation in NQDI looks almost the same; only the test name differs, to distinguish
between the tests and the algorithms used.

Figure 17: P.863 ‘POLQA’ wideband main result representation in NQDI


The application type (highlighted in red) explains the modeled listening situation in detail. In addition, since a
potential bandwidth reduction is a serious impact in a wideband scenario, the actual bandwidth of the
channel is measured and reported as well (highlighted in green). There are three classes:
narrowband (up to ~3’800Hz)
wideband (up to 8’000Hz)
super-wideband (up to 14’000Hz).
The remaining values are the same as usual and well known for SQuad and are visible in narrowband tests
as well. They provide information about the speech level, noise floor, the amount of missed voice and the
gain applied by the channel.
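
The three bandwidth classes listed above can be expressed as a simple threshold rule. This is only an illustrative sketch: the limits are taken from the text, but treating them as hard thresholds on the measured upper band limit is an assumption, and the actual decision rule used in NQDI is not described here. The function name is hypothetical.

```python
# Upper band limits in Hz for the reported classes, as listed in the text.
NARROWBAND_MAX_HZ = 3800
WIDEBAND_MAX_HZ = 8000


def classify_bandwidth(upper_limit_hz):
    """Map a measured upper band limit (Hz) to one of the three classes.

    Hard thresholds are an assumption of this sketch.
    """
    if upper_limit_hz <= 0:
        raise ValueError("upper band limit must be positive")
    if upper_limit_hz <= NARROWBAND_MAX_HZ:
        return "narrowband"
    if upper_limit_hz <= WIDEBAND_MAX_HZ:
        return "wideband"
    return "super-wideband"
```

For example, a channel measured up to 3’400Hz would be reported as narrowband, an AMR-WB channel up to 7’000Hz as wideband, and a transparent channel up to 14’000Hz as super-wideband.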
The tab sheet ‘Speech Details’ clearly shows the audio bandwidth of the measured audio channel, in this
case a ‘common’ wide band channel up to almost 8’000Hz (Figure 18).

Figure 18: P.863 ‘POLQA’ wideband audio bandwidth representation in NQDI

The lower and upper bound are marked with blue lines. As is clearly visible, Diversity and ITU-T P.863 make
use of real super-wideband signals. The frequency scale here ends at 16’000 Hz; this corresponds to an
internal sampling frequency of 32’000 Hz.


5 Real field measurements


One of the most important questions is the relation of P.863 ‘POLQA’ results to previous P.862 ‘PESQ’
measurements under real field conditions. Of course, P.862 ‘PESQ’ and P.863 ‘POLQA’ are different
algorithms and treat distortions in the signal differently. However, at the end the predicted MOS should
accurately describe the quality of the voice or of the voice channel. This means that in cases where P.862
‘PESQ’ delivered accurate predictions, the newer and improved P.863 ‘POLQA’ should predict almost the
same value. For distortions where P.862 ‘PESQ’ produced more inaccurate predictions, P.863 ‘POLQA’ as
an improved method will predict more accurately and therefore differently from P.862 ‘PESQ’ (see footnote 8).
In real field measurements the channel consists of more than just a codec. Even under perfect radio
conditions there can be other factors that limit the maximum quality. These could be further bandwidth
limitations that are due to the actual device used, or further speech processing steps such as noise and gain
control that are applied in the device or in the network. There might also be trans-coding, i.e. a second
encoding/decoding step, for example in case of mobile-to-mobile connections or in special gateways from
the mobile core to PSTN networks. For these reasons, the MOS scores obtained in a plain codec emulation
as given in Table 3 are usually only reached in real field cases where the device and the network can be
considered as transparent and do not apply further speech signal processing, such as noise or gain
control.

Results in GSM / UMTS networks compared to P.862 ‘PESQ’


The following arbitrarily picked sample shows the correspondence. It is a perfect transmission from a
perfectly matched PSTN line at the land side to a GSM network using AMR at 12.2 kbps. The audio level is
perfectly adjusted, there are no perceptible audio bandwidth limitations and there is no speech missed (no
temporal clipping, no interruptions). Most notably, the devices can be considered as quite transparent, as
they don’t apply aggressive noise reduction or gain control mechanisms. The results can be considered as
identical between the two measures.

Figure 19: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI

A good example of the difference between the two algorithms is the treatment of interruptions and lost
speech. Here P.862 ‘PESQ’ is suspected of scoring inaccurately and usually too optimistically. In the example,
almost 4% of the original speech was lost; however, P.862 ‘PESQ’ scores 3.2, while P.863 ‘POLQA’ only
predicts 2.7, which appears closer to the perceived quality here.

Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions

When a larger number of quality scores obtained in a drive test is analyzed, the picture remains almost the same.
The following figures are based on a drive test and a collection of data from a European operator. The
speech sample used was American English and each given number is based on a collection of around 100
individual scores.

8 P.862 ‘PESQ’ defines the algorithm technically. The actual transformation from the P.862 outcome to a MOS-like scale
is defined in P.862.1. All predicted MOS scores in this document are computed in accordance with P.862 and were
converted to the MOS domain according to P.862.1.

Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high-quality UMTS/GSM setups

                                    Average               Maximum
Direction   Network     Device      P.862.1    P.863      P.862.1    P.863
                                    PESQ       POLQA      PESQ       POLQA
Downlink    UMTS 2100   Device A    3.97       3.97       4.19       4.19
                        Device B    4.04       4.06       4.17       4.17
            GSM 900     Device A    3.78       3.77       4.19       4.18
                        Device B    3.87       3.87       4.17       4.20
Uplink      UMTS 2100   Device A    3.92       3.80       4.13       4.02
                        Device B    4.01       3.83       4.12       4.04
            GSM 900     Device A    3.74       3.60       4.11       4.01
                        Device B    3.78       3.59       4.10       3.99

Just looking at ‘Downlink’, which is usually the less critical direction, the PESQ and POLQA averages differ
by at most 0.02, which is completely negligible. There are small differences in
average between the phones and the two technologies GSM and UMTS. But the behavior is always the
same for either method, i.e. GSM 900 is scored lower by about 0.2 MOS on average with both methods.
In Uplink the situation is slightly different. Here P.863 ‘POLQA’ scores slightly lower than PESQ, on average
by about 0.15 MOS. This effect is due to several reasons, the main one being the more restricted audio bandwidth
of the microphone path of the mobile device, which is used in Uplink. By contrast, the Downlink
uses the (wider) loudspeaker path of the phone. The former P.862 ‘PESQ’ compensates for the frequency
response of the channel and therefore mostly ‘ignores’ that band limitation. P.863 ‘POLQA’ considers
changes in bandwidth as they are perceived by a user, and consequently a limitation leads to a slightly
lower score here.
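
The differences quoted above can be recomputed directly from the average columns of Table 4. The short sketch below does this; the data structure and variable names are chosen for illustration only, while the numbers are copied from the table.

```python
# Average scores from Table 4: (direction, network, device) -> (PESQ, POLQA)
TABLE_4_AVERAGES = {
    ("Downlink", "UMTS 2100", "A"): (3.97, 3.97),
    ("Downlink", "UMTS 2100", "B"): (4.04, 4.06),
    ("Downlink", "GSM 900", "A"): (3.78, 3.77),
    ("Downlink", "GSM 900", "B"): (3.87, 3.87),
    ("Uplink", "UMTS 2100", "A"): (3.92, 3.80),
    ("Uplink", "UMTS 2100", "B"): (4.01, 3.83),
    ("Uplink", "GSM 900", "A"): (3.74, 3.60),
    ("Uplink", "GSM 900", "B"): (3.78, 3.59),
}


def pesq_minus_polqa(direction):
    """Return the PESQ minus POLQA differences for one direction."""
    return [pesq - polqa
            for (d, _network, _device), (pesq, polqa) in TABLE_4_AVERAGES.items()
            if d == direction]


downlink = pesq_minus_polqa("Downlink")
uplink = pesq_minus_polqa("Uplink")

# Largest absolute deviation between the two measures in Downlink
max_abs_downlink = max(abs(x) for x in downlink)
# Mean amount by which POLQA scores below PESQ in Uplink
mean_uplink = sum(uplink) / len(uplink)

print(f"Downlink, largest |PESQ - POLQA|: {max_abs_downlink:.2f} MOS")
print(f"Uplink, mean PESQ - POLQA: {mean_uplink:.2f} MOS")
```

With the Table 4 values, the largest downlink deviation is 0.02 MOS and the mean uplink difference is just under 0.16 MOS, in line with the figures discussed in the text.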
Besides the average values, the distribution of the predicted values provides information about the behavior
of the measures. The following two graphs are based on the downlink scores of Device ‘A’ in UMTS 2100 as above.

[Bar chart: Listening Quality distribution (P.862.1 ‘PESQ’), PDF. Horizontal axis: Listening Quality in 0.1-wide MOS bins from 1.0 to 4.9; vertical axis: PDF, number of values, 0% to 50%.]

Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’ (Device A, UMTS, Downlink as in Table 4)


[Bar chart: Listening Quality distribution (P.863-NB ‘POLQA’), PDF. Horizontal axis: Listening Quality in 0.1-wide MOS bins from 1.0 to 4.9; vertical axis: PDF, number of values, 0% to 50%.]

Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’ (Device A, UMTS, Downlink as in Table 4)

Both distribution functions are very close and concentrate a large majority of the scores in the range of 4.0 to
4.2, which corresponds to the best quality in error-free connections. It is logical that a certain quality can’t be
exceeded. This limit is set by the coding scheme, the channel limits and other included voice processing. Even in
undistorted conditions these components insert a certain amount of degradation, which defines the upper level
that can’t be exceeded in this setup and causes the steep decline towards higher values on the right-hand side.
Usually, the majority of scores are in this region, which corresponds to error-free transmission.
In the direction of lower values, the distribution falls off more gradually. Values in this region indicate
degradations in addition to the unavoidable distortions. In cellular networks these problems are usually
interruptions (due to handovers), fallbacks to lower bitrates in case of AMR (due to bad radio conditions) and
frame losses that were concealed artificially by the AMR decoder. In principle there could be other distortions
as well, e.g. transcoding in case of special routing or noise bursts coupled into analogue parts of the PSTN.
Regarding the absolute maximum as shown in Table 4, there is no difference between the phones and the
technologies used, meaning that the reachable quality is identical for both and the slightly differing averages
are caused by individual test conditions, e.g. slightly different RF coupling or a few more bad channels in the
averaging process. It should be noted that the reached maximum is the same as obtained by just processing
the same speech sample over an AMR 12.2 kbps codec in offline emulation. This indicates that there are no
further distortions introduced by the phone or constant speech processing components in the network.
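
Distributions like those in Figures 21 and 22 can be derived from a list of individual MOS scores by counting them in bins of 0.1 MOS. The following is a minimal sketch of such a binning; the function name and the label format are assumptions for illustration, and the scores in the usage note are made-up values, not measurement data.

```python
from collections import Counter


def mos_histogram(scores):
    """Return the share of scores (in percent) per 0.1-wide MOS bin.

    Keys are bin labels such as "4.1-4.2". Scores are converted to integer
    hundredths first, so that values on a bin edge are not misplaced by
    floating-point rounding.
    """
    counts = Counter()
    for score in scores:
        hundredths = round(score * 100)
        lower = (hundredths // 10) * 10  # lower bin edge in hundredths
        counts[lower] += 1
    total = len(scores)
    return {
        f"{lower / 100:.1f}-{(lower + 10) / 100:.1f}": 100.0 * n / total
        for lower, n in sorted(counts.items())
    }
```

For the hypothetical input `[4.10, 4.19, 4.05, 3.2]`, half of the scores fall into the bin "4.1-4.2" and a quarter each into "4.0-4.1" and "3.2-3.3".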

Results in real field networks compared to P.862 ‘PESQ’


The previous section discussed results of an almost transparent network without further speech processing.
This is not always the case. This section introduces and discusses results of more complex and
heterogeneous network setups and routings. The following figures are restricted to downlink and the use of
the American English speech sample. In addition to the average and the absolute maximum score, a
percentile value of 80% gives a good approximation of the quality that can be achieved confidently in this
setup under ideal circumstances. This value is just below the best 20% of the samples in the collection and is
a bit more robust than an absolute maximum.
The first two lines show again the almost transparent network (named ‘Network 1’) along with a phone that
does not apply aggressive gain or noise control. The maximum values are as they can be expected with the
AMR codec running at 12.2 kbps.
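
The three summary values described above (average, absolute maximum and 80% percentile) can be computed from a collection of scores as sketched below. The text does not state which percentile definition is used; the nearest-rank method here is an assumption, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method (assumed)."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(scores):
    """Return average, absolute maximum and 80% percentile of MOS scores."""
    return {
        "average": sum(scores) / len(scores),
        "maximum": max(scores),
        "p80": percentile_nearest_rank(scores, 80),
    }
```

For ten made-up scores `[3.1, 3.5, 3.8, 3.9, 4.0, 4.0, 4.1, 4.1, 4.15, 4.19]`, the 80% percentile is 4.1, i.e. the value just below the best two samples, while the maximum of 4.19 depends on a single measurement.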


Table 5: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in common real field setups

                                                Average              Maximum              80% Percentile
Direction  Network                   Device     P.862.1   P.863      P.862.1   P.863      P.862.1   P.863
                                                PESQ      POLQA      PESQ      POLQA      PESQ      POLQA
Downlink   Network 1, UMTS           Device A   3.97      3.97       4.19      4.19       4.13      4.12
                                     Device B   4.04      4.06       4.17      4.17       4.14      4.13
           Network 2, UMTS           Device C   3.35      3.42       3.69      3.69       3.54      3.56
           Network 3, UMTS           Device D   3.98      3.79       4.10      3.97       4.03      3.87
           Network 4, CDMA / EVRC    Device E   3.30      3.29       3.76      3.66       3.62      3.54
           Network 5, CDMA / EVRC-B  Device F   3.33      3.44       3.77      3.84       3.56      3.64

For network ‘2’ the situation is different. Apart from a device that applies some gain and noise control, the
network is here limited to AMR at 5.9 kbps. This significantly reduces the achievable quality compared to a
network/device combination as in network ‘1’.
Network ‘3’ is somewhat in between networks ‘1’ and ‘2’. It enables AMR at 12.2 kbps, but the handset used
is not as transparent as devices ‘A’ and ‘B’ are. P.863 ‘POLQA’ scores these device characteristics lower
than P.862 ‘PESQ’.
The two real field CDMA networks are also in the range of networks ‘2’ and ‘3’. The quality is determined by
the coding schemes used. For EVRC-B in particular, the P.863 ‘POLQA’ scores are higher than those of P.862 ‘PESQ’.
However, aggressive noise and gain control have a strong influence on the achieved scores as well. As a result,
even the maximum scores are lower than what could be expected from plain encoding distortions.

The achieved quality figures in real field measurements depend, of course, on the RF conditions in
the network. However, it has to be considered that a certain quality can’t be exceeded due to fixed speech
processing components in the channel and in the device. Apart from comparing averages, a closer look at
the distribution of the scores and the values where most of the scores are located gives useful information
about potential reasons for non-perfect quality.

Sample dependency of P.863 ‘POLQA’ scores in real field measurements

An operator or user of a measurement system often uses one single speech sample for all measurements.
Usually the chosen sample is in the native language of the operator’s region. It has to be noted that the
measured quality depends on the speech sample actually used. This is not so much caused by the language
itself but rather by characteristics of the talkers and the recording environment used for the sample, as well
as by the chosen text and its distribution of phonemes.
The individual samples are treated differently by speech processing systems; this can lead to more or less
perceptual degradation. Even the speech quality prediction algorithm itself may prefer one speech sample
to another. The consequence of all this is that MOS predictions for each sample are not exactly the same,
even though the transmission condition is.


To get an impression of how large this deviation can be, the following analysis was made. Nine different speech
samples were transmitted consecutively in a phone call during a drive test in a real UMTS network. A total of
30 calls were made. It can be assumed that the ‘distribution of real channel quality’ was the same for all nine
samples.
The following table shows the averages, the absolute maximum values and the 80% percentiles of the MOS
scores obtained with the nine speech samples. For a better overview, the samples are grouped by language.
The test situation is the ‘reference situation’ as above, i.e. network 1 (European UMTS 2100MHz, Device ‘B’
as a quasi-transparent device and ‘uncritical’ downlink only).
Table 6: Comparison of different speech samples in common real field setups

                                           Average              Maximum              80% Percentile
Setup               Speech sample          P.862.1   P.863      P.862.1   P.863      P.862.1   P.863
                                           PESQ      POLQA      PESQ      POLQA      PESQ      POLQA
Network 1, UMTS,    American English       4.04      4.06       4.17      4.17       4.14      4.13
Downlink, Device B  German                 4.04      4.07       4.20      4.26       4.18      4.18
                    Spanish                3.98      4.03       4.21      4.31       4.19      4.25
                    Greek                  3.98      3.96       4.13      4.17       4.09      4.11
                    Russian                3.87      3.87       4.09      4.09       4.03      4.01
                    Hungarian              3.93      4.17       4.14      4.39       4.11      4.35
                    Arabic                 3.84      3.85       4.02      4.06       4.00      3.99
                    Polish                 3.82      3.95       4.09      4.32       4.06      4.26
                    Japanese               3.82      3.96       3.99      4.10       3.92      4.06

In general it can be observed that there is a considerable difference between the samples. The averages and
the maximum values span a range of >0.2 MOS in case of P.862 ‘PESQ’ and even >0.3 MOS for P.863
‘POLQA’. There are two reasons for this. First, the individual samples are treated slightly differently by the
voice processing in the channel. They are more or less affected by e.g. band-pass filtering or compression.
Secondly, P.863 ‘POLQA’ considers the talker’s timbre, i.e. the spectral power distribution of the reference and
degraded signal. Since there are differences in the talkers’ individual characteristics and
the actual recording conditions of the reference speech samples, P.863 ‘POLQA’ scores slightly differently too.
The situation is largely systematic, i.e. a speech sample that is scored slightly lower will tend to lower scores
under all realistic test conditions. Therefore, when comparing MOS values from different investigations, the
influence of the speech sample used should not be overlooked. Ideally, results that are to be compared to
each other should be based on the same speech sample or the same selection of samples.
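
The spread between the speech samples can be recomputed from the average and maximum columns of Table 6. The sketch below does so for both measures; the data layout and names are chosen for illustration, while the numbers are copied from the table.

```python
# Values from Table 6 per speech sample:
# (PESQ average, POLQA average, PESQ maximum, POLQA maximum)
TABLE_6 = {
    "American English": (4.04, 4.06, 4.17, 4.17),
    "German": (4.04, 4.07, 4.20, 4.26),
    "Spanish": (3.98, 4.03, 4.21, 4.31),
    "Greek": (3.98, 3.96, 4.13, 4.17),
    "Russian": (3.87, 3.87, 4.09, 4.09),
    "Hungarian": (3.93, 4.17, 4.14, 4.39),
    "Arabic": (3.84, 3.85, 4.02, 4.06),
    "Polish": (3.82, 3.95, 4.09, 4.32),
    "Japanese": (3.82, 3.96, 3.99, 4.10),
}
COLUMNS = ("PESQ average", "POLQA average", "PESQ maximum", "POLQA maximum")


def spread(column_index):
    """Return the range (max minus min) of one column across all samples."""
    values = [row[column_index] for row in TABLE_6.values()]
    return max(values) - min(values)


spreads = {name: spread(i) for i, name in enumerate(COLUMNS)}
for name, value in spreads.items():
    print(f"{name}: spread of {value:.2f} MOS across the nine samples")
```

The resulting spreads are 0.22 MOS for both P.862.1 columns and 0.32 and 0.33 MOS for the P.863 average and maximum, matching the ranges of >0.2 and >0.3 MOS stated above.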

Results in real field networks in super-wideband mode of P.863 ‘POLQA’


The term ‘super-wideband mode’ just describes the application of P.863 ‘POLQA’. It is independent from the
real presence of a wideband or a super-wideband channel. It just means that the inserted reference signal is
super-wideband and the recorded signal is compared to this super-wideband reference signal.
Technically speaking, pure narrowband networks can be analyzed with P.863 ‘POLQA’ in super-wideband
mode as well. However, the resulting scores are considerably lower compared to an analysis in NB mode. As
can be seen in Table 3, the transparent narrowband signal is scored around MOS = 3.5 in the SWB mode
of P.863 ‘POLQA’, and AMR at 12.2kbps in the range of 3.2. This decrease in quality is caused by the limited
audio bandwidth that counts as degradation compared to an (expected) super-wideband signal.
Comparing the two modes, we can see that the achievable quality on a narrowband mobile channel is about
0.9 MOS lower in SWB mode than in NB mode. Table 7 gives some example results that
were obtained in the same drive test campaign, conducting alternately an NB test and an SWB test with P.863
‘POLQA’. The NB test case is the same as discussed and presented above.


Table 7: Comparison of the NB and SWB mode of P.863 ‘POLQA’ in common real field setups

                                  P.863 ‘POLQA’
                                  Average           Maximum           80% Percentile
Setup                             NB      SWB       NB      SWB       NB      SWB
Network 1, UMTS DL, Device A      4.06    3.03      4.17    3.25      4.13    3.17

The achievable quality scores of 3.17 (80% percentile) or 3.25 (absolute maximum in the collection) fit the
simulated value of 3.2 given in Table 3 quite well (which itself is an average over a set of different speech
samples).
In a more detailed analysis the distributions of the two test cases are compared in Figure 23. It can be seen
that the typical shape of MOS distribution is shifted and compressed towards the lower scale end.

[Two bar charts. Upper graph: Listening Quality distribution (P.863-NB ‘POLQA’), PDF. Lower graph: Listening Quality distribution (P.863-SWB ‘POLQA’), PDF. Horizontal axis in both: Listening Quality in 0.1-wide MOS bins from 1.0 to 4.9; vertical axis in both: PDF, number of values, 0% to 50%.]

Figure 23: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode (lower graph) and NB mode (upper
graph) using Device A, UMTS, Downlink as in Table 4

It has to be considered that the reporting of SWB scores will not match the expected figures obtained with
NB in the past. Both values and analysis types must not be mixed or compared to each other. However, it
will just be a question of time until the market deals with quality scores obtained in super-wideband mode
and has adapted to the lower range of quality achievable in narrowband connections.
The most important point for using P.863 ‘POLQA’ in SWB is of course the evaluation of wideband channels
and networks, both for comparison to each other and to traditional narrowband systems.
Today in 2011, only a few mobile networks are equipped with wideband transmission capabilities, and these are
often restricted to mobile-to-mobile connections in transcoding-free operational mode. On the other hand,
VoIP services have been using wideband and even super-wideband transmission already for a long time and
with P.863 ‘POLQA’ there is now an appropriate objective measure of speech quality for these systems.
The following example for wideband transmission is based on a collection of speech samples obtained in a
mobile network that was equipped with AMR-WB at most locations.

[Bar chart: Listening Quality distribution (P.863-SWB ‘POLQA’), PDF. Horizontal axis: Listening Quality in 0.1-wide MOS bins from 1.0 to 4.9; vertical axis: PDF, number of values, 0% to 30%.]

Figure 24: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode in wideband capable networks
It can clearly be observed that the majority of the quality scores are in a range from 3.7 to 3.9, which
represents the achievable quality for AMR-WB at 12.65kbps (as shown in Table 3). The lower scores are
partially caused by transmission errors and a few narrowband connections in this selection. AMR-NB at
12.2kbps will result in a predicted MOS of 3.2 or lower in this dataset.

6 Conclusion
With P.863 ‘POLQA’ a new measure for objective speech quality assessment has been standardized,
serving today’s and future demands of voice quality testing. P.863 ‘POLQA’ is embedded in SwissQual’s
strong SQuad application framework which computes a powerful set of additional information characterizing
the analyzed speech signal.
The actual implementation of P.863 ‘POLQA’ was thoroughly speed-optimized for resource-saving
calculation in a fraction of real time on SwissQual’s platform. The SQuad framework that provides P.863
‘POLQA’ can also compute P.862 ‘PESQ’ scores in parallel to P.863 ‘POLQA’ in NB mode as an option for
customers who are interested in a direct comparison between the two measures.
Of course, the SQuad measurement suite can also be equipped with SQuad08 as a MOS predictor.
SQuad08 is technically compatible with P.863 and serves the same applications as P.863 ‘POLQA’.
