Operating Instructions
No part of the handbook or application may be reproduced in any form (print, photocopy, microfilm or any
other way), or manipulated, duplicated or distributed electronically without our prior written permission. We
hereby point out that all descriptions and brand names used in this book are generally subject to brand mark,
trademark and patent protection of the respective companies.
Date: 11/2007
Software-Version: 1.0.1.1
INDEX
1 Introduction
2 Compensation
2.1. IRS Receive Filtering
2.2. Calculation of the Active Speech Time Interval
2.3. Time-Frequency Decomposition, Time Axis Modification
2.4. Calculation of the Pitch Power Densities
2.5. Compensation of Linear Frequency Response
2.6. Compensation of the Time Varying Gain
2.7. Calculation of the Loudness Densities
2.8. Calculation of the Disturbance Density
2.9. Modelling of the Asymmetrical Effect
2.10. Aggregation of the Disturbance Densities over Frequency and Silent Interval Processing
2.11. Realignment of Bad Intervals
2.12. Aggregation of the Disturbances over Time
2.13. Computation of the PESQ Score
6 Appendix
A - Index
1 Introduction
Perceptual Evaluation of Speech Quality (PESQ) is a method for the objective evaluation of speech quality in
telephony in the frequency range 300-3400 Hz. PESQ is described in ITU-T Recommendation P.862 and is
based on real conditions for end-to-end speech communication. The factors taken into account include
packet loss, noise and the audio codec used.
PESQ returns an evaluation of speech quality in the range -0.5 to 4.5. Values near -0.5 mean very bad
speech quality, while values near 4.5 mean very good speech quality. In most cases the returned values lie
between 1 and 4.5. This may seem surprising at first, since the ITU scale for MOS extends up to 5, but the
explanation is simple. PESQ simulates a listening test and is optimised to reproduce the average
result of all listeners. Statistics show that the best average result one can generally expect from a
listening test is not 5 but around 4.5. Listeners appear to be cautious about giving a sample a rating of 5,
even if there is no degradation.
As shown in Diagram 1, a reference signal and a degraded signal are input into the PESQ analysis system.
Both signals must be *.wav files with a sampling rate of 8 kHz and a resolution of 16 bit linear PCM. The
PESQ system performs a substantial number of comparisons and calculations, which is why
calculating the results can take some time. The result delivered is the PESQ value described above,
together with a number of additional parameters which characterise speech quality.
Many of the stages in the PESQ algorithm are extremely complex, so a description of the process is
not straightforward. For this reason the algorithm is described only briefly. Please refer to ITU-T
Recommendation P.862 for more detailed information.
2.1 IRS Receive Filtering
It is assumed that listening is carried out using a handset with a frequency response that follows an IRS
receive or a modified IRS receive characteristic. A perceptual model of the human evaluation of speech
quality must take account of this in the signals it actually processes. Therefore IRS-like receive filtered
versions of the original speech signal and the degraded speech signal are computed. In PESQ, this is
implemented by an FFT over the entire length of the speech file, resulting in the filtered signals X_IRSS(t) and
Y_IRSS(t).
2.2 Calculation of the Active Speech Time Interval
If the original and degraded speech files start or end with long silent intervals, this could influence the
computation of certain average distortion values over the files. Therefore, an estimate is made of the silent
parts at the beginning and end of these files. Scanning from the beginning and from the end of the original
speech file, the first position at which the sum of five successive absolute sample values exceeds 500 is
taken as the start or end of the active interval. The interval between this start and end is defined as the
active speech time interval.
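
The start and end search described above can be sketched as follows. This is a minimal illustration rather than the code used by the software; the function name and the choice to report the end as an exclusive index are assumptions.

```python
def active_interval(samples, window=5, threshold=500):
    """Return (start, end) of the active speech interval, or None.

    start is the first index at which the sum of `window` successive
    absolute sample values exceeds `threshold`. end (exclusive) is found
    the same way while scanning backwards from the end of the file.
    """
    n = len(samples)
    start = None
    for i in range(n - window + 1):
        if sum(abs(s) for s in samples[i:i + window]) > threshold:
            start = i
            break
    if start is None:
        return None  # the file never exceeds the threshold

    end = None
    for j in range(n, window - 1, -1):
        if sum(abs(s) for s in samples[j - window:j]) > threshold:
            end = j
            break
    return start, end
```

With 16 bit samples, a threshold of 500 over five samples corresponds to a very low level, so only near-silent leading and trailing parts are excluded.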
2.3 Time-Frequency Decomposition, Time Axis Modification
The human ear performs a time-frequency transformation. In PESQ this is modelled by a short-term FFT with
a Hann (Hanning) window over 32 ms frames. Successive frames overlap by 50%. The power spectra
(the sum of the squared real and imaginary parts of the complex FFT components) of both signals are
stored separately in real-valued arrays. Phase information within a single frame is discarded in PESQ. All
calculations are based only on the power representations PX_WIRSS(f)_n and PY_WIRSS(f)_n. The start points of
the frames in the degraded speech signal are shifted by the estimated delay. The time axis of the original speech
signal is left unchanged. If the delay increases, parts of the degraded speech signal are omitted from the
processing, whilst for decreases in the delay parts of the degraded signal are repeated. This time-axis
modification gave the best results in terms of correlation with the subjectively perceived speech quality.
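
The framing described above can be illustrated with a short sketch. At a sampling rate of 8 kHz a 32 ms frame holds 256 samples, and a 50% overlap means a hop of 128 samples. The code below is an assumption-laden illustration: it uses a naive DFT so that it needs only the standard library, the function names are hypothetical, and the delay-dependent shifting of degraded frames is left out.

```python
import cmath
import math

SAMPLE_RATE = 8000        # Hz, as required for the input files
FRAME_LEN = 256           # 32 ms at 8 kHz
HOP = FRAME_LEN // 2      # 50% overlap between successive frames


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def power_spectrum(frame):
    """Power (real^2 + imag^2) of DFT bins 0..N/2; phase is discarded."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        powers.append(acc.real ** 2 + acc.imag ** 2)
    return powers


def frame_power_spectra(signal):
    """Split a signal into windowed, overlapping frames and return
    one power spectrum per frame."""
    window = hann(FRAME_LEN)
    spectra = []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        frame = [signal[start + k] * window[k] for k in range(FRAME_LEN)]
        spectra.append(power_spectrum(frame))
    return spectra
```

A real implementation would use an FFT routine instead of the naive DFT; the result per bin is the same.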
2.4 Calculation of the Pitch Power Densities
The Bark scale reflects the fact that human hearing has a finer frequency resolution at low frequencies than
at high frequencies. This is implemented by binning the FFT bands: the powers of the FFT bands that fall into
each Bark band are summed and the sum is normalised. The resulting signals are known as the “pitch power
densities”, PPX_WIRSS(f)_n and PPY_WIRSS(f)_n.
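
The binning step can be sketched as below. The band edges in the usage example are purely illustrative (narrow bands at low frequencies, wider ones higher up) and are not the band table of P.862; dividing by the number of summed bins is likewise an assumed form of the normalisation.

```python
def pitch_power_density(power_spectrum, band_edges):
    """Sum FFT power bins into wider bands.

    band_edges holds FFT bin indices; band i covers the bins
    band_edges[i] .. band_edges[i + 1] - 1. Each sum is divided by the
    number of bins in the band (assumed normalisation).
    """
    density = []
    for i in range(len(band_edges) - 1):
        lo, hi = band_edges[i], band_edges[i + 1]
        density.append(sum(power_spectrum[lo:hi]) / (hi - lo))
    return density


# Hypothetical edges: one-bin bands at low frequencies, wider bands above.
example = pitch_power_density([0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 4, 8])
```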
2.5 Compensation of Linear Frequency Response
To deal with filtering in the system under test, the power spectra of the original and degraded speech signals
are averaged over time. This average is calculated over speech-active frames only, using time-frequency
cells whose power is more than 30 dB above the absolute hearing threshold.
For each modified Bark bin, a partial compensation factor is calculated from the ratio of the degraded spectrum to
the original spectrum. The compensation is limited to a maximum of 20 dB. The original pitch power density,
PPX_WIRSS(f)_n, of every frame n is then multiplied by this partial compensation factor to equalise the original
to the degraded signal. This results in a filtered version of the original pitch power density, PPX'_WIRSS(f)_n.
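
A minimal sketch of the partial compensation follows. It assumes the 20 dB limit applies in both directions (a power ratio between 0.01 and 100); the small `floor` constant only guards against division by zero and is not a value from this manual.

```python
def compensation_factors(avg_original, avg_degraded, max_db=20.0, floor=1e-12):
    """Per-band ratio of degraded to original average power,
    limited to +/- max_db."""
    limit = 10.0 ** (max_db / 10.0)   # 20 dB as a power ratio = 100
    factors = []
    for x, y in zip(avg_original, avg_degraded):
        ratio = (y + floor) / (x + floor)
        factors.append(min(max(ratio, 1.0 / limit), limit))
    return factors


def equalise(frame_density, factors):
    """Multiply the original pitch power density of one frame by the
    per-band compensation factors."""
    return [p * f for p, f in zip(frame_density, factors)]
```

Limiting the factor means that only moderate linear filtering is equalised; severe filtering still shows up as a disturbance.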
2.6 Compensation of the Time Varying Gain
Short-term gain variations are partially compensated by processing the pitch power densities frame by frame. For
the original and the degraded pitch power densities, the sum in each frame of all values that exceed the
absolute hearing threshold is computed. The ratio of the power in the original to the power in the degraded
speech file is calculated and bounded to the range 3x10^-4 to 5. A first-order low-pass filter along the time
axis, with a time constant of approximately 16 ms, is applied to this ratio. The distorted pitch power density
in each frame is then multiplied by this ratio, resulting in the partially gain-compensated distorted pitch
power density, PPY'_WIRSS(f)_n.
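
The gain compensation can be sketched as a bounded ratio followed by a one-pole smoothing filter. The smoothing coefficient `alpha` and the `offset` that avoids division by zero are assumptions chosen for illustration; the manual only gives the bounds and the approximate time constant.

```python
def gain_ratios(orig_power, deg_power, alpha=0.8, offset=1.0):
    """Per-frame gain ratio (original / degraded), bounded and smoothed.

    orig_power and deg_power hold, per frame, the summed power of all
    values above the hearing threshold. alpha weights the newest frame
    in the first-order low-pass filter (assumed value).
    """
    ratios = []
    smoothed = 1.0
    for x, y in zip(orig_power, deg_power):
        raw = (x + offset) / (y + offset)
        raw = min(max(raw, 3e-4), 5.0)        # bounds given in the text
        smoothed = (1.0 - alpha) * smoothed + alpha * raw
        ratios.append(smoothed)
    return ratios
```

Each frame of the degraded pitch power density is then multiplied by the ratio for that frame.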
2.7 Calculation of the Loudness Densities
After partial compensation for filtering and short-term gain variations, the original and degraded pitch power
densities are transformed to a Sone loudness scale using Zwicker's law. The resulting two-dimensional
arrays, LX(f)_n and LY(f)_n, are called loudness densities.
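
The transformation can be sketched with the form of Zwicker's law that is usually quoted for P.862. The exponent of 0.23, the scaling factor and the clipping of negative results to zero are not stated in this manual and should be read as assumptions; `threshold_power` stands for the absolute hearing threshold per band.

```python
def loudness_density(pitch_power, threshold_power, scale=1.0, gamma=0.23):
    """Map pitch power densities of one frame to loudness densities.

    A band whose power equals the hearing threshold maps to zero
    loudness; loudness then grows compressively with power.
    """
    loudness = []
    for p, p0 in zip(pitch_power, threshold_power):
        value = scale * (p0 / 0.5) ** gamma * ((0.5 + 0.5 * p / p0) ** gamma - 1.0)
        loudness.append(max(value, 0.0))   # assumed: no negative loudness
    return loudness
```

Because of the compressive exponent, a tenfold increase in power produces far less than a tenfold increase in loudness.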
2.8 Calculation of the Disturbance Density
The signed difference between the distorted and original loudness densities is computed. When this
difference is positive, components such as noise have been added to the original signal. When this
difference is negative, components have been omitted from the original speech signal. The difference array
is called the raw disturbance density. A masking threshold derived from the loudness densities in each
time-frequency cell is then applied, with the net effect that the raw differences are pulled towards zero. This
represents a dead zone before an actual time-frequency cell is perceived as being distorted. It models the
fact that small differences are inaudible in the presence of loud signals in each time-frequency cell. The
result is a disturbance density, D(f)_n, as a function of time and frequency.
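
The dead zone can be sketched as a centre-clipping operation. The mask factor of 0.25 applied to the smaller of the two loudness values is taken from the description in P.862 rather than from this manual, and the function name is hypothetical.

```python
def disturbance_density(loud_orig, loud_deg, mask_factor=0.25):
    """Signed loudness difference per cell, pulled towards zero by a mask."""
    result = []
    for lx, ly in zip(loud_orig, loud_deg):
        raw = ly - lx                       # positive: something was added
        mask = mask_factor * min(lx, ly)    # dead zone for this cell
        if raw > mask:
            result.append(raw - mask)
        elif raw < -mask:
            result.append(raw + mask)
        else:
            result.append(0.0)              # difference is inaudible
    return result
```

In the usage below, the second cell differs by 0.5 but falls inside the dead zone of 1.0 and therefore contributes nothing.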
2.9 Modelling of the Asymmetrical Effect
The asymmetrical effect arises because a codec that distorts the input signal can rarely introduce a new
time-frequency component that integrates with the input signal. The resulting output signal is therefore
decomposed into two different percepts, the input signal and the distortion, which leads to clearly audible
distortion. When the codec leaves out a time-frequency component, the resulting signal cannot be
decomposed in the same way and the distortion is less objectionable. This effect is modelled by calculating
an asymmetrical disturbance density DA(f)_n per frame by multiplying the disturbance density D(f)_n by an
asymmetry factor. The asymmetry factor equals the ratio of the distorted to the original pitch power density,
raised to the power of 1.2. If the factor is less than 3, it is set to 0. If it exceeds 12, it is clipped at this
value.
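
The asymmetry factor can be sketched directly from the three rules in the text. The `offset` added to both powers is an assumption that prevents division by zero for empty cells; the manual does not mention it.

```python
def asymmetry_factor(deg_power, orig_power, offset=50.0):
    """Ratio of distorted to original pitch power density raised to 1.2,
    set to 0 below 3 and clipped at 12."""
    h = ((deg_power + offset) / (orig_power + offset)) ** 1.2
    if h < 3.0:
        return 0.0
    return min(h, 12.0)


def asymmetrical_disturbance(disturbance, deg_power, orig_power):
    """DA for one time-frequency cell: D multiplied by the factor."""
    return disturbance * asymmetry_factor(deg_power, orig_power)
```

The lower limit of 3 means that only cells in which the degraded signal carries clearly more power than the original contribute to the asymmetrical disturbance.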
2.10 Aggregation of the Disturbance Densities over Frequency and Silent Interval Processing
The disturbance density, D(f)_n, and the asymmetrical disturbance density, DA(f)_n, are summed along the
frequency axis using two different Lp norms and a weighting on soft frames (frames with low loudness). After
this weighting, the frame disturbance values are limited to a maximum of 45. If the distorted signal contains a
decrease in the delay larger than 16 ms, the repeat strategy from the time-frequency decomposition is
re-applied. The resulting frame disturbances are called D'_n and DA'_n.
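
The aggregation over frequency can be sketched with a generic Lp norm. The manual does not state which exponents are used for the two densities or how the soft-frame weight is derived, so `p`, `band_widths` and `frame_weight` are left as parameters and are assumptions of this illustration.

```python
def frame_disturbance(density, band_widths, p, frame_weight=1.0, limit=45.0):
    """Aggregate one frame of a disturbance density over frequency.

    Each band is weighted by its width, the values are combined with an
    Lp norm, multiplied by a per-frame weight and limited to `limit`.
    """
    total = sum((abs(d) * w) ** p for d, w in zip(density, band_widths))
    value = frame_weight * total ** (1.0 / p)
    return min(value, limit)
```

A higher exponent p emphasises the bands with the largest disturbance, while p = 1 is a plain weighted sum.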
2.11 Realignment of Bad Intervals
Consecutive frames with a frame disturbance above a threshold are called bad intervals. In a minority of
cases, the objective measure predicts large distortions over a minimum number of bad frames because the
preprocessing observed incorrect time delays. In this case a new delay value is estimated by locating
the maximum of the cross-correlation between the absolute original speech signal and the absolute
degraded speech signal, pre-compensated with the delays observed by the preprocessing.
When the maximum value is below a certain threshold, it is concluded that the interval has been
compensated, and the interval is no longer considered “bad”. If the maximum value is not below the
threshold, the process is repeated. The results are the final frame disturbances, D''_n and DA''_n, which are
used to calculate the perceived overall speech quality.
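
The delay search for a bad interval can be sketched as a brute-force cross-correlation of the rectified signals. The search range `max_lag` and the function name are assumptions; the thresholds used by the software are not given in this manual and are therefore not modelled.

```python
def best_delay(original, degraded, max_lag):
    """Return (lag, value) at the maximum of the cross-correlation
    between the absolute original and absolute degraded signals.

    A positive lag means the degraded signal is delayed.
    """
    a = [abs(x) for x in original]
    b = [abs(y) for y in degraded]
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                value += x * b[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag, best_value
```

The returned maximum is the quantity that the text compares against a threshold to decide whether the interval is still considered bad.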
2.12 Aggregation of the Disturbances over Time
First the frame disturbances are aggregated over split-second intervals. Next the split-second intervals are
aggregated over the complete active time interval. For the split-second aggregation, the frame
disturbance values and the asymmetrical frame disturbance values are aggregated over 20 frames using an
L6 norm. These split-second intervals also overlap by 50% and no window function is used. The split-second
values are then aggregated over the active interval of the speech file using an L2 norm.
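
The two-stage aggregation can be sketched as follows. The Lp norm is taken here as the p-th root of the mean of the p-th powers, which is an assumption about the exact normalisation; the input is expected to be a non-empty list of frame disturbances from the active interval.

```python
def lp_norm(values, p):
    """p-th root of the mean of |v|^p (assumed normalisation)."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


def aggregate_over_time(frame_disturbances, interval=20, p_short=6, p_long=2):
    """L6 norm over 20-frame intervals with 50% overlap and no window,
    followed by an L2 norm over all intervals."""
    hop = interval // 2
    last_start = max(len(frame_disturbances) - interval, 0)
    short_terms = [
        lp_norm(frame_disturbances[s:s + interval], p_short)
        for s in range(0, last_start + 1, hop)
    ]
    return lp_norm(short_terms, p_long)
```

The high exponent in the first stage lets a single badly disturbed frame dominate its interval, which reflects the way short bursts of distortion affect perceived quality more than their average level suggests.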
2.13 Computation of the PESQ Score
The final PESQ score is a linear combination of the average disturbance value and the average
asymmetrical disturbance value. This linear combination was optimised on a large set of subjective
experiments.
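
The linear combination can be written out as a one-line function. The coefficients below are those published in ITU-T P.862; they do not appear in this manual, and the function name is hypothetical.

```python
def pesq_score(avg_disturbance, avg_asym_disturbance):
    """Raw PESQ score from the two aggregated disturbance values,
    using the coefficients given in ITU-T P.862."""
    return 4.5 - 0.1 * avg_disturbance - 0.0309 * avg_asym_disturbance
```

With both disturbances at zero the score is 4.5, which matches the upper end of the range quoted in the introduction.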
3 Speech Samples
A speech sample conforming to ITU-T P.862 (PESQ) must be sent. According to the
recommendation, the speech sample should consist of simple, short and meaningful sentences and should
be selected so as to be easily understood. The sentences should also be divided into two or three sections
with silent time intervals. These sections should be created in such a way that there are no associations
between the individual sentences within a section. Avoid very short and very long sentences. The aim is for
each spoken sentence to have a duration of 2-3 seconds. The speech samples should be 8-12 seconds in
length. Between 40 and 80% of the time should be speech. If long time intervals are to be tested, it is
reasonable to create several separate recordings of around 8-20 seconds.
When recording speech samples, the speakers should be in a room with a reverberation time of less than
500 ms and a room noise level below 30 dBA. High quality recording systems should be used for the
recordings. The speakers should articulate the sentences fluently but not dramatically and maintain a
constant, comfortable speech level, avoiding noises such as paper rustling. During the recording, the active
speech level should be monitored and should be between 20 dB and 30 dB. Every sentence which does not
attain this level should be re-recorded.
The speech sample should be processed such that it is suitable for the “System Under Test“ (SUT). Further
distortion by unnecessary quantisation, amplitude limitation or renewed sampling is to be avoided. The
preferred format for the saving of the original speech sample and the degraded speech sample is a sampling
rate of 8 kHz, 16 bit linear PCM.
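
Whether a file matches the preferred format can be checked before it is loaded. The sketch below uses the `wave` module from the Python standard library; the function name and the wording of the messages are assumptions, and the check does not replace the validation performed by the software itself.

```python
import wave


def check_speech_file(path):
    """Return (problems, duration_in_seconds) for a WAV file.

    problems is an empty list if the file is 8 kHz, 16 bit linear PCM.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != 8000:
            problems.append("sampling rate is %d Hz, expected 8000 Hz" % rate)
        if wav.getsampwidth() != 2:
            problems.append("sample width is not 16 bit")
        if wav.getcomptype() != "NONE":
            problems.append("file is not linear PCM")
        duration = wav.getnframes() / rate
    return problems, duration
```

The returned duration can be compared with the recommended sample length of 8-12 seconds.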
Several pre-recorded speech samples in *.wav format are included with this software and may be used as
references when sending and comparing.
Minimum:
• 1GHz PC
• CD-ROM drive
• Windows 2000
[Figure: VQ108 Scope user interface, with numbered callouts [1] to [10] referred to in the text below]
The VQ108 Scope software for speech quality analysis has a clearly laid out, intuitive and user-friendly
dialogue based graphical user interface from which all functions can be called.
The reference signals and degraded signals described above (wave files, 8 kHz sampling rate, 16 bit linear
PCM) are read into the system in the upper left-hand area of the user interface [1]. Select “Open” to read in
speech signal files. This opens the default Windows dialogue, from which you can select the files. Once a
valid wave file has been loaded, the speech sample is plotted over time in the window in the lower half of the
user interface. The reference signal is shown in blue, the degraded signal in red. When selecting the two files
(reference file and degraded file), ensure that they actually belong together (i.e. the reference file was
sent and the degraded file was received at the end of the measuring section).
Diagram 3 shows an error message which is displayed if the two files read in do not belong together or the
transmission quality is so bad that no correlation at all can be detected.
Select the “Listen“ button to output the signals to the sound card so as to subjectively assess the speech
quality.
The loaded speech samples are shown graphically in the lower part of the screen [2]. The signals can be
compared with the naked eye (see Delays, Distortions and Level differences).
Signals may be displayed as “linear”, as “level” or as a “spectrum”. Switch between the three by changing
the mode on the right of the display area [3]. Use [4] to change the resolution of the time axis (X-axis). With
the resolution set to 10s, the signals are usually visible in their entirety, but the display is imprecise
because not every sample can be shown. With smaller resolutions (5s, 1s and 0.1s) the signals are not visible in
their entirety on one screen, but the display is more accurate.
The reference and degraded signals can be turned on and off by selecting the boxes in [5]. This is useful
when just one of the signals is to be analysed in detail without being hindered by the other.
PESQ Calculation
Once the reference and degraded signals have been read in, the automatic, objective speech quality
analysis can be started by pressing the “Calculate PESQ“ button [6]. This may take some time as
comprehensive and complex calculations are involved.
Once the calculation is complete, the most important value, the “PESQ value”, is displayed in large, bold
numbers [7]. The colour of the number depends on the result: good results are displayed in green,
medium results in yellow and bad results in red, so that the user gets an immediate indication
of the speech quality.
All calculable parameters are displayed in detail in the results window [8]. Select “Show Advanced Results“
[9] to show these values all together in a separate window (Diagram 4).
Values derived from PESQ utterance detector after IRS filter and level alignment:
Snr Speech - SNR between speech
Snr Silent - SNR between silence
Language Selection
The user can select the language used for the interface by pressing the “Select Language” button (see
Diagram 5).
This button is not available if VQ108 Scope was started from the VQ108 Analyser analysis software. The
user interface of the VQ108 Scope software then adjusts automatically to the language selected in VQ108
Analyser.
Press the “Help” button if you require further information on the calculation of PESQ parameters or help on
the VQ108 Scope software.
- Windows XP
- 256 MB RAM
Warning: The warranty expires if the device is opened or used in any way other than described!
Read the manual before using the device.
The measurement system allows an end-to-end speech quality analysis to be carried out in voice
transmission systems. For this purpose, a standardised speech sample is fed into the end device. The incoming
signal is intercepted and saved in the measurement system. The received speech sample is then compared with
the sent one and an automatic speech quality analysis according to PESQ (ITU-T P.862) is started. The result
reflects a placement on the MOS scale from 1 to 5 (1 - bad, 2 - poor, 3 - fair, 4 - good, 5 - excellent).
Note
- The “Speech Quality Test Box” (SQ2500) includes a high quality USB-Audio-Adapter. Never
use or install another generic USB-Audio-Adapter.
- Never install generic device drivers for the USB-Audio-Adapter.
- No other running application should use the sound card during a measurement. If possible, close
all other running applications.
- The PESQ values given may be inaccurate by up to 0.05.
- Because the box itself causes a slight loss of quality, an additional 0.01 per direction is added to the
PESQ value shown.
[1] Power/Busy LED: Steady green while powered on; flashes while sending and receiving
signals.
[2] USB-Connector: Connection between the box and the PC. Windows XP recognises the SQ2500
automatically; never install device drivers.
[3] Handset Connector: Connection between the SQ2500 and the handset connector of the
end device. There are two jacks with different wiring:
On the left: TX/RX/RX/TX (Standard)
On the right: TX/RX/TX/RX (D2500)
Use the jack that matches the handset connector of your end device.
For the D2500/D2000Pro by Aethra, use the jack on the right side.
To make an active measurement with the SQ2500, first start the VQ108 Scope software. In
the main user interface, press the “SQ2500 Test” button at the bottom left. This opens a dialog as shown in
figure 5-3.
[Figure 5-3: SQ2500 test dialog, with numbered callouts [2] to [7]]
5.4.1 Configuration
Please choose under „Config->Audio“ the Mode „Extern 200..3400Hz“ for all measurements with the
SQ2500-Box connected to the D2500/D2000Pro.
a1) Measurement with the SQ2500 and Loopbox-Function
After the start, the status of the running measurements is shown in the result field. Once the
entered number of measurements has been completed, the results of each measurement, the number of valid
and invalid measurements, the minimum and maximum values and the average are given in the same field.
In addition, the average shown in the PESQ field is coloured: green indicates a good average, yellow
a medium average and red a bad average.
Select “Remote station receives reference-signal and sends it back” if both directions are
to be part of the measurement. The “Master” sends the reference speech
sample. The “Slave” saves it and sends it back when the “Master” asks for it.
After the measurement, all results are shown in the result field as described earlier (figure 5-4).
The results of a successful measurement can be printed using the “Print results” function or saved to a text
file using “Print in File”.
A
Active Measurement
B
Bad Intervals
C
codec
compensation factor
D
deadzone
Disturbance Density
duration
E
evaluation
F
frequency
G
Graphical output of signals
H
Handset Connector
I
invalid measurements
L
Language Selection
N
number of measurements
P
PESQ
Print in File
R
Remote-System
S
speech quality
Speech Quality Test Box
Speech Samples
speech signal
System requirements
U
USB-Connector
User Interface
utterance detector
V
VQ-Receiver
W
warranty
Z
Zwicker's law