Faculty of Engineering
Electrical Engineering Department
Communications & Electronics Section
We believe that good seeds, even if planted in fertile soil, won't grow without care and attention.
We also believe that plants won't grow overnight, and that in order to enjoy what we have planted, we have to be patient until they are fully grown. And when they are fully grown and beautiful, we have to protect them by watering them continuously, never leaving them until they are covered by dust, which destroys their shine and beauty and leads them to wither and die.
We are not agricultural engineers, nor is our project concerned with agriculture; we simply wanted to take a moment to express our deep gratitude, thanks and appreciation to our dear Dr. Noha O. Korany, who supported and helped us greatly in achieving what we have achieved in this project. She was always encouraging us to discover our own capabilities; she never obliged us to do something we did not want to do; she put us on the right track and drove us to a real start of a professional life in the near future.
For us, the project was not just an ordinary college task to be submitted by a deadline; we awaited every weekly meeting with great enthusiasm to discuss general actualities and events with an extremely open-minded, highly committed and respectful person like our dear Dr. Noha O. Korany.
Now we guess that you have understood the analogy of the plants we used at the beginning.
We were the seeds on their way to grow, the project was the fertile soil, and those seeds would not have grown properly without the care and attention of Dr. Noha O. Korany.
Dr. Noha,
Words cannot tell how grateful we are to you
Project’s Participants
Abstract
Topic 1: Audiology
Deals with the hearing process
Page
Topic 1: AUDIOLOGY……………………………………………………… 1
CHAPTER 1: HEARING PROCESS………………………………………… 2
1.1 Structure of the Ear………………………………………………… 2
1.2 How the Ear works………………………………………………… 4
1.3 Computational Models of Ear Functions…………………………... 5
CHAPTER 2: FUNDAMENTAL PROPERTIES OF HEARING……………. 13
2.1 Thresholds…………………………………………………………. 13
2.2 Equal Loudness Level Contours…………………………………… 14
2.3 Critical Bandwidth………………………………………………… 14
2.4 Masking……………………………………………………………. 15
2.5 Beat and Combination Tones……………………………………… 15
CHAPTER 3: MODELS FOR HEARING AIDS…………………………….. 17
3.1 History of Development of Hearing Aids…………………………. 17
3.2 First Model (Digital Hearing Aids for Moderate Hearing Losses)... 18
3.2.1 Features of Real Time Binaural Hearing Aid……………………… 18
3.2.2 Speech Processing Algorithm……………………………………... 18
3.2.2.1 Interaural Time Delay and Timer.………………………………… 18
3.2.2.2 Frequency Shaping.………………………………………………... 19
3.2.2.3 Adaptive Noise Cancellation using LMS.…………………………. 20
3.2.2.4 Amplitude Compression…………………………………………… 25
3.3 Second Model (A Method of Treatment for Sensorineural Hearing Impairment)…………………………………………………………………. 26
3.3.1 The Conceptual Prosthetic System Architecture..…………………. 26
3.3.2 Human Temporal Bone Vibration…………………………………. 28
3.3.3 Design Guideline for Optimum Accelerometer…………………… 31
3.3.4 Conclusion…………………………………………………………. 31
Topic 2: ACOUSTICAL SIMULATION OF ROOM………………………… 32
CHAPTER 1: GEOMETRICAL ACOUSTICS……………………………… 33
1.1 Introduction………………………………………………………... 33
1.2 Sound Behavior……………………………………………………. 35
1.3 Geometrical Room Acoustics……………………………………… 36
1.3.1 The Reflection of Sound Rays……………………………………... 36
1.3.2 Sound Reflections in Rooms………………………………………. 38
1.3.3 Room Reverberation……………………………………………….. 39
1.4 Room Acoustical Parameters & Objective Measures……………... 40
1.4.1 Reverberation Time………………………………………………... 40
1.4.2 Early Decay Time………………………………………………….. 40
1.4.3 Clarity and Definition……………………………………………… 41
1.4.4 Lateral Fraction and Bass Ratio…………………………………… 41
1.4.5 Speech Transmission Index………………………………………... 41
CHAPTER 2: ARTIFICIAL REVERBERATION…………………………… 42
2.1 Introduction………………………………………………………... 42
2.2 Shortcomings of Electronic Reverberators………………………… 42
2.3 Realizing Natural Sounding Artificial Reverberation……………... 43
2.3.1 Comb Filter………………………………………………………… 43
2.3.2 All-pass Filter……………………………………………………… 45
2.3.3 Combined Comb and All-pass Filters……………………………... 47
2.4 Ambiophonic Reverberation………………………………………. 49
CHAPTER 3: SPATIALIZATION…………………………………………… 50
3.1 Introduction………………………………………………………... 50
3.2 Two-Dimensional Amplitude Panning…………………………….. 51
3.2.1 Trigonometric Formulation………………………………………... 52
3.2.2 Vector Base Formulation…………………………………………... 54
3.2.3 Two-Dimensional VBAP for More Than Two Loudspeakers…….. 55
3.2.4 Implementing 2D VBAP for More Than Two Loudspeakers……... 56
Topic 3: NOISE CONTROL………………………………………………… 57
CHAPTER 1: SOUND ABSORPTION……………………………………… 58
1.1 Absorption Coefficient...…………………………………………... 58
1.2 Measurement of Absorption Coefficient of the different materials.. 58
1.2.1 Procedures…………………………………………………………. 59
1.2.2 Laboratory Measurements of Absorption Coefficient…………….. 59
1.3 Sound Absorption by Vibrating or Perforated Boundaries………... 62
CHAPTER 2: SOUND TRANSMISSION…………………………………… 65
2.1 Transmission Coefficient………………………………………….. 65
2.2 Transmission loss………………………………………………….. 65
2.3 Sound Transmission Class STC…………………………………… 65
2.3.1 Determination of STC……………………………………………... 66
2.3.2 Laboratory Measurements of STC………………………………… 67
2.4 Controlling Sound Transmission through Concrete Block Walls…. 69
2.4.1 Single-Leaf Concrete Block Walls………………………………… 69
2.4.2 Double-Leaf Concrete Block Walls……………………………….. 71
2.5 Noise Reduction…………………………………………………… 71
2.5.1 Noise Reduction Determination Method…………………………... 71
2.5.2 The Noise Reduction Determinations of Some Absorbent Materials 72
2.6 The Performance of Some Absorbent Materials…………………… 73
Topic 4: SPEECH TECHNOLOGY………………………………………… 75
CHAPTER 1: SPEECH PRODUCTION…………………………………….. 76
1.1 Introduction…………………………………………………........... 76
1.2 The human vocal apparatus………………………………………... 76
1.2.1 Breathing………………………………………………………....... 77
1.2.2 The larynx………………………………………………………….. 77
1.2.3 The vocal tract……………………………………………………... 79
1.3 Speech sounds……………………………………………………... 79
1.3.1 Phonemic representation…………………………………………... 79
1.3.2 Voiced, unvoiced and plosive sounds……………………………... 80
1.4 Acoustics of speech production……………………………………. 80
1.4.1 Formant frequencies……………………………………………….. 80
1.5 Perception………………………………………………………….. 82
1.5.1 Pitch and loudness…………………………………………………. 82
1.5.2 Loudness perception……………………………………………….. 82
CHAPTER 2: PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN… 83
2.1 Introduction………………………………………………………... 83
2.2 Time-Dependent Processing of Speech……………………………. 83
2.3 Short-Time Average Zero-Crossing Rate………………………….. 84
2.4 Pitch period estimation…………………………………………….. 86
2.4.1 The Autocorrelation Method………………………………………. 86
2.4.2 Average magnitude difference function…………………………… 89
CHAPTER 3: SPEECH REPRESENTATION IN FREQUENCY DOMAIN.. 91
3.1 Introduction………………………………………………………... 91
3.2 Formant analysis of speech………………………………………... 91
3.3 Formant frequency extraction……………………………………... 91
3.3.1 Spectrum scanning and peak-picking method……………………... 92
3.3.2 Spectrum scanning………………………………………………… 92
3.3.3 Peak-Picking Method……………………………………………… 92
CHAPTER 4: SPEECH CODING……………………………………………. 93
4.1 Introduction………………………………………………………... 93
4.2 Overview of speech coding………………………………………... 93
4.3 Classification of speech coding……………………………………. 94
4.4 Linear Predictive Coding (LPC)…………………………………… 97
4.4.1 Basic Principles……………………………………………………. 97
4.4.2 The LPC filter……………………………………………………… 99
4.4.3 Problems in LPC model…………………………………………… 100
4.5 Basic Principles of Linear Predictive Analysis……………………. 101
4.5.1 The autocorrelation method………………………………………... 104
4.5.2 The covariance method……………………………………………. 106
CHAPTER 5: APPLICATIONS……………………………………………… 108
5.1 Speech synthesis…………………………………………………… 108
5.1.1 Formant – frequency Extraction…………………………………… 108
5.1.2 LPC..……………………………..………………………………… 114
5.2 Speaker identification using LPC………………………………….. 119
5.3 Introduction to VOIP………………………………………………. 122
5.3.1 VoIP Standards…………………………………………………….. 122
5.3.2 System architecture……...………………………………………… 123
5.3.3 Coding technique in VOIP systems……………………………….. 124
5.3.4 Introduction to G.727……………………………………………… 125
5.3.5 Introduction to G.729 and G.723.1………………………………… 127
_________________________________________________________________________
Page 1 Topic 1 – Audiology
CHAPTER 1
HEARING PROCESS
Introduction:
- The human ear can respond to frequencies from 20 Hz up to 20 kHz.
- It is more than a sensitive, broadband receiver.
- It acts as a frequency analyzer of impressive selectivity.
- It is one of the most delicate mechanical structures in the human body.
2- The middle ear:
- It houses a chain of three bones (hammer, incus and stapes).
- It contains 3 ossicles (bones); the ear drum is connected to the 1st (the hammer), which communicates with the last (the stapes) through the middle one (the incus).
- These bones are set into motion by the movement of the ear drum.
- It is also connected to the throat via the "Eustachian tube".
- A collection of muscles and ligaments controls the lever ratio of the system. At high sound levels, the muscles controlling the motion of the bones change their tension to reduce the amplitude of motion of the stapes, thereby protecting the inner ear from damage. (N.B.: this offers no protection from sudden impulsive sounds.)
• The vestibule:
- connects with the middle ear through 2 openings, the oval window and the round window (both prevent the fluid from escaping the inner ear).
The Duct:
- Is filled with endolymph (potassium-rich, related to the intracellular fluid throughout the body) and perilymph (sodium-rich, similar to the spinal fluid).
- It also contains membranes, one of which is called the "basilar membrane". (On top of this membrane is the organ of Corti, which contains 4 rows of hair cells.)
1.1.2 How the ear works:
1- When the ear is exposed to a pure tone, sound waves are collected by the outer ear
and funneled through the ear canal to the ear drum.
2- Sound waves cause the ear drum to vibrate.
3- The motion of the ear drum is transmitted and amplified by the 3 bones of the middle ear to the oval window of the inner ear, creating a fluid disturbance that travels in the upper gallery toward the apex, into the lower gallery, and then propagates in the lower gallery to the round window, which acts as a pressure-release termination.
NOTES
The basilar membrane is driven into highly damped motion with a peak amplitude that increases slowly with distance away from the stapes, reaches a maximum, and then diminishes rapidly toward the apex.
1.2 Computational Models for Ear Function
$$\frac{Y_l(s)}{P(s)} = \frac{X(s)}{P(s)} \cdot \frac{Y_l(s)}{X(s)} = G(s)\,F_l(s)$$

Where:
p(t) is the sound pressure at the ear drum,
x(t) is the equivalent linear displacement of the stapes, and
y_l(t) is the linear displacement of the basilar membrane at a distance l from the stapes.
Because the model is an input-output analog, the response of one point does not require explicit computation of the activity at the other points. One therefore has the freedom to calculate the displacement y_l(t) for as many, or as few, values of l as are desired.
If the curves are normalized with respect to the frequency of the maximum
response, one can find that:
[Figure: (A) magnitude |F(s)| and (B) phase angle of F(s) versus normalized frequency w/β, for β = 2π·50, 2π·2500, 2π·5000 and 2π·10000.]
The real and imaginary parts of the critical frequencies can therefore be related by a constant factor (β_l = 2α_l); the imaginary part of the pole frequency, β_l, completely describes the model and the characteristics of the membrane at a place l distant from the stapes.
$$e^{-\beta_l (t-T)/2}\,\sin\beta_l(t-T) + \left[\,0.575 - 0.320\,\beta_l(t-T)\,\right] e^{-\beta_l(t-T)/2}\,\cos\beta_l(t-T) - 0.575\,e^{-\beta_l(t-T)} = 0$$
From this relation we find that as the frequency increases, the delay of the maximum response decreases, as shown in figure 1.5.
[Figure 1.5: inverse transform of F(s) versus time for β = 2π·5000, 2π·10000 and 2π·20000.]
We also note from the equation for F(s) that β_l depends on l, the distance from the stapes at which the membrane responds maximally, according to the equation

$$(35 - l) = 7.5\,\log_{10}\!\left(\frac{\beta_l}{2\pi \cdot 20}\right)$$

So as the frequency increases, l decreases. This is logical, because at higher frequencies the basilar membrane responds quickly and maximally nearer the stapes (as shown in figure 1.6).
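This place-frequency relation is easy to evaluate numerically. The short sketch below (an illustrative Python snippet of ours, not part of the original model code) inverts the equation to give l for a few frequencies, taking β_l = 2πf:

```python
import math

def place_of_max_response(f_hz):
    """Distance l (mm) from the stapes of maximum basilar-membrane response,
    from (35 - l) = 7.5 * log10(beta_l / (2*pi*20)), with beta_l = 2*pi*f."""
    beta = 2 * math.pi * f_hz
    return 35.0 - 7.5 * math.log10(beta / (2 * math.pi * 20))

# Higher frequencies peak closer to the stapes (smaller l):
for f in (50, 500, 5000, 20000):
    print(f, round(place_of_max_response(f), 1))
```

The computed places span roughly 12 mm to 32 mm over the audible range, consistent with the axis range of figure 1.6.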
Figure (1.6) The relation between β_l and l, the distance from the stapes at which the membrane responds maximally
Quantitative physioacoustical data on the operation of the human middle ear are sparse.
All agree that middle-ear transmission is a low-pass function, as shown in figure 1.7.
An Approximating Function of 3rd Degree for Middle Ear Transmission:

$$G(s) = \frac{C_0}{(s+a)\left[(s+a)^2 + b^2\right]}$$

where C_0 is a positive real constant. One might take C_0 = a(a^2 + b^2) so that the low-frequency transmission of G(s) is unity, with the pole frequencies of G(s) related according to b = 2a = 2π(1500) rad/sec.
[Figure 1.7: |G(s)| versus frequency in cycles per second, 0–7000.]
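As a quick numerical check of this choice of constants, the sketch below (an illustrative Python snippet of ours) evaluates |G(jω)| and confirms the unity low-frequency transmission and the low-pass behavior:

```python
import math

# Middle-ear transfer function G(s) = C0 / ((s + a) * ((s + a)^2 + b^2)),
# with b = 2a = 2*pi*1500 rad/s and C0 = a*(a^2 + b^2) so that G(0) = 1.
b = 2 * math.pi * 1500
a = b / 2
C0 = a * (a**2 + b**2)

def G(f_hz):
    s = 1j * 2 * math.pi * f_hz
    return C0 / ((s + a) * ((s + a)**2 + b**2))

print(abs(G(0)))      # unity at DC
print(abs(G(1500)))   # near the pole frequency
print(abs(G(7000)))   # well above it: strongly attenuated (low-pass)
```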
$$g(t) = C_0\,\frac{e^{-at}}{b}\,(1 - \cos bt) = \frac{C_0\,e^{-bt/2}}{b}\,(1 - \cos bt)$$

For this middle-ear function the response is seen to be heavily damped; the response of the middle ear and the velocity of the stapes displacement are both damped, as shown in the next two figures.
[Plot: g(t) versus time.]
$$\dot g(t) = \frac{C_0\,e^{-bt/2}}{2}\,\left(2\sin bt + \cos bt - 1\right)$$
[Plot: stapes velocity versus time.]
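The velocity expression above is just the time derivative of g(t). The sketch below (an illustrative check of ours; C_0 = 1 and b = 2π·1500 are assumed values) compares the closed form against a central-difference derivative:

```python
import math

b = 2 * math.pi * 1500   # pole frequency, b = 2a, as in the text (assumed value)
C0 = 1.0                 # scale constant (illustrative choice)

def g(t):
    """Middle-ear impulse response g(t) = (C0*e^{-bt/2}/b)(1 - cos bt)."""
    return C0 * math.exp(-b * t / 2) / b * (1 - math.cos(b * t))

def g_dot(t):
    """Closed-form velocity (C0*e^{-bt/2}/2)(2 sin bt + cos bt - 1)."""
    return C0 * math.exp(-b * t / 2) / 2 * (2 * math.sin(b * t) + math.cos(b * t) - 1)

# Central-difference check of the derivative:
h = 1e-8
for t in (1e-4, 3e-4, 1e-3):
    numeric = (g(t + h) - g(t - h)) / (2 * h)
    print(t, g_dot(t), numeric)   # the two derivative columns agree
```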
The combined response of G(s) and FL (s) in the frequency domain is simply the sum of
the individual curves for amplitude (in dB) and phase (in radians).
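This is the familiar fact that cascaded transfer functions multiply, so their dB magnitudes and phases (in radians) add. A small numeric illustration, using two hypothetical example sections of ours standing in for G and F_l:

```python
import cmath, math

# For any two cascaded linear systems, |H1*H2| in dB is the sum of the
# individual dB magnitudes, and the phases add as well.
def H1(w):
    return 1 / (1 + 1j * w / 1000)                        # first-order low-pass

def H2(w):
    return 1 / (1 - (w / 5000) ** 2 + 0.1j * w / 5000)    # resonant section

def db(h):
    return 20 * math.log10(abs(h))

w = 2 * math.pi * 300
print(db(H1(w) * H2(w)), db(H1(w)) + db(H2(w)))           # identical
print(cmath.phase(H1(w) * H2(w)),
      cmath.phase(H1(w)) + cmath.phase(H2(w)))            # identical here
```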
When the inverse transform is calculated, the result has the form:
$$h_1(\tau) = A\,e^{-b\tau/2} + B\,e^{-b\tau/2}\left(\cos b\tau - \tfrac{1}{2}\sin b\tau\right) + C\,e^{-b\tau/2}\sin b\tau$$
The form of the impulse response is thus seen to depend upon the parameter η = β_l/b.
On the other hand, values of η > 1.0 refer to basal (high-frequency) points, which respond maximally at frequencies greater than the critical frequency of the middle ear.
For these points, the middle-ear transmission is highly dependent upon frequency and would be expected to influence the membrane displacement strongly (as shown in figure 1.10 (d, e, f)).
[Figure panels: impulse responses h(t) versus time for (a) β = 2π·50, (b) β = 2π·150, (c) β = 2π·1200, (d) β = 2π·3000, (e) β = 2π·5000, (f) β = 2π·10000.]
Figure (1.10) The response of the ear to the different frequency ranges
Note:
The waveform of the impulse response along the basal part of the membrane is therefore approximately constant in shape (as shown in figure (1.4)).
Along the apical part, however, the impulse response oscillates more slowly (in time) as the apex is approached (as shown in figure (1.9)).
CHAPTER 2
FUNDAMENTAL PROPERTIES OF HEARING
The five main properties we are going to talk about are:
1) Thresholds
2) Equal loudness level
3) Critical bandwidth
4) Masking
5) Beats and combinational tones
2.1 Thresholds:
The threshold of audibility is the minimum perceptible level L1 of a tone that can be detected at each frequency over the entire range of the ear. The tone should have a duration of 1 s. A representative threshold of audibility for a young, undamaged ear is shown as the lowest curve in the figure.
Figure (2.1) Threshold of audibility & free field, equal loudness level contour
The frequency of maximum sensitivity is near 4 kHz. For high frequencies the threshold rises rapidly to a cutoff. It is in this higher-frequency region that the greatest variability is observed among different listeners, particularly if they are over 30 years of age. The cutoff frequency for a young person may be as high as 20 kHz or even 25 kHz, but people over 40 or 50 years of age with typical hearing can seldom hear frequencies near or above 15 kHz. In the range below 1 kHz, the threshold is usually independent of the age of the listener.
As the intensity of the incident acoustic wave is increased, the sound grows louder and eventually produces a tickling sensation; this occurs at an intensity level of about 120 dB and is called the threshold of feeling. The tickling sensation becomes one of pain at about 140 dB.
Since the ear responds relatively slowly to loud sound by reducing the lever action of the middle ear, the threshold of audibility shifts upward under exposure; the amount of shift depends on the intensity and the duration of the sound. After the sound is removed
the threshold of hearing will begin to recover. If the ear fully recovers its original threshold, it has experienced a temporary threshold shift (TTS). The amount of time required for a complete recovery increases with increasing intensity and duration of the sound. If the exposure is long enough or the intensity high enough, the recovery of the ear is not complete: the threshold never returns to its original value, and a permanent threshold shift (PTS) has occurred.
It is important to realize that the damage leading to PTS occurs in the inner ear, the hair
cells are damaged.
Also of importance are differential thresholds, one of which is the differential threshold for intensity discrimination. If two tones of almost identical frequency are sounded together, one tone much weaker than the other, the resultant signal is indistinguishable from a single frequency whose amplitude fluctuates slightly and sinusoidally.
The amount of fluctuation that the ear can just barely detect, when converted into the difference in intensity between the stronger and the weaker portions, determines the differential threshold. As might be expected, values depend on frequency, number of beats per second, and intensity level.
Generally, the greatest sensitivity to intensity changes is found for about 3 beats per second; sensitivity decreases at the frequency extremes, particularly for low frequencies, but the effect diminishes with increasing sound level.
For sounds more than 40 dB above threshold, the ear is sensitive to intensity-level fluctuations of less than 2 dB at the frequency extremes and less than about 1 dB between 100 and 1000 Hz.
Other differential thresholds involve the ability to discriminate between two sequential signals of nearly the same frequency.
The change in frequency required to make the discrimination is termed the difference limen.
In early experiments it was assumed that the signal must equal the noise for detection to occur (DT = 0). On this basis, and assuming that the sensitivity of the ear is constant across each bandwidth W_cr, it follows that W_cr = S/N1, where S is the signal power and N1 the noise power per Hz. Bandwidths measured this way are now termed critical ratios.
Later experiments based on the perceived loudness of noise have yielded critical bandwidths W_cb larger than the critical ratios. In some of these experiments, the loudness of a band of noise is observed as a function of bandwidth while the overall noise level is held constant. For noise bandwidths less than the critical bandwidth, the loudness is constant; but when the bandwidth exceeds the critical bandwidth, the loudness increases.
2.4 Masking:
This is the increase of the threshold of audibility in the presence of noise. First consider the masking of one pure tone by another. The subject is exposed to a single tone of fixed frequency and level L1, and is then asked to detect another tone of different frequency and level. Analysis yields the threshold shift, the increase in L1 of the masked tone above its value for the threshold of audibility before it can be detected. Figure 2.2 gives representative results for masking frequencies of 400 and 2000 Hz. The frequency range over which there is appreciable masking increases with the L1 of the masker, the increase being greater for frequencies above that of the masker. This is to be expected, because the region of the basilar membrane excited into appreciable motion at moderate values of L1 extends from the maximum farther toward the stapes than toward the apex.
Figure (2.2) Masking of one pure tone by another (The Abscissa is the frequency of the masked
tone)
As the frequency interval between the two tones increases, the sensation of beating changes to throbbing and then to roughness, which gradually diminishes until the sound
becomes smoother, finally resolving into two separate tones. For frequencies falling in the midrange of hearing, the transition from beats to throbbing occurs at about 5 to 10 beats per second, and this turns into roughness at about 15 to 30 beats per second. These transitions occur at higher beat frequencies as the frequencies of the primary tones are increased.
Transition to separate tones occurs when the frequency interval has increased to about
the critical bandwidth.
None of this occurs if each tone is presented to a different ear. When each ear is exposed to a separate tone, the combined sound does not exhibit intensity fluctuations; this kind of beating is absent. This suggests that the beats arise because the two tones generate overlapping regions of excitation on the basilar membrane, and it is not until these regions become separated by a distance corresponding to the critical bandwidth that they can be separately sensed by the ear. When the tones are presented one in each ear, each basilar membrane is separately excited, and these effects do not occur. If the two tones are separated far enough and are of sufficient loudness, combination tones can be detected. These combination tones are not present in the original signal but are manufactured by the ear. There is a collection of possible combination tones whose frequencies are various sums and differences of the original frequencies F1 and F2:

$$F_{nm} = \left|\,m F_2 \pm n F_1\,\right|, \qquad n, m = 1, 2, 3, \ldots$$
Only a few of these frequencies will be sensed. One of the easiest to detect is the difference frequency |F1 − F2|.
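The combination-tone formula is easy to enumerate. The short sketch below lists the first few combination-tone frequencies for a pair of primaries (f1 = 700 Hz and f2 = 1000 Hz are our illustrative values, not the text's):

```python
# Combination-tone frequencies F_nm = |m*F2 ± n*F1|, n, m = 1, 2, 3, ...
f1, f2 = 700, 1000     # example primary tones (illustrative values)

tones = sorted({abs(m * f2 + s * n * f1)
                for m in range(1, 4)
                for n in range(1, 4)
                for s in (+1, -1)})
print(tones)

# The easiest combination tone to detect is the difference frequency |f1 - f2|:
print(abs(f1 - f2))    # 300
```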
CHAPTER 3
MODELS OF HEARING AIDS
A hearing aid has three basic components:
1- A microphone to gather acoustic energy (sound waves in the air) and convert it to electrical energy.
2- An amplifier to increase the strength of the electrical energy.
3- A receiver, which converts the electrical energy back into acoustic energy (sound waves).
The main advantage of analog hearing aids is accurate sound reproduction with low noise and distortion.
However, they were large and consumed considerable power. To overcome this, ASIC-based designs were developed, but these had the disadvantage of increasing the cost about five times. The programmable-DSP approach then reduced the cost and improved the sound quality, but its disadvantage is that a digital hearing aid amplifies all sounds, even noise.
To overcome this, the real-time binaural digital hearing aid platform (TM3205000) is used.
Now we will discuss the details of the hearing aid for moderate hearing loss.
2) It samples two microphone input signals at a rate of 32 kHz/channel (hearing aid bandwidth = 10 kHz) and drives a stereo headphone output (requires a 1.8 V power source).
3) It can be developed further by reducing the supply voltage to 1 V and reducing the MIPS for the final implementation of the hearing aid.
The Interaural Time Delay:
Function: provides a delay to the signal going to one ear with respect to the signal going to the other ear, on a frequency-selective basis.
Uses: it is provided on the theory that if a person has a differential hearing loss, then in addition to compensating gain for the signals going to the two ears, there must be provision for a compensating internal delay between the signals received by the two ears.
Now, we are going to discuss the main three points of our study.
Once the filter is selected with all the previous characteristics, the therapist shapes the spectral magnitude for the subject's hearing loss by adjusting the gain of each filter.
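Frequency shaping of this kind can be sketched as per-band gains applied in the frequency domain. In the snippet below the band edges and gains are our illustrative assumptions (a real fitting would use the audiologist's prescription), and the DFT is written out naively so the example stays self-contained:

```python
import cmath, math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def shape(x, fs, bands):
    """Apply a per-band gain to x. `bands` is a list of (f_lo, f_hi, gain_db);
    the values used below are illustrative, not a clinical prescription."""
    X = dft(x)
    N = len(x)
    for k in range(N):
        f = min(k, N - k) * fs / N              # analog frequency of bin k
        for f_lo, f_hi, g_db in bands:
            if f_lo <= f < f_hi:
                X[k] *= 10 ** (g_db / 20)
    return idft(X)

fs, N = 8000, 64
x = [math.sin(2 * math.pi * 500 * n / fs) + math.sin(2 * math.pi * 3000 * n / fs)
     for n in range(N)]
# Boost the high band by 20 dB, leave the low band alone:
y = shape(x, fs, [(0, 1000, 0.0), (1000, 4000, 20.0)])
```

After shaping, the 3000 Hz component is 10 times larger while the 500 Hz component is unchanged.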
3.2.2.3 The Noise Cancellation using LMS Algorithm:
This method is called a feedback system, and it depends on two operations, which we now describe. According to the LMS algorithm figure, the tap inputs are initially zero, so nothing comes out of the transversal filter; there is only the error e(1), which obviously equals d(1), the desired signal (the one with no noise). This error is used to adapt to the input signal u(n) by producing tap weights corresponding to the tap inputs of u(n), which produce an estimate of the desired signal. The error is then generated by comparing this estimate with the actual value of the desired signal (which we already know), and it is fed back again to the adaptation for the next sample. This is called the filtering process.
The adjustment of the tap weights by the generated error is called the adaptive process.
So the function of the transversal filter is the filtering process, and the adaptive control of the tap weights is the adaptive process.
That was the basis of the LMS; now let us see how the LMS algorithm really works.
First, we use the least-mean-square equation to show how the tap weights of the adaptive weight-control mechanism are updated:

$$\nabla J(n) = -2\mathbf{p} + 2\mathbf{R}\,\mathbf{w}(n)$$

The simplest choice of estimators for R and p is to use instantaneous estimates based on the sample values of the tap inputs:

$$\hat{\mathbf{R}}(n) = \mathbf{u}(n)\,\mathbf{u}^{H}(n)$$
$$\hat{\mathbf{p}}(n) = \mathbf{u}(n)\,d^{*}(n)$$

and so the first equation becomes

$$\hat{\nabla} J(n) = -2\,\mathbf{u}(n)\,d^{*}(n) + 2\,\mathbf{u}(n)\,\mathbf{u}^{H}(n)\,\hat{\mathbf{w}}(n)$$

Here the estimate $\hat{\nabla} J(n)$ may be viewed as the gradient operator applied to the instantaneous squared error. Substituting this estimate into the steepest-descent equation, we get a new recursive relation for updating the tap-weight vector:
Figure (3.2) The LMS Algorithm
That was just an introduction on how the LMS Algorithm works in general and the
basic block diagram.
interference contained in the primary signal. Thus, by subtracting the adaptive filter output from the primary input, the effect of the sinusoidal interference is diminished.
2. Reference input: u(n) = A cos(ω₀n + φ)
In the LMS algorithm, the tap-weight update is described by the following equations:

$$y(n) = \sum_{i=0}^{M-1} \hat{w}_i(n)\,u(n-i)$$
$$e(n) = d(n) - y(n)$$
$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\left[d^{*}(n) - \mathbf{u}^{H}(n)\,\hat{\mathbf{w}}(n)\right]$$

where M is the total number of tap weights in the transversal filter.
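For real-valued signals the conjugates and Hermitian transposes above reduce to plain transposes, and the update equations can be realized directly. A minimal sketch (our own illustrative implementation, with arbitrary parameter choices) cancels a sinusoidal interference from the primary input:

```python
import math

def lms_cancel(d, u, M=8, mu=0.05):
    """Adaptive noise canceller (real-valued LMS):
       y(n) = sum_i w_i(n) u(n-i),  e(n) = d(n) - y(n),
       w(n+1) = w(n) + mu * u(n) * e(n)."""
    w = [0.0] * M
    buf = [0.0] * M                   # tap inputs u(n), u(n-1), ..., u(n-M+1)
    e_out = []
    for n in range(len(d)):
        buf = [u[n]] + buf[:-1]
        y = sum(wi * ui for wi, ui in zip(w, buf))
        e = d[n] - y
        w = [wi + mu * ui * e for wi, ui in zip(w, buf)]
        e_out.append(e)
    return e_out

wo = 2 * math.pi * 0.1                # normalized interference frequency (illustrative)
N = 4000
d = [math.sin(wo * n + 0.7) for n in range(N)]   # primary input: pure interference here
u = [math.cos(wo * n) for n in range(N)]         # reference input, correlated with it
e = lms_cancel(d, u)
early = sum(x * x for x in e[:200]) / 200
late = sum(x * x for x in e[-200:]) / 200
print(early, late)   # the error power collapses once the canceller converges
```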
With a sinusoidal excitation as the input of interest, we restructure the block diagram of the adaptive noise canceller of figure (3.2). According to this new representation, we may lump the sinusoidal input u(n), the transversal filter, and the weight-update equation of the LMS algorithm into a single open-loop system defined by the transfer function G(z), as shown in figure (3.3).
Figure (3.4) Equivalent Model in Z-domain
where

$$G(z) = \frac{Y(z)}{E(z)}$$

and Y(z) and E(z) are the z-transforms of the filter output y(n) and the estimation error e(n), respectively. Given E(z), our task is to find Y(z), and therefore G(z). To do so we use the signal-flow-graph representation of figure (3.3), in which we have singled out the i-th tap weight for specific attention. The corresponding value of the tap input is
$$u(n-i) = A\cos\left[\omega_0 (n-i) + \phi\right] = \frac{A}{2}\left[e^{j(\omega_0 n + \phi_i)} + e^{-j(\omega_0 n + \phi_i)}\right]$$

where φ_i = φ − ω₀i. Taking the z-transform of u(n−i) e(n):

$$Z\left[u(n-i)\,e(n)\right] = \frac{A}{2}\,e^{j\phi_i}\,E\!\left(z\,e^{-j\omega_0}\right) + \frac{A}{2}\,e^{-j\phi_i}\,E\!\left(z\,e^{j\omega_0}\right)$$
$$\hat{W}_i(z) = \frac{\mu A}{2}\,\frac{1}{z-1}\left[e^{j\phi_i}\,E\!\left(z\,e^{-j\omega_0}\right) + e^{-j\phi_i}\,E\!\left(z\,e^{j\omega_0}\right)\right]$$
Then, since

$$y(n) = \frac{A}{2}\sum_{i=0}^{M-1} \hat{w}_i(n)\left[e^{j(\omega_0 n + \phi_i)} + e^{-j(\omega_0 n + \phi_i)}\right]$$

we obtain Y(z):

$$Y(z) = \frac{A}{2}\sum_{i=0}^{M-1}\left[e^{j\phi_i}\,\hat{W}_i\!\left(z\,e^{-j\omega_0}\right) + e^{-j\phi_i}\,\hat{W}_i\!\left(z\,e^{j\omega_0}\right)\right]$$
We find that it consists of two components:

1) A time-invariant component:

$$\frac{\mu M A^2}{4}\left(\frac{1}{z\,e^{-j\omega_0} - 1} + \frac{1}{z\,e^{j\omega_0} - 1}\right)$$

2) A time-varying component whose magnitude is governed by

$$\beta(\omega_0, M) = \frac{\sin(M\omega_0)}{\sin\omega_0}$$

Since M is large, we may neglect the time-varying component:

$$\frac{\beta(\omega_0, M)}{M} = \frac{\sin(M\omega_0)}{M\,\sin\omega_0} \approx 0$$
Thus Y(z) is

$$Y(z) = E(z)\,\frac{\mu M A^2}{4}\left(\frac{1}{z\,e^{-j\omega_0} - 1} + \frac{1}{z\,e^{j\omega_0} - 1}\right)$$

and the open-loop transfer function G(z) is

$$G(z) = \frac{Y(z)}{E(z)} = \frac{\mu M A^2}{4}\left(\frac{1}{z\,e^{-j\omega_0} - 1} + \frac{1}{z\,e^{j\omega_0} - 1}\right) = \frac{\mu M A^2}{2}\left(\frac{z\cos\omega_0 - 1}{z^2 - 2z\cos\omega_0 + 1}\right)$$
The adaptive filter has a null determined by the angular frequency ω₀ of the sinusoidal interference, since G(z) has a zero at z = 1/cos ω₀; this proves the first characteristic. From G(z) we obtain H(z), the transfer function of the closed-loop feedback system:
$$H(z) = \frac{E(z)}{D(z)} = \frac{1}{1 + G(z)}$$

where E(z) is the z-transform of the system output e(n), and D(z) is the z-transform of the system input d(n).

$$H(z) = \frac{z^2 - 2z\cos\omega_0 + 1}{z^2 - 2\left(1 - \mu M A^2/4\right) z\cos\omega_0 + \left(1 - \mu M A^2/2\right)}$$
So this function is the transfer function of a second order digital notch filter with a notch
at the normalized angular frequency ωo. And finally we find that poles of H (Z) lies
inside the unit circle. This means that the adaptive filter is stable [that is needed for real
time practical life].
Also we find that the zeroes of H (Z) lies on the unit circle that means that the adaptive
noise canceller has a notch of infinite depth at frequency ωo. Also the sharpness of the
notch filter is determined by the closeness of the poles of H (Z) to its zeros. And since
the 3-dB bandwidth is used for this we find that it is equal
µΜΑ 2
Β=
2
The smaller we therefore make µ, the smaller B is and the sharper the notch, and so we finally satisfy the 2nd characteristic too.
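As a check, the notch behaviour of H(z) can be sketched numerically. The values of µ, M, A and ω0 below are illustrative assumptions (not taken from the text); the script evaluates |H(e^jω)| on a dense frequency grid and compares the measured 3-dB bandwidth with B = µMA²/2:

```python
import numpy as np

# Illustrative parameter values (assumptions, not taken from the text)
mu, M, A = 0.01, 16, 1.0
w0 = 0.3 * np.pi                      # interference frequency (rad/sample)
k = mu * M * A**2 / 4                 # loop-gain constant in H(z)

w = np.linspace(1e-4, np.pi, 100_000)
z = np.exp(1j * w)

# H(z) = (z^2 - 2 z cos(w0) + 1) / (z^2 - 2 (1 - k) z cos(w0) + (1 - 2k))
num = z**2 - 2 * z * np.cos(w0) + 1
den = z**2 - 2 * (1 - k) * z * np.cos(w0) + (1 - 2 * k)
H = np.abs(num / den)

i0 = np.argmin(np.abs(w - w0))        # grid point nearest the notch
band = w[H < 1 / np.sqrt(2)]          # frequencies attenuated by >= 3 dB
print(H[i0])                                  # essentially zero at w0
print(band.max() - band.min(), mu * M * A**2 / 2)   # measured vs. predicted B
```

Away from ω0 the response stays close to unity, so the set of frequencies attenuated by more than 3 dB is a narrow band around the notch, and its width agrees closely with B.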
Now we will discuss the implanted hearing aid: the hearing aid for severe hearing loss.
3.3 Second Model
A method of treatment for a sensorineural hearing impairment
Introduction:
- Cochlear implants, which are implanted through a surgical procedure, are taking
hearing technology to a new level.
- The best candidates for cochlear implants are individuals with profound hearing loss in both ears who have not received much benefit from traditional hearing aids and who are in good general health.
- Children as young as 14 months have been successfully implanted.
Processing:
• The proposed middle ear sound sensor, based on accelerometer operating
principle, can be attached to the "umbo" to convert the umbo vibration to an
electrical signal representing the input acoustic information.
• This electrical signal can be further processed by the cochlear implant speech
processor, which is followed by a stimulator to drive cochlear electrodes.
• The speech processor, stimulator, power management and control unit,
rechargeable battery and radio frequency (RF) coil will be housed in a
biocompatible package located under the skin to form a wireless network with
external adaptive control and battery charging system.
• Wireless communication between the implant and external system is essential
for post-implant programming of the speech processor.
Tuning process:
- After implantation it is necessary for the patient to go through a tuning procedure for speech-processor optimization so that the cochlear implant can function properly.
- In this tuning procedure, an audiologist will present different auditory stimuli
consisting of basic sounds or words to a patient.
- The acoustic information will be detected by the implanted accelerometer and
converted into an electrical signal.
- The speech processor will then process the signal and filter it into a group of outputs,
which represent the acoustic information in an array corresponding to individual
cochlear electrode bandwidth.
- Then an array of biphasic current pulses with proper amplitude and duty cycle will be
delivered to stimulate the electrodes located inside the cochlea along the auditory
nerve.
- This excitation activates neurotransmitters, which travel to the brain for sound
reception.
- The patient will then provide feedback in terms of speech reception quality to the
audiologist.
- To achieve the optimal performance for the active implant-human interface network,
the audiologist will adaptively tune the speech processor through the RF-coils-based
wireless link.
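The filtering stage described above (the speech processor splitting the signal into a group of outputs, one per electrode channel) can be sketched as a simple band-pass filterbank. The sampling rate, band edges and FFT-based ideal filters below are illustrative assumptions, not the design of an actual cochlear-implant processor:

```python
import numpy as np

def channel_outputs(signal, fs, edges):
    """Split the signal into frequency bands (ideal FFT-domain band-pass
    filters) and return the RMS level per band, a crude stand-in for the
    per-electrode outputs of the speech processor."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    levels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spec * ((freqs >= lo) & (freqs < hi))
        levels.append(np.sqrt(np.mean(np.fft.irfft(band, len(signal)) ** 2)))
    return levels

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 500 * t)                 # a 500 Hz test sound
edges = [200, 400, 700, 1200, 2000, 3500, 6000]    # hypothetical channel edges
levels = channel_outputs(tone, fs, edges)
print(np.argmax(levels))    # the 400-700 Hz channel carries the tone
```

In a real processor each band level would then set the amplitude of the biphasic current pulses delivered to the corresponding electrode.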
Re-charging battery:
A- Through the same link, an intelligent power management network for extending the
battery longevity and ensuring patient safety can also be implemented between the
implanted rechargeable battery and external powering electronics.
B- An external coil loop worn in a headset can transmit RF power across the skin to the
receiving loop, and active monitoring and control of incident RF power can be realized.
C- Upon completion of battery charging, the communication and control unit can send
out a wireless command to turn off the external powering system.
Figure (3.7) umbo primary and secondary axes
Accelerometer's design:
To achieve the optimum sensor design, we have to investigate the vibration characteristics of the human temporal bone.
B- Temporal bone experimental setup and procedures:
Therefore, any potential misalignment in sensor placement will have a minimal impact
on the output signal amplitude because of the similar acceleration amplitude response,
and also negligible frequency distortion due to the similar frequency response.
Figure (3.9)
Figure (3.10)
3.3.3 Design Guideline for optimum accelerometer:
Figure (3.11)
- The previous measurement results can serve as design guideline to help define the
specifications for the prototype accelerometer.
- Audiologists report that audible speech is primarily concentrated between 500 Hz and 8 kHz, and that the loudness of quiet conversation is approximately 55 dB SPL.
- Within the audible speech spectrum, 500 Hz has the lowest acceleration response, and is thus the most difficult to detect.
- Today's cochlear implants have multiple channels and electrodes to provide an appropriate stimulus to the correct location within the cochlea; at 500 Hz, the electrode channel bandwidth is on the order of 200 Hz.
Therefore:
1- To detect sounds at 55 dB SPL at 500 Hz, an accelerometer with a sensitivity of 50 µg/√Hz and a bandwidth of 10 kHz is needed.
2- The total device mass is another important design consideration:
- The mass of the umbo and the long process of the malleus is about 20–25 mg.
- Adding a mass greater than 20 mg can potentially result in a significant damping effect on the frequency response of the middle-ear ossicular chain. Therefore, the total mass of the packaged sensing system needs to be kept below 20 mg.
3.3.4 Conclusion:
An accelerometer with reduced package mass (below 20 mg) and improved performance, achieving a sensitivity of 50 µg/√Hz and a bandwidth of 10 kHz, would be needed to satisfy the requirements for normal conversation detection.
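The 500 Hz requirement can be checked with a short calculation: with a noise-floor density of 50 µg/√Hz and the 200 Hz electrode-channel bandwidth quoted above, the in-band acceleration noise (density times the square root of the bandwidth) sets the smallest detectable umbo acceleration:

```python
import math

density_ug = 50.0     # noise-floor density from the text, micro-g/sqrt(Hz)
bandwidth = 200.0     # electrode-channel bandwidth at 500 Hz, from the text

# in-band acceleration noise = density * sqrt(bandwidth)
noise_ug = density_ug * math.sqrt(bandwidth)
print(noise_ug)       # about 707 micro-g of equivalent input noise
```

An umbo acceleration must therefore exceed roughly 700 µg in that channel to be detected, which is what drives the 50 µg/√Hz sensitivity requirement.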
_________________________________________________________________________
Page 32 Topic 2 – Acoustical Simulation of Room
CHAPTER 1
GEOMETRICAL ACOUSTICS
1.1 Introduction
What is Room Acoustics?
The sound behavior in a room depends significantly on the ratio of the frequency (or
the wavelength) of the sound to the size of the room. Therefore, the audible spectrum
can be divided into four regions (zones) illustrated in the following drawing (for
rectangular room):
1. The first zone is below the frequency that has a wavelength of twice the longest
length of the room. In this zone sound behaves very much like changes in static
air pressure.
4. In the fourth zone, sounds behave like rays of light bouncing around the room.
The first three zones constitute what we call “Physical Acoustics”, and the fourth zone, which is the target of our project, is called “Geometrical Acoustics”.
What is Acoustical Simulation?
Human beings hear things by virtue of pressure waves impinging on their eardrums. All of the information that we need to know about the sound (such as volume, frequency content, direction, etc.) is contained in those pressure waves (or so we believe). What an auralization system tries to do is to fool your brain into thinking that you're listening to a sound source in an acoustical space (i.e., a room) that you're not in. It does this by taking the original sound source and altering the frequency spectrum according to both the acoustic response of the room and the directional characteristics of the listener's own hearing.
Once this is done, you play back two signals (one for each ear) through, for example,
headphones and listen. Hopefully, you get the sense that what you are listening to is
what you would actually hear if you were really in the room with the source. Simple
enough, right?
1.2 Sound Behavior
Consider a sound source situated within a bounded space. Sound waves will propagate
away from the source until they encounter one of the room's boundaries where, in
general, some of the energy will be absorbed, some transmitted and the rest reflected
back into the room.
Sound arriving at a particular receiving point within a room can be considered in two distinct parts. The first part is the sound that travels directly from the sound source to the receiving point itself. This is known as the direct sound field and is independent of room shape and materials, but dependent upon the distance between source and receiver.
After the arrival of the direct sound, reflections from room surfaces begin to arrive. These form the indirect sound field, which is independent of the source/receiver distance but greatly dependent on room properties.
If the sound source is abruptly switched off, the sound intensity at any point will not
suddenly disappear, but will fade away gradually as the indirect sound field begins to
die off and reflections get weaker. The rate of this decay is a function of room shape
and the amount/position of absorbent material. The decay in highly absorbent rooms
will not take very long at all, whilst in large reflective rooms, this can take quite a long
time.
This gradual decay of sound energy is known as reverberation and, as a result of this
proportional relationship between absorption and sound intensity, it is exponential as a
function of time. If the sound pressure level (in dB) of a decaying reverberant field is
graphed against time, one obtains a reverberation curve which is usually fairly straight,
although the exact form depends upon many factors including the frequency spectrum
of the sound and the shape of the room.
Sound Ray:
As in geometrical optics, we mean by a sound ray a small portion of a spherical wave with vanishing aperture, which originates from a certain point. It has a well-defined direction of propagation and is subject to the same laws of propagation as a light ray, apart from the different propagation velocity. Of these laws, only the law of reflection is of importance in room acoustics. But the finite velocity of propagation must be considered in all circumstances, since it is responsible for many important effects such as reverberation, echoes and so on.
Since the lateral extension of a sound ray is vanishingly small, the reflection law is
valid for any part of a plane no matter how small. Therefore it can be applied equally
well to the construction of the reflection of an extended ray bundle from a curved
surface by imagining each ray in turn to be reflected from the tangential plane which it
strikes.
The mirror source concept:
The reflection of a sound ray originating from a certain point can be illustrated by the construction of a mirror source, provided that the reflecting surface is plane (see figure 1.2). At some distance from the reflecting plane there is a sound source A, and we are interested in the sound transmission to another point B. It takes place along the direct path AB on the one hand (direct sound) and, on the other, by reflection from the wall. To find the path of the reflected ray we construct A′, the mirror image of A, connect A′ to B, and connect A to the point of intersection of A′B with the plane.
(Figure 1.2) The mirror source: sound source A in front of the reflecting plane, its mirror image A′ behind it, and the receiving point B.
Once we have constructed the mirror source A′ associated with a given original source A, we can disregard the wall altogether; its effect is now replaced by that of the mirror source. Of course, we must assume that the mirror source emits exactly the same sound signal as the original source and that its directional characteristics are symmetrical to those of A. If the extension of the reflecting wall is finite, then we must restrict the directions of emission of A′ accordingly. Usually, not all the energy striking a wall is reflected from it; part of the energy is absorbed by the wall (or transmitted to the other side, which amounts to the same thing as far as the reflected fraction is concerned).
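The mirror-source construction can be sketched numerically: reflecting A across the wall plane gives A′, and the length of the reflected path from A via the wall to B equals the straight-line distance |A′B|. The wall plane and the positions of A and B below are illustrative:

```python
import numpy as np

def mirror_source(A, n, d):
    """Mirror point A across the plane n . x = d (n a unit normal)."""
    return A - 2 * (np.dot(A, n) - d) * n

# Illustrative geometry: the wall is the plane x = 0; A and B lie in front
n, d = np.array([1.0, 0.0, 0.0]), 0.0
A = np.array([2.0, 0.0, 0.0])
B = np.array([3.0, 4.0, 0.0])

A_img = mirror_source(A, n, d)           # A' = (-2, 0, 0)
direct = np.linalg.norm(B - A)           # direct path |AB|
reflected = np.linalg.norm(B - A_img)    # reflected path length = |A'B|
delay = (reflected - direct) / 343.0     # extra delay at c = 343 m/s
print(A_img, direct, reflected, delay)
```

The reflected path is always longer than the direct one, so the reflection arrives after the direct sound, which is the basis of the reflection diagrams discussed later.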
1.3.2 Sound Reflections in Rooms
Suppose we follow a sound ray originating from a sound source on its way through a
closed room. Then we find that it is reflected not once, but many times, from the walls,
the ceiling and perhaps also from the floor. This succession of reflections continues
until the ray arrives at a perfectly absorbent surface. But even if there is no perfectly
absorbent area in our enclosure, the energy carried by the ray will become vanishingly
small after some time, because during its free propagation in air as well as with each
reflection a certain part of it is lost by absorption. If the room is bounded by plane surfaces, it may be advantageous to find the paths of the sound rays by constructing the mirror sources.
Let us examine, in a room of arbitrary shape, the positions of a sound source and a point of observation. We assume the sound source emits, at a certain time, a very short sound pulse with equal intensity in all directions. This pulse will reach the observation point (see figure 1.3) not only by the direct path, but also via numerous partly single, partly multiple reflections, of which only a few are indicated in figure 1.3. The total sound field is thus composed of the 'direct sound' and of many 'reflections'.
In the following we use the term 'reflection' with a two-fold meaning: first to indicate
the process of reflecting sound from a wall and secondly as the name for a sound
component which has been reflected.
These reflections reach the observer from various directions, moreover their strengths
may be quite different and finally they are delayed with respect to the direct sound by
different times, corresponding to the total path length they have covered until they
reach the observation point.
Thus, each reflection must be characterized by three quantities: its direction, its relative
strength and its relative time of arrival, i.e. its delay time.
The sum total of the reflections arriving at a certain point after emission of the original
sound pulse is the reverberation of the room, measured at or calculated for that point.
1.3.3 Room Reverberation
If we mark the arrival times of the various reflections by perpendicular dashes over a horizontal time axis and choose the heights of the dashes proportional to the relative strengths of the reflections, i.e. to the coefficients A_n, we obtain what is frequently called a 'reflection diagram' or 'echogram'. It contains all significant information on the temporal structure of the sound field at a certain room point. In figure (1.4) the reflection diagram of a rectangular room with dimensions 40 m × 25 m × 8 m is plotted. After the direct sound, arriving at t = 0, the first strong reflections occur sporadically; later their temporal density increases rapidly, while at the same time the reflections carry less and less energy.
(Figure 1.4) Reflection diagram for certain positions of sound source and
receiver in a rectangular room of 40 m x 25 m x 8 m. Abscissa is the delay
time of a reflection, ordinate its level, both with respect to the direct sound
arriving at t = 0.
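A reflection diagram like figure 1.4 can be approximated with the mirror-source construction applied to the 40 m × 25 m × 8 m room. The source and receiver positions and the average absorption coefficient below are illustrative assumptions; each image source contributes an echo whose delay follows from its distance and whose level follows from spherical spreading and the number of wall reflections:

```python
import numpy as np
from itertools import product

c = 343.0                            # speed of sound, m/s
L = np.array([40.0, 25.0, 8.0])      # room dimensions from figure 1.4
S = np.array([10.0, 8.0, 3.0])       # illustrative source position
R = np.array([25.0, 15.0, 4.0])      # illustrative receiver position
alpha = 0.2                          # illustrative average wall absorption

d_direct = np.linalg.norm(R - S)
echoes = []                          # (delay in s, level re direct sound)
for k in product(range(-2, 3), repeat=3):        # image orders per axis
    for sgn in product((1, -1), repeat=3):
        img = np.array([2 * ki * Li + sg * si
                        for ki, Li, si, sg in zip(k, L, S, sgn)])
        # number of wall reflections that produced this image source
        refl = sum(abs(2 * ki) if sg == 1 else abs(2 * ki - 1)
                   for ki, sg in zip(k, sgn))
        r = np.linalg.norm(R - img)
        level = (1 - alpha) ** refl * d_direct / r   # absorption + spreading
        echoes.append(((r - d_direct) / c, level))

echoes.sort()
print(echoes[0])     # the direct sound: delay 0, relative level 1
print(len(echoes))   # 1000 image sources up to order 2 per axis
```

Plotting level (in dB) against delay for this list reproduces the qualitative shape of figure 1.4: sporadic strong early reflections followed by an increasingly dense tail of ever weaker ones.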
As we shall see later in more detail, the role of the first isolated reflections with respect
to our subjective hearing impression is quite different from that of the very numerous
weak reflections arriving at later times, which merge into what we perceive
subjectively as reverberation. Thus, we can consider the reverberation of a room not
only as the common effect of free decaying vibrational modes, but also as the sum total
of all reflections, except the very first ones. The reverberation time, that is the time in which the total energy falls to one millionth of its initial value, is thus:

T = 0.163 · V / (4mV − S·ln(1 − α))        Eq. (1.1)
Here the value of the sound velocity in air has been used, the volume V is expressed in m³ and the wall area S in m². By rather simple geometric considerations we have thus obtained the most important formula of room acoustics, which relates the reverberation time, the most characteristic figure for the acoustics of a room, to its geometrical data and to the absorption coefficient of its walls. We have assumed more or less tacitly that the latter is the same for all wall portions and that it does not depend on the angle at which a wall is struck by the sound rays.
Where:
V is the volume of the enclosure (m³), S is the total wall area (m²), α is the (average) absorption coefficient of the walls, and m is the attenuation constant of air. In the simpler Sabine form, T = 0.163 V/A, the total absorption A (in sabins) is calculated as the sum of the surface area (in m²) times the absorption coefficient of each material used within the enclosure.
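Eq. (1.1) can be evaluated for the rectangular 40 m × 25 m × 8 m room used earlier, with an illustrative average absorption coefficient and air attenuation neglected (m = 0), and compared with the simpler Sabine form T = 0.163 V/A:

```python
import math

Lx, Ly, Lz = 40.0, 25.0, 8.0               # room of figure 1.4
V = Lx * Ly * Lz                           # volume, m^3
S = 2 * (Lx * Ly + Lx * Lz + Ly * Lz)      # total wall area, m^2
alpha = 0.2                                # illustrative avg. absorption
m = 0.0                                    # air attenuation neglected

# Eq. (1.1):  T = 0.163 V / (4 m V - S ln(1 - alpha))
T_eyring = 0.163 * V / (4 * m * V - S * math.log(1 - alpha))
# simpler Sabine form: T = 0.163 V / A, with total absorption A = S * alpha
T_sabine = 0.163 * V / (S * alpha)
print(T_eyring, T_sabine)   # Sabine gives the slightly longer estimate
```

For small α the two forms nearly coincide, since −ln(1 − α) ≈ α; for more absorbent rooms Eq. (1.1) predicts a shorter decay than Sabine's formula.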
Research (Kuttruff, 1973), however, shows that this is not always the case. It has been shown that it is the initial portion of the exponential sound-decay curve which is responsible for our subjective impression of reverberation, as the later portion is usually masked by new sounds. To account for this, the Early Decay Time (EDT) is used. This is measured in the same way as the normal reverberation time, but over only the first 10–15 dB of decay, depending on the work being referenced.
• Definition:
Definition is a measure of the clarity of speech, specifying, for example, how well a speaker can be understood at a given listening position.
• Clarity:
Clarity is a comparable measure for the clarity of music. It is used for large rooms such as chamber-music halls and concert halls. It is not appropriate for an evaluation of music reproduced with loudspeakers in a recording studio or listening room.
Another standard defines a method for computing a physical measure that is highly
correlated with the intelligibility of speech; this measure is called the Speech
Intelligibility Index, or SII.
CHAPTER 2
ARTIFICIAL REVERBERATION
2.1 Introduction
Reverberation:
Reverberation is the result of many reflections of sound in a room. For any sound, there's a direct path by which it reaches the ear of the listener, but it's not the only path; sound waves may take slightly longer paths by reflecting off the walls and ceiling before arriving at our ears.
This sound will arrive later than the direct sound, and weaker, because the walls absorb part of the sound energy. It may also reflect again before arriving at our ears.
These delays and attenuations of the sound waves are what we call “Reverberation”.
Summary:
Reverberation occurs when copies of the audio signal reach the ear with different
delays and amplitudes. It depends on the room geometry and its occupants.
Reverberation is a natural phenomenon.
Artificial Reverberation:
Artificial Reverberation is added to sound signals requiring additional reverberation for
optimum listening enjoyment.
Our Aim:
To generate an artificial reverberation which is indistinguishable from the natural
reverberation of real rooms.
2) Fluttering:
Fluttering of the reverberated sound occurs because the echo density* is too low compared to the echo density of a real room, especially for short transients.
*Echo density: the number of echoes per second at the output of the reverberator for a
single pulse at the input.
How to avoid the above degradations in artificial reverberators?
The problem of coloration can be solved by making an artificial reverberator with a flat
amplitude-frequency response. This can be achieved by passing all frequency
components equally by means of an all-pass filter which will be described below.
Concerning the problem of low echo density, it was found that approximately 1000 echoes per second are required for flutter-free reverberation. Unfortunately, echo densities of 1000 per second are not easily achieved practically by a one-dimensional delay device.
Many researchers have suggested multiple feedback paths to produce a higher echo density. However, multiple feedbacks have severe stability problems. They also lead to non-flat frequency responses and non-exponential decay characteristics.
We can simply treat the problem of low echo density by having a basic reverberating
unit that can be connected in series any number of times.
Now the question that comes to mind is: why has this simple remedy for the echo-density problem not been used from the beginning? In fact, the answer is quite simple: the existing reverberators have highly irregular frequency responses.
Here we have to mention that if the basic reverberator unit has a flat frequency response, the series connection of any number of such units will have a flat response too.
Conclusion:
All-pass reverberators (Reverberators with flat frequency response) can remove the two
main defects in artificial reverberators which are: Coloration and Fluttering.
(Figure) Comb filter: the input is summed with a feedback path of gain g and passed through a delay τ to the output.
Impulse Response of Comb filter:
Analysis:
The impulse response of Comb Filter is given by the following equation:
h(t) = δ(t − τ) + g·δ(t − 2τ) + g²·δ(t − 3τ) + g³·δ(t − 4τ) + . . .        Eq. (2.1)
Where:
δ(t − τ) is the time response of a simple echo produced by a delay line, and g is the feedback gain, which must be less than one to guarantee the stability of the filter.
H(ω) = e^(−jωτ) + g·e^(−2jωτ) + g²·e^(−3jωτ) + g³·e^(−4jωτ) + . . .        Eq. (2.2)

Summing this geometric series gives

H(ω) = e^(−jωτ) / (1 − g·e^(−jωτ))        Eq. (2.3)
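Evaluating Eq. (2.3) numerically shows how far from flat the comb filter's magnitude response is: it swings between 1/(1 − g) and 1/(1 + g), which is the origin of the coloration problem. The delay and gain below are illustrative:

```python
import numpy as np

g, tau = 0.7, 0.04                    # illustrative gain and delay (40 ms)
f = np.linspace(0.0, 200.0, 20001)    # frequency axis, Hz
w = 2 * np.pi * f

# Eq. (2.3):  H(w) = e^{-j w tau} / (1 - g e^{-j w tau})
H = np.abs(np.exp(-1j * w * tau) / (1 - g * np.exp(-1j * w * tau)))

print(H.max(), 1 / (1 - g))           # peaks, at multiples of 1/tau = 25 Hz
print(H.min(), 1 / (1 + g))           # valleys, midway between the peaks
```

With g = 0.7 the response swings over roughly 15 dB between peaks and valleys, spaced 1/τ apart, hence the name "comb" filter.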
(Figure) All-pass filter: the comb filter of the previous section, with an additional direct path of gain −g from input to output and the delayed path weighted by 1 − g².
Impulse Response of All-pass filter:
Analysis:
The impulse response of the above All-pass Filter is given by the following equation:
h(t) = −g·δ(t) + (1 − g²)·[δ(t − τ) + g·δ(t − 2τ) + . . .]        Eq. (2.7)

Its transfer function is then

H(ω) = −g + (1 − g²) · e^(−jωτ) / (1 − g·e^(−jωτ))        Eq. (2.8)
Or

H(ω) = e^(−jωτ) · (1 − g·e^(jωτ)) / (1 − g·e^(−jωτ))        Eq. (2.10)

Since the numerator factor (1 − g·e^(jωτ)) and the denominator are complex conjugates of equal magnitude, |H(ω)| = 1 for all ω.
Conclusion:
By implementing the all-pass reverberator discussed above, we can say that we possess
a basic reverberating unit that passes all frequencies with equal gain and thus avoids the
problem of sound coloration. Moreover, if we connect in series any desired number of
such units; we can increase the echo density.
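The basic all-pass reverberating unit can be sketched as a few lines of sample-by-sample code (the delay and gain values are illustrative). The impulse response reproduces Eq. (2.7), and the magnitude of its transform is flat, as Eq. (2.10) predicts:

```python
import numpy as np

def allpass(x, D, g):
    """All-pass unit of Eq. (2.10): y[n] = -g x[n] + x[n-D] + g y[n-D]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - D] if n >= D else 0.0
        yd = y[n - D] if n >= D else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

D, g = 50, 0.7                        # illustrative delay (samples) and gain
imp = np.zeros(2048)
imp[0] = 1.0
h = allpass(imp, D, g)

# impulse response follows Eq. (2.7): -g, (1 - g^2), (1 - g^2) g, ...
print(h[0], h[D], h[2 * D])
# magnitude response is flat: the all-pass property of Eq. (2.10)
Hmag = np.abs(np.fft.fft(h))
print(Hmag.min(), Hmag.max())
```

Because the response is flat, any number of such units can be cascaded without coloring the sound, which is exactly the series connection advocated above.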
reverberation time. This can be done by adding a simple RC circuit in the feedback loop of each all-pass reverberator.
(Figure) Reverberator structure: the input feeds four comb filters in parallel (delays τ1–τ4, feedback gains g1–g4); their summed output passes through two all-pass units in series (units 5 and 6, with gains g5 and g6), with gain g7 in the output path.
Description:
Here, as we can see, we used a set of four comb filters connected in parallel. Although comb filters have irregular frequency responses, the human ear cannot distinguish between a flat response and an irregular response that fluctuates over approximately 10 dB. Studies have shown that such irregularities go unnoticed when the density of peaks and valleys is high enough, and this is the case for our four comb filters.
Several all-pass reverberators are connected in series with the comb filters to
increase the echo density.
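A minimal sketch of this structure (four parallel comb filters followed by all-pass units in series) is given below; all delays and gains are illustrative, with the comb delays chosen mutually prime so that their echoes do not coincide:

```python
import numpy as np

def comb(x, D, g):
    """Feedback comb filter: y[n] = x[n-D] + g y[n-D] (response of Eq. 2.1)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        if n >= D:
            y[n] = x[n - D] + g * y[n - D]
    return y

def allpass(x, D, g):
    """All-pass unit: y[n] = -g x[n] + x[n-D] + g y[n-D]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = -g * x[n]
        if n >= D:
            y[n] += x[n - D] + g * y[n - D]
    return y

def reverberator(x):
    # four parallel comb filters; mutually prime delays keep echoes distinct
    wet = sum(comb(x, D, g) for D, g in
              [(1051, 0.75), (1123, 0.74), (1289, 0.73), (1423, 0.72)])
    # two all-pass units in series raise the echo density
    for D, g in [(223, 0.7), (113, 0.7)]:
        wet = allpass(wet, D, g)
    return wet

imp = np.zeros(8000)
imp[0] = 1.0
h = reverberator(imp)
print(np.count_nonzero(np.abs(h) > 1e-6))   # many echoes: high echo density
```

A single comb filter would produce only a handful of echoes in the same interval; the parallel-plus-series combination multiplies the echo count toward the flutter-free target discussed earlier.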
2.4 Ambiophonic Reverberation
We can achieve a highly diffuse reverberation by adding one more modification to our
artificial reverberator; that is to make it “Ambiophonic”.
In order to create the spatially diffuse character of real reverberation, we need to
generate several different reverberated signals and to feed them into a number of
loudspeakers distributed around the listener.
(Figure) Ambiophonic reverberator: the input passes through two all-pass units A1 and A2 in series, then through four comb filters C1–C4 in parallel; each comb output and its inverted copy (gain −1) feed a resistance matrix producing up to 16 output combinations.
Description:
• The diagram illustrates an Ambiophonic installation in which the order of the comb filters and all-pass filters has been inverted compared with the diagram of section (2.5).
• The outputs and the inverted outputs of the 4 comb filters are connected to a
resistance matrix which forms up to 16 different combinations of the comb filter
outputs. Each combination uses each comb filter output or its negative exactly once.
• The matrix outputs are supplied to power amplifiers and loudspeakers distributed
around the listening area.
CHAPTER 3
SPATIALIZATION
3.1 Introduction
The acoustical sound field around us is very complex. Direct sounds, reflections, and refractions arrive at the listener's ears; the auditory system then analyzes the incoming sounds and connects them mentally to sound sources. Spatial hearing is an important part of experiencing the surrounding world.
The perception of the direction of the sound source relies heavily on the two main
localization cues: Interaural level difference (ILD) and Interaural time difference (ITD).
These frequency-dependent differences occur when the sound arrives at the listener's
ears after having traveled paths of different lengths or being shadowed differently by
the listener's head. In addition to ILD and ITD some other cues, such as spectral
coloring, are used by humans in sound source localization.
Various attempts to enlarge the sound field have been proposed. Horizontal-only
(pantophonic) sound fields have been created with various numbers of loudspeakers
and with various systems of encoding and decoding and matrixing. In most systems the
loudspeakers are situated in a two-dimensional (horizontal) plane. Some attempts to
produce periphonic (full-sphere) sound fields with three-dimensional loudspeaker
placement exist, such as holophony or three dimensional Ambisonics.
In most systems the positions of the loudspeakers are fixed. In the Ambisonics systems, the number and placement of the loudspeakers may be variable. However, the best possible localization accuracy is achieved with an orthogonal loudspeaker placement; when the number of loudspeakers is greater, the accuracy is not improved appreciably.
A natural improvement would be a virtual sound source positioning system that would
be independent of the loudspeaker arrangement and could produce virtual sound
sources with maximum accuracy using the current loudspeaker configuration.
Vector base amplitude panning (VBAP) is a new approach to the problem. It enables the use of an unlimited number of loudspeakers in an arbitrary two- or three-dimensional placement around the listener. The loudspeakers are required to be nearly equidistant from the listener, and the listening room is assumed to be not reverberant. Multiple moving or stationary sounds can be positioned in any direction in the sound field spanned by the loudspeakers.
In VBAP the amplitude panning method is reformulated with vectors and vector bases.
The reformulation leads to simple equations for amplitude panning, and the use of
vectors makes the panning methods computationally efficient.
In the simple amplitude panning method two loudspeakers radiate coherent signals,
which may have different amplitudes. The listener perceives an illusion of a single
auditory event (virtual sound source, phantom sound source), which can be placed on a
two-dimensional sector defined by locations of the loudspeakers and the listener by
controlling the signal amplitudes of the loudspeakers. A typical loudspeaker
configuration is illustrated in figure 3.1.
(Figure 3.1) Two-channel stereophonic configuration.
Two loudspeakers are positioned symmetrically with respect to the median plane. The amplitudes of the signals are controlled with gain factors g1 and g2, respectively. The loudspeakers are typically positioned at φ0 = 30° angles.
The direction of the virtual source is dependent on the relation of the amplitudes of the
emanating signals. If the virtual source is moving and its loudness should be constant,
the gain factors that control the channel levels have to be normalized. The sound power
can be set to a constant value C, whereby the following approximation can be stated:

g1² + g2² = C        Eq. (3.1)
The parameter C > 0 can be considered a volume control of the virtual source. The
perception of the distance of the virtual source depends within some limits on C ̶ the
louder the sound, the closer it is located. To control the distance accurately, some
psycho-acoustical phenomena should be taken into account, and some other sound
elements should be added, such as reflections and reverberations.
When the distance of the virtual source is left unattended, the virtual source can be
placed on an arc between the loudspeakers, the radius of which is defined by the
distance between the listener and the loudspeakers. The arc is called the active arc.
In the ideal panning process only the direction where the virtual source should appear is
defined and the panning tool performs the gain factor calculation. In the next two
subsections some different ways of calculating the factors will be presented.
sin φ / sin φ0 = (g1 − g2) / (g1 + g2)        Eq. (3.2)

where 0° < φ0 < 90°, −φ0 ≤ φ ≤ φ0, and g1, g2 ∈ [0, 1]. In Eq. (3.2) φ represents the angle between the x axis and the direction of the virtual source; ±φ0 is the angle between the x axis and the loudspeakers. This equation is valid if the listener's head is
pointing directly forward. If the listener turns his or her head following the virtual
source, the tangent law is more correct,
tan φ / tan φ0 = (g1 − g2) / (g1 + g2)        Eq. (3.3)

where 0° < φ0 < 90°, −φ0 ≤ φ ≤ φ0, and g1, g2 ∈ [0, 1]. Eqs. (3.2) and (3.3) have been
calculated with the assumption that the incoming sound is different only in magnitude,
which is valid for frequencies below 500-600 Hz. When keeping the sound power level
constant, the gain factors can be solved using Eqs. (3.2) and (3.1) or using Eqs. (3.3)
and (3.1). The slight difference between Eqs. (3.2) and (3.3) means that the rotation of
the head causes small movements of the virtual sources. However, in subjective tests it
was shown that this effect is negligible.
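The gain-factor calculation described above can be sketched as follows, taking Eq. (3.1) to be the constant-power condition g1² + g2² = C and solving it together with the tangent law, Eq. (3.3); angles are in degrees and the values used are illustrative:

```python
import numpy as np

def tangent_pan(phi, phi0, C=1.0):
    """Gain factors satisfying the tangent law of Eq. (3.3), normalized
    so that g1^2 + g2^2 = C (taking Eq. (3.1) to be the constant-power
    condition). Angles in degrees; |phi| < phi0."""
    t = np.tan(np.radians(phi)) / np.tan(np.radians(phi0))
    # tangent law: (g1 - g2)/(g1 + g2) = t  =>  g1 : g2 = (1 + t) : (1 - t)
    g1, g2 = 1.0 + t, 1.0 - t
    k = np.sqrt(C / (g1**2 + g2**2))
    return g1 * k, g2 * k

# a virtual source straight ahead gets equal gains ...
print(tangent_pan(0.0, 30.0))
# ... and a source halfway toward loudspeaker 1 favors g1
print(tangent_pan(15.0, 30.0))
```

As φ approaches ±φ0 one of the gains goes to zero, so the virtual source slides along the active arc between the two loudspeakers.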
Some kind of amplitude panning method is used in the Ambisonics encoding system. In
pantophonic Ambisonics the entire sound field is decoded to three channels using a
modified amplitude panning method. Two of the channels, X and Y, contain the
components of the sound on the x axis and the y axis, respectively. The third, W,
contains a monophonic mix of the sound material. The signal to be stored on the
channel is calculated by multiplying the input signal samples by the channel-specific
gain factor. The gain factors gx, gy, and gw are formulated as

gx = cos θ        Eq. (3.4)
gy = sin θ        Eq. (3.5)
gw = 0.707

where θ is the azimuth angle of the virtual source, measured from the x axis.
This method differs from the standard amplitude panning method in that the gain
factors gx and gy may have negative values. The negative values imply that the signal is
stored on the recorder in antiphase when compared with the monophonic mix in the W
channel. When the encoded sound field is decoded, the antiphase signals on a channel are applied to the loudspeakers in the negative direction of the respective axis. The
decoding stage is performed with matrixing equations. In the equations some additions
or subtractions are performed between the signal samples on the W channel and on the
X and Y channels. Equations for various loudspeaker configurations can be formulated.
The absolute values of the gain factors used in two-dimensional Ambisonics satisfy the
tangent law [Eq. (3.3)], which the reader may verify, for example, for values of 0° < θ <
90°, by setting θ = φ0 + φ, φ0 = 45°, g2 = gx, and g1 = gy, and by substituting Eqs. (3.4)
and (3.5) into the relation (gy − gx)/(gy + gx).
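This check is easy to carry out numerically. The explicit forms of Eqs. (3.4) and (3.5) are not reproduced in this excerpt; the sketch below assumes the standard first-order choices gx = cos θ and gy = sin θ:

```python
import math

# Assumed first-order encoding gains; the exact Eqs. (3.4)-(3.5) are not
# shown in the excerpt, so gx = cos(theta), gy = sin(theta) is an assumption.
def ambisonics_xy_gains(theta):
    return math.cos(theta), math.sin(theta)

# With theta = phi0 + phi, phi0 = 45 deg, g1 = gy and g2 = gx, the relation
# (gy - gx)/(gy + gx) should equal tan(phi)/tan(phi0).
phi0 = math.radians(45)
for phi_deg in range(-40, 41, 5):
    phi = math.radians(phi_deg)
    gx, gy = ambisonics_xy_gains(phi0 + phi)
    assert abs((gy - gx) / (gy + gx) - math.tan(phi) / math.tan(phi0)) < 1e-12
```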
In two-dimensional VBAP, the unit-length vectors l1 = [l11 l12]T and l2 = [l21 l22]T point
from the listening position toward loudspeakers 1 and 2, respectively, as seen in figure
(3.3). The superscript T denotes matrix transposition. The unit-length vector
p = [p1 p2]T, which points toward the virtual source, is expressed as a linear combination
of the loudspeaker vectors,

p = g1 l1 + g2 l2 . Eq. (3.7)

In Eq. (3.7) g1 and g2 are gain factors, which can be treated as nonnegative scalar
variables. We may write the equation in matrix form,

pT = g L12 , Eq. (3.8)

where g = [g1 g2] and L12 = [l1 l2]T. This equation can be solved if L12−1 exists,

g = pT L12−1 = [p1 p2] [ l11 l12 ; l21 l22 ]−1 . Eq. (3.9)
The inverse matrix L12−1 satisfies L12−1 L12 = I, where I is the identity matrix. L12−1 exists
when φ0 ≠ 0° and φ0 ≠ 90°, both problem cases corresponding to quite uninteresting
stereophonic loudspeaker placements. For such cases the one-dimensional VBAP can
be formulated.
Gain factors g1 and g2 calculated using Eq. (3.9) satisfy the tangent law of Eq. (3.3).
When the loudspeaker base is orthogonal, φ0 = 45°, the gain factors are also equivalent
to those calculated for the Ambisonics encoding system, with the exception that the
gain factors in Ambisonics may have negative values. In such cases, however, the
absolute values of the factors are equal.
When φ0 ≠ 45°, the gain factors have to be normalized using the equation

g_scaled = C g / √(g1² + g2²) . Eq. (3.10)
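Eqs. (3.7) to (3.10) combine into a short routine. The sketch below (illustrative names) places the two loudspeakers at ±φ0 on the unit circle and solves Eq. (3.9) with an explicit 2 × 2 inverse:

```python
import math

def vbap_gains(phi, phi0, C=1.0):
    """Two-dimensional VBAP gains via Eq. (3.9), g = p^T L12^-1,
    followed by the Eq. (3.10) scaling C*g/sqrt(g1**2 + g2**2).
    Angles in radians; loudspeakers assumed at +phi0 and -phi0."""
    # Unit vectors toward the loudspeakers (the rows of L12) and the source.
    l1 = (math.cos(phi0), math.sin(phi0))
    l2 = (math.cos(-phi0), math.sin(-phi0))
    p = (math.cos(phi), math.sin(phi))
    # Explicit inverse of the 2x2 matrix L12 = [l1; l2].
    det = l1[0] * l2[1] - l1[1] * l2[0]
    inv = ((l2[1] / det, -l1[1] / det), (-l2[0] / det, l1[0] / det))
    # g = p^T L12^-1 (row vector times matrix).
    g1 = p[0] * inv[0][0] + p[1] * inv[1][0]
    g2 = p[0] * inv[0][1] + p[1] * inv[1][1]
    # Eq. (3.10) normalization.
    scale = C / math.hypot(g1, g2)
    return g1 * scale, g2 * scale
```

For φ0 = 45° and φ = 0 this returns equal gains, and the resulting factors satisfy the tangent law of Eq. (3.3), as stated above.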
The virtual source can be produced by the loudspeaker on the active arc of which the
virtual source is located. Thus the sound field that can be produced with VBAP is a
union of the active arcs of the available loudspeaker bases. In two-dimensional cases
the best way to choose the loudspeaker bases is to form them from adjacent
loudspeakers. In the loudspeaker system illustrated in figure (3.4) the selected bases
would be L12, L23, L34, L45, and L51. The active arcs of the bases are thus nonoverlapping.
The use of nonoverlapping active arcs provides continuously changing gain factors
when moving virtual sources are applied. When the sound moves from one pair to
another, the gain factor of the loudspeaker that is no longer used after the change
gradually approaches zero before the change-over point.
The fact that all other loudspeakers except the selected pair are idle may seem a waste
of resources. In this way, however, good localization accuracies can be achieved for the
principal sound, whereas the other loudspeakers may produce reflections and
reverberation as well as other elements.
When the tool is initialized, the directions of the loudspeakers are measured relative to
the best listening position and loudspeaker pairs are formed from adjacent
loudspeakers. The Lnm−1 matrices are calculated for each pair and stored in the memory
of the panning system.
During the run time the system performs the following steps in an infinite loop:
• New direction vectors p(1,….,n) are defined.
• The right pairs are selected.
• The new gain factors are calculated.
• The old gain factors are cross faded to new ones and the loudspeaker bases are
changed if necessary.
The pair can be selected by calculating unscaled gain factors with Eq. (3.9) using all
selected vector bases, and by selecting the base that does not produce any negative
factors. In practice it is recommended to choose the pair whose smallest gain factor is
largest, because a lack of numerical accuracy during calculation may produce slightly
negative gain factors in some cases. Any negative factor must be set to zero before
normalization.
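The selection rule can be sketched as follows (illustrative names):

```python
def select_pair(gain_pairs):
    """Given the unscaled gain pairs computed with Eq. (3.9) for each
    loudspeaker base, choose the base whose smallest gain factor is
    largest (robust against slightly negative values caused by limited
    numerical accuracy), and clamp any negative factor to zero before
    normalization."""
    best = max(range(len(gain_pairs)), key=lambda i: min(gain_pairs[i]))
    gains = tuple(max(g, 0.0) for g in gain_pairs[best])
    return best, gains
```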
_________________________________________________________________________
Page 57 Topic 3 – Noise Control
CHAPTER 1
SOUND ABSORPTION
To obtain a standing wave (incident and reflected waves) inside a tube, a loudspeaker
driven at a certain frequency and amplitude is placed at one end of the tube, and the
sample of absorbing material, backed by a rigid termination, at the other end.
A microphone that can be moved along the inside of the tube is used to traverse the
standing wave and to detect the positions of its maximum and minimum values.
1.2.1 Procedures:
1- Drive the speaker at a certain frequency and amplitude, then place the material
whose absorption coefficient is to be measured at one end of the tube.
2- Move the microphone inside the tube (the speaker is connected at the other end of
the tube); using the oscilloscope we can detect the maximum and minimum values
Vmax & Vmin.
The standing-wave ratio and the absorption coefficient then follow from
σ = Vmax / Vmin
α = 4σ / (σ + 1)²
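The two formulas translate directly into code; a minimal sketch with illustrative names:

```python
def absorption_coefficient(v_max, v_min):
    """Standing-wave-tube estimate: the standing-wave ratio is
    sigma = Vmax/Vmin and the absorption coefficient is
    alpha = 4*sigma/(sigma + 1)**2."""
    sigma = v_max / v_min
    return 4.0 * sigma / (sigma + 1.0) ** 2
```

σ = 1 (no standing-wave variation) gives α = 1, i.e. total absorption; a large σ means a strong reflection and a small α.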
The measured average absorption coefficients were:
• Open cell foam: α average = 0.383
• Carpet underlay: α average = 0.554
• Mineral wool slab (low density): α average = 0.813
• Mineral wool slab (high density): α average = 0.875
1.3 Sound Absorption by Vibrating or Perforated Boundaries
For the acoustics of a room it does not make any difference whether the apparent
absorption of a wall is physically brought about by dissipative processes, i.e. by
conversion of sound energy into heat, or by part of the energy penetrating through the
wall into the outer space. In this respect an open window is a very effective absorber,
since it acts as a sink for all the arriving sound energy.
A less trivial case is that of a wall, or some part of a wall, forced by a sound field into
vibration with substantial amplitude. Then a part of the wall's vibrational energy is
re-radiated into the outer space. Viewed from the interior of the room, this part is
withdrawn from the incident sound energy. Thus the effect is the same as if it were
really absorbed, and it can therefore also be described by an absorption coefficient. In
practice this sort of absorption occurs with doors, windows, light partition walls,
suspended ceilings, circus tents and similar structures.
[Figure: a plane wave with incident pressure p1 and reflected pressure p2 on one side of
the wall, and transmitted pressure p3 on the other side]
This process, which may be quite involved especially for oblique sound incidence, is
very important in all problems of sound insulation. From the viewpoint of room
acoustics, it is sufficient, however, to restrict discussions to the simplest case of a plane
sound wave impinging perpendicularly onto the wall, whose dynamic properties are
completely characterized by its mass inertia. Then we need not consider the
propagation of bending waves on the wall.
Let us denote the sound pressures of the incident and the reflected waves on the surface
of a wall by p1 and p2, and the sound pressure of the transmitted wave by p3. The total
pressure acting on the wall is then p1 + p2 − p3. It is balanced by the inertial force iωMυ,
where M denotes the mass per unit area of the wall and υ the velocity of the wall
vibrations. This velocity is equal to the particle velocity of the wave radiated from the
rear side, for which p3 = ρcυ holds. Therefore we have p1 + p2 − ρcυ = iωMυ, from
which we obtain for the wall impedance:
Z = iωM + ρc
The absorption coefficient is then
α = [1 + (ωM / 2ρc)²]⁻¹ ≈ (2ρc / ωM)²
where c: speed of sound in air
ρ: density of air
ω = 2πf
f: frequency of the incident sound
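As a numerical sketch of this mass law (the air values ρ ≈ 1.2 kg/m³ and c ≈ 343 m/s are assumed defaults, not taken from the text):

```python
import math

def wall_absorption(M, f, rho=1.2, c=343.0):
    """Apparent absorption coefficient of a mass-controlled wall:
    alpha = [1 + (omega*M/(2*rho*c))**2]**-1, which approaches
    (2*rho*c/(omega*M))**2 when omega*M >> rho*c."""
    omega = 2.0 * math.pi * f
    return 1.0 / (1.0 + (omega * M / (2.0 * rho * c)) ** 2)
```

A heavier wall, or a higher frequency, re-radiates a smaller fraction of the incident energy, so α falls roughly with the square of ωM.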
It is a fact of great practical interest that a rigid perforated plate or panel has essentially
the same properties as a mass-loaded wall or foil. Each hole in a plate may be
considered as a short tube or channel with length b; the mass of the air contained in it,
divided by the cross section, is ρb.
Because of the contraction of the air stream passing through the hole, the air vibrates
with a greater velocity than that in the sound wave remote from the wall, and hence the
inertial forces of the air included in the hole are increased. The increase is given by the
ratio s2/s1, where s1 is the area of the hole and s2 is the plate area per hole. Hence the
equivalent mass of the perforated panel per unit area is M = ρb′/σ with σ = s1/s2.
The geometrical tube length b has been replaced by the effective length b′, which is
somewhat larger than b: b′ = b + 2δb.
The correction term 2δb, known as the end correction, accounts for the fact that the
streamlines cannot contract or diverge abruptly but only gradually when entering or
leaving a hole. For circular apertures with radius a and with relatively large lateral
distances it is given by δb = 0.8a.
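The equivalent mass of a perforated panel can be sketched as follows (illustrative names; ρ ≈ 1.2 kg/m³ for air is an assumed default):

```python
def perforated_panel_mass(b, a, s1, s2, rho=1.2):
    """Equivalent mass per unit area of a perforated panel:
    M = rho*b_eff/sigma with sigma = s1/s2 (hole area over plate area
    per hole), b_eff = b + 2*delta_b, and the end correction
    delta_b = 0.8*a for circular holes of radius a."""
    sigma = s1 / s2
    b_eff = b + 2.0 * 0.8 * a
    return rho * b_eff / sigma
```

With no end correction (a = 0) and no perforation contraction (σ = 1) this reduces to the plain air mass ρb.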
CHAPTER 2
SOUND TRANSMISSION
The transmission coefficient τ is the fraction of the incident sound power that passes
through the wall:
τ = 1: all sound is transmitted, T.L. = 0
τ = 0: no sound is transmitted, T.L. = ∞
τ = 0.2: 20% of the incident sound is transmitted
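These limiting cases follow from the standard definition of the transmission loss, T.L. = 10 log10(1/τ); a one-line sketch:

```python
import math

def transmission_loss(tau):
    """Transmission loss in dB from the transmitted power fraction tau:
    tau = 1 gives 0 dB, and T.L. grows without bound as tau -> 0."""
    return 10.0 * math.log10(1.0 / tau)
```

τ = 0.2 (20% transmitted) corresponds to a T.L. of about 7 dB.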
Due to the definition of the rating, STC is only applicable to air-borne sound, and is a
poor guideline for construction to contain mechanical equipment noise, music,
transportation noise, or any other noise source which is more heavily weighted in the
low frequencies than speech. The standard thus only considers frequencies above 125
Hz. Composite partitions composed of elements such as doors and windows as well as
walls will tend to have an STC close to the value of the lowest STC of any component.
In laboratory tests, the STC rating of that particular wall section varies from STC 47 to
STC 51. Any cracks in the wall or holes for electrical or mechanical servicing will
further reduce the actual result; rigid connections between wall surfaces can also
seriously degrade the wall performance. The higher the target STC, the more critical
are the sealing and structural isolation requirements. The builder's best option for
getting a satisfactory STC result is to specify partitions with a laboratory rating of
STC 54 or better.
The next step is to place the material on the separating wall between two rooms; sound
transmission between the rooms is then measured again in the receiving room. The
sound level at the ''receiver'' is subtracted from the sound level at the ''source''. The
resulting difference is the transmission loss, or ''TL''.
Now we can use this value of TL to determine the sound transmission class (STC)
value by a graphical or by a numerical procedure.
By graphical procedure:
the TL is plotted on a graph of 1/3-octave band center frequency versus level (in
dB). Now this is where it can get confusing. To get the STC, the measured curve is
compared to a reference STC curve.
By numerical procedure:
The numerical procedure for determining the STC is easier than the graphical
procedure. It consists of the following steps.
Step 2: Write the adjustment factor against each frequency in column 3. These
factors are based on the STC contour, assuming the adjustment is zero at 500 Hz.
Step 3: Add the TL and adjustment factors in column 4. This is the adjusted TL
(TLadj).
Step 4: Note the least TLadj value and circle it in column 4.
Step 5: Add eight to the least TLadj. This is the first trial STC (STCtrial); write this
value in column 5.
Step 6: Subtract STCtrial from TLadj, i.e., determine (TLadj − STCtrial).
If this value is positive, enter zero in column 5; if negative, enter the actual value.
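Steps 2 to 6 can be sketched as below (Step 1, which tabulates the measured TL per band, is assumed; names are illustrative):

```python
def stc_numerical_steps(tl, adjustment):
    """Steps 2-6 of the numerical STC procedure: column 4 is the adjusted
    TL (TLadj = TL + adjustment factor), the first trial STC is the least
    TLadj plus eight, and the final column holds TLadj - STCtrial where
    that difference is negative (zero otherwise)."""
    tl_adj = [t + a for t, a in zip(tl, adjustment)]
    stc_trial = min(tl_adj) + 8
    deficiency = [min(t - stc_trial, 0.0) for t in tl_adj]
    return stc_trial, tl_adj, deficiency
```

With the double-layer-of-cork values of Table 2-5 this reproduces STCtrial = 27 and deficiencies of −4.7, −7.5, −8 and −7.9.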
Single layer of cork:
[Table: measured levels and TL for a single layer of cork; at 4000 Hz: 87 dB source
level, 72 dB received level, TL 15 dB]
Frequency (Hz)   TL (dB)   Adjustment factor   Adjusted TL (TLadj)   TLadj − STCtrial (STCtrial = 27)
500              22.3      0                   22.3                  −4.7
1000             22.5      −3                  19.5                  −7.5
2000             23        −4                  19                    −8
4000             23.1      −4                  19.1                  −7.9
Table 2-5 The calculation of STC for a double layer of cork
We discuss the various factors that affect sound transmission through different types of
concrete block walls, including single-leaf walls and double-leaf walls. Knowledge of
these factors will assist construction practitioners to design and build walls with high
levels of acoustic performance economically.
Concrete block walls are commonly used to separate dwelling units from each other
and to enclose mechanical equipment rooms in both residential and office buildings
because of their inherent mass and stiffness. Neither of these fundamental properties
(mass and stiffness) can be altered by users. However, there are additional factors that
need to be considered in building high quality walls.
Figure 2-1 shows measured STC ratings for single-leaf concrete block walls from a
number of published sources. The considerable scatter demonstrates that, while
important, block weight is not the only factor that determines the STC for this type of
wall. In the absence of measured data, the regression line in Figure 2-1 can be used to
estimate the STC from the block weight. Alternatively, Table 2-6 provides
representative STC values for 50% solid blocks that have been sealed on at least one
side. It shows the relatively modest effects of significant increases in wall thickness on
STC.
Adding materials such as sand or grout to the cores of the blocks simply increases the
weight; the increase in STC can be estimated from Figure 2-1.
Figure 2-1 also shows that using heavier block to get an STC rating of much more than
50 leads to impracticably heavy constructions.
Figure 2-1 Effect of block weight on STC for single-leaf concrete block walls.
Table 2-6 STC ratings for 50% solid normal-weight and lightweight block walls sealed on at
least one side
Effect of Porosity
When the concrete block is porous, sealing the surface with plaster or block sealer
significantly improves the sound insulation; the more porous the block, the greater the
improvement. Improvements of 5 to 10 STC points, or even more, are not uncommon
for some lightweight block walls after sealing. Conversely, normal-weight blocks
usually show little or no improvement after sealing. This improvement in STC in
lightweight blocks is related to the increased airflow resistivity of these blocks.
2.4.2 Double-Leaf Concrete Block Walls
In principle, double-leaf masonry walls can provide excellent sound insulation. They
appear to meet the prescription for an ideal double wall: two independent, heavy layers
separated by an air space. In practice, constructing two block walls that are not solidly
connected somewhere is very difficult. There is always some transmission of sound
energy along the wire ties, the floor, the ceiling and the walls abutting the periphery of
the double-leaf wall, and through other parts of the structure. This transmitted energy,
known as flanking transmission, seriously impairs the effectiveness of the sound
insulation. Flexible ties and physical breaks in the floor, ceiling and abutting walls are
needed to reduce it.
Even if such measures are considered in the design, mortar droppings or other debris
can bridge the gap between the layers and increase sound transmission. Such errors are
usually concealed and impossible to rectify after the wall has been built.
A double-leaf wall that was expected to attain an STC of more than 70 provided an
STC of only 60 because of mortar droppings that connected the two leaves of the wall.
In general, the greater the TL, the better the performance of the block wall. It is
possible to construct high-quality walls that meet the most acoustically demanding
situations; this can be achieved by adjusting the mass of the wall and the amount of
sound-absorbing material.
Noise reduction at work areas inside production facilities is essential not only to
conserve employees' hearing, but also to help them accomplish their work efficiently.
In order to achieve noise reduction between two rooms, the wall (or floor) separating
them must transmit only a small fraction of the sound energy that strikes it. This is
achieved by using a good absorbing material on the separating wall, so that less sound
energy is transmitted and the transmission loss is higher. In other words, the greater the
TL, the better the wall is at reducing noise. Now we can get the relation between the
noise reduction of the wall and the transmission loss as shown below.
1. Generate noise using the noise source; in our experiment we use a dc-motor as
shown in figure 2-2.
Figure 2-2
2. Prepare the equipment (internal microphone, external microphone, and sound level
meter).
Figure 2-3
4. Then, with the internal microphone, the external microphone and the sound level
meter, we measure the internal sound level and the external sound level.
5. By subtracting the two sound levels we obtain the noise reduction of the material
used.
So the noise reduction due to using open cell foam as the absorbing material is equal to 5 dBA.
Carpet underlay
The sound level without using the absorbing material = 79.5 (dBA).
So the noise reduction due to using carpet underlay as the absorbing material is equal to 6 dBA.
Mineral wool slab (low density)
The sound level without using the absorbing material = 79.5 (dBA).
So the noise reduction due to using mineral wool slab (low density) as the absorbing
material is equal to 7 dBA.
Mineral wool slab (high density)
The sound level without using the absorbing material = 79.5 (dBA).
The noise reduction due to using mineral wool slab (high density) as the absorbing
material was the largest of the four materials.
Material                          α average   Noise reduction
Open cell foam                    0.383       5 dBA
Carpet underlay                   0.554       6 dBA
Mineral wool slab (low density)   0.813       7 dBA
Now, from the results shown, we conclude that, for the absorbing materials tested, a
material with a high absorption coefficient makes a good insulating material, because it
gives a high noise reduction value; so of the four materials above, the best one to use is
mineral wool slab (high density).
____________________________________________________________________________
Page 75 Topic 4 – Speech Technology
CHAPTER 1
SPEECH PRODUCTION
1.1 Introduction
In this chapter, we take a look at the physiological and acoustic aspects of speech
production and of speech perception, which will help to prepare the ground for later
chapters on the electronic processing of speech signals.
The human apparatus concerned with speech production and perception is complex and
uses many important organs—the lungs, mouth, nose, ears and their controlling
muscles and the brain.
When we consider that most of these organs serve other purposes such as breathing or
eating it is remarkable that this apparatus has developed to enable us to make such a
wide variety of easily distinguishable speech utterances.
1.2.1 Breathing
The use of exhaled breath is essential to the production of speech. In quiet breathing, of
which we are not normally aware, inhalation is achieved by increasing the volume of
the lungs by lowering the diaphragm and expanding the rib-cage. This reduces the air
pressure in the lungs which causes air from outside at higher pressure to enter the lungs.
Expiration is achieved by relaxing the muscles used in inspiration so that the volume of
the lungs is reduced due to the elastic recoil of the muscles, the reverse movement of
the rib-cage and gravity, thus increasing air pressure in the lungs and forcing air out.
The form of expiration achieved by relaxing the inspiratory muscles cannot be
controlled sufficiently to achieve speech or singing. For these activities, the inspiratory
muscles are used during exhalation to control lung pressure and prevent the lungs from
collapsing suddenly; when the volume is reduced below that obtained by elastic recoil,
expiratory muscles are used. Variations in speech intensity needed, for example, to
stress certain words are achieved by varying the pressure in the lungs; in this respect
speech differs from the production of a note sung at constant intensity.
The vocal cords are at rest when open. Their tension and elasticity can be varied; they
can be made thicker or thinner, shorter or longer, and they can be either closed, opened
wide, or held in some position in between.
When the vocal cords are held together for voicing they are pushed open for each
glottal pulse by the air pressure from the lungs; closing is due to the cords’ natural
elasticity and to a sudden drop in pressure between the cords (the Bernoulli principle).
Considered in vertical cross-section the cords do not open and close uniformly, but
open and close in a rippling movement from bottom to top as shown in Fig. 1.2.
The frequency of vibration is determined by the tension exerted by the muscles, the
mass and the length of the cords. Men have cords between 17 and 24mm in length;
those of women are between 13 and 17mm. The average fundamental or voicing
frequency (the frequency of the glottal pulses) for men is about 125 Hz, for women
about 200 Hz and for children more than 300 Hz.
When the vocal cords vibrate, harmonics are produced at multiples of the fundamental
frequency; the amplitude of the harmonics decreases with increasing frequency. Figure
1.3 shows the range of the human voice.
1.2.3 The vocal tract
For both voiced and unvoiced speech, the sound that is radiated from the speaker's face
is a modification of the original vibration caused by the resonances of the vocal tract.
The oral tract is highly mobile, and the positions of the tongue, pharynx, palate, lips and
jaw all affect the speech sounds we hear as radiation from the lips or nostrils. The nasal
tract is immobile, but can be coupled in to form part of the vocal tract depending on the
position of the velum. Combined voiced and unvoiced sounds can also be produced, as
voiced consonants.
The major speech articulators are shown in Fig. 1.1. When the velum is closed the oral
and pharyngeal cavities combine to form the voice resonator. The tongue can move
both up and down and forward and back, thus altering the shape of the vocal tract; it
can also be used to constrict the tract for the production of consonants. By moving the
lips outward the length of the vocal tract can be increased. The nasal cavity is coupled
in when the velum is opened for sounds such as /m/, in ‘hum’; here the vocal tract is
closed at the lips and acts as a side branch resonator.
1.3.2 Voiced, unvoiced and plosive sounds
As we have seen, voiced sounds, for example the vowel sounds /a/, /e/ and /i/, are
generated by vibration of the vocal cords, which are stretched across the top of the
trachea. The pressure of air flow from the lungs causes the vocal cords to vibrate. The
fundamental pitch of the voicing is determined by the air flow, but mainly by the
tension exerted on the cords.
Unvoiced sounds are produced by frication caused by turbulence of air at a constriction
in the vocal tract. The nature of the sound is determined by the site of the constriction
and the position of the articulators (e.g. the tongue or the lips). Examples of unvoiced
sounds are /f/, /s/ or /sh/. Mixed voiced and unvoiced sounds occur where frication and
voicing are simultaneous. For example, if voicing is added to the /f/ sound it becomes
/v/; if added to /sh/ it becomes /zh/ as in 'azure'.
Silence occurs within speech, but in fluent speech it does not occur between words
where one might expect it. It most commonly occurs just before the stop in a plosive
sound. The duration of these silences is of the order of 30 to 50 ms.
A common approach to understanding speech production and the processing of speech
signals is to use a source-filter model of the vocal tract. The model is usually
implemented in electronic form but has also been implemented mechanically. In the
electronic form an input signal is produced either by a pulse generator offering a
harmonic-rich repetitive waveform, or a broadband noise signal is generated digitally by
means of a pseudorandom binary sequence generator. The input signal is passed through
a filter which has the same characteristics as the vocal tract.
The parameters of the filter can clearly not be kept constant, but must be varied to
correspond to the modification of the vocal tract made by movement of the speech
articulators. The filter thus has time-variant parameters; in practice the rate of variation
is slow, with parameters being updated at intervals of between 5 and 25 ms.
The pitch of the voiced excitation is subject to control as is the amplitude of the output,
in order to provide a fairly close approximation to real speech. The pitch period may
vary from 20ms for a deep-voiced male to 2 ms for a high-pitched child or female.
In the case of the pulse waveform the spectrum will consist of a regular pattern of lines
which are spaced apart by the pitch frequency. For a noise waveform, considered as the
summation of a large number of randomly arriving impulses the distribution will
approximate to a continuous function. In both cases the energy distribution decreases
with an increase of frequency, but there are significant levels up to 15 to 20 kHz.
Frequency shaping is provided by the filter characteristic which is applied to the signal
in the frequency domain. Typically the filter characteristic will consist of a curve where
the various resonances of the vocal tract appear as peaks or poles of transmission. The
frequency at which these poles occur represents the formant frequencies and will
change for the various speech sounds.
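A minimal sketch of this source-filter idea: a pulse train as voiced excitation and a single fixed two-pole resonance standing in for one formant (a real vocal-tract filter cascades several such resonances and updates their parameters every 5 to 25 ms; all names and parameter values are illustrative):

```python
import math

def glottal_pulse_train(f0, fs, n_samples):
    """Impulse train at the pitch frequency f0: the voiced excitation."""
    period = int(round(fs / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def formant_resonator(x, f_formant, bandwidth, fs):
    """One two-pole resonance applied to the excitation; its peak sits at
    f_formant, standing in for a single formant of the vocal tract."""
    r = math.exp(-math.pi * bandwidth / fs)            # pole radius < 1: stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * f_formant / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + a1 * y1 + a2 * y2                    # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y.append(out)
        y1, y2 = out, y1
    return y
```

Driving the resonator with broadband noise instead of the pulse train models the unvoiced case.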
1.5 Perception
1.5.1 Pitch and loudness
The pitch at which speech is produced depends on many factors such as the frequency
of excitation of the vocal cords, the size of the voice box or larynx and the length of the
vocal cords. Pitch also varies within words to give more emphasis to certain syllables.
The loudness of speech will generally depend on the circumstances, such as the
emotions of the speaker. Variations in loudness are produced by the muscles of the
larynx which allow a greater flow of air, thus producing the ‘sore throat’ feeling when
the voice has to be raised for a period to overcome noise. Loudness is also affected by
the flow of air from the lungs, which is the principal means of control in singing.
CHAPTER 2
PROPERTIES OF SPEECH SIGNALS IN TIME DOMAIN
2.1 Introduction
We are now beginning to see how digital signal processing methods can be applied to
speech signals. Our goal in processing the speech signal is to obtain a more convenient
or more useful representation of the information carried by the speech signal.
In this chapter we shall be interested in a set of processing techniques that are
reasonably termed time-domain methods. By this we mean simply that the processing
methods involve the waveform of the speech signal directly.
Some examples of representations of the speech signal in terms of time-domain
measurements include average zero-crossing rate, energy, and the autocorrelation
function. Such representations are attractive because the required digital processing is
simple to implement, and, in spite of this simplicity, the resulting representations
provide a useful basis for estimating important features of the speech signal.
2.3 Short-Time Average Zero-Crossing Rate
In the context of discrete-time signals, a zero-crossing is said to occur if successive
samples have different algebraic signs. The rate at which zero crossings occur is a
simple measure of the frequency content of a signal. This is particularly true of
narrowband signals. For example, a sinusoidal signal of frequency F0 sampled at a rate
Fs has Fs/F0 samples per cycle of the sine wave. Each cycle has two zero crossings so
that the long-time average rate of zero-crossings is Z = 2F0/Fs crossings per sample.
Thus, the average zero-crossing rate gives a reasonable way to estimate the frequency
of a sine wave.
Speech signals are broadband signals and the interpretation of average zero-crossing
rate is therefore much less precise. However, rough estimates of spectral properties can
be obtained using a representation based on the short- time average zero-crossing rate.
Let us see how the short-time average zero-crossing rate applies to speech signals. The
model for speech production suggests that the energy of voiced speech is concentrated
below about 3 kHz because of the spectrum fall-off introduced by the glottal wave,
whereas for unvoiced speech, most of the energy is found at higher frequencies. Since
high frequencies imply high zero-crossing rates, and low frequencies imply low
zero-crossing rates, there is a strong correlation between zero-crossing rate and energy
distribution with frequency. A reasonable generalization is that if the zero-crossing rate
is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech
signal is voiced.
Figure 2.11 shows a histogram of average zero-crossing rates (averaged over 10 ms)
for both voiced and unvoiced speech. Note that a Gaussian curve provides a reasonably
good fit to each distribution. The mean short-time average zero-crossing rate is 49 per
10 ms for unvoiced and 14 per 10 ms for voiced. Clearly the two distributions overlap,
so that an unequivocal voiced/unvoiced decision is not possible based on short-time
average zero-crossing rate alone. Nevertheless, such a representation is quite useful in
making this distinction.
[Figure 2.11: distributions of the zero-crossing rate for unvoiced and voiced speech]
An appropriate definition of the short-time zero-crossing rate is

Zn = Σ (m = −∞ to ∞) |sgn[x(m)] − sgn[x(m − 1)]| w(n − m) (2.1)

where

sgn[x(n)] = 1,  x(n) ≥ 0
          = −1, x(n) < 0 (2.2)

and

w(n) = 1/(2N), 0 ≤ n ≤ N − 1
     = 0, otherwise (2.3)
This can be computed with a simple MATLAB script. This measure allows the
discrimination between voiced and unvoiced regions of speech, or between speech and
silence, since unvoiced speech has, in general, a higher zero-crossing rate. The signals
in the graphs are normalized.
% Short-time average zero-crossing rate with a rectangular window
wRect = rectwin(winLen);
ZCR = STAzerocross(speechSignal, wRect, winOverlap);
plot(t, speechSignal/max(abs(speechSignal)));    % normalized speech
title('speech: He took me by surprise');
hold on;
delay = (winLen - 1)/2;                          % compensate the window delay
plot(t(delay+1:end-delay), ZCR/max(ZCR), 'r');   % normalized ZCR overlaid
xlabel('Time (sec)');
legend('Speech', 'Average Zero Crossing Rate');
hold off;
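For comparison, Eq. (2.1) can also be sketched directly in Python with NumPy. The `STAzerocross` helper above appears to be course-provided; the function below is an assumed equivalent, and the 100-sample rectangular window and test sine are illustrative:

```python
import numpy as np

def short_time_zcr(x, win_len):
    """Short-time average zero-crossing rate per Eq. (2.1) with the
    rectangular window w(n) = 1/(2N) of Eq. (2.3)."""
    s = np.where(np.asarray(x, dtype=float) >= 0.0, 1.0, -1.0)  # sgn[] of Eq. (2.2)
    d = np.abs(np.diff(s))          # |sgn x(m) - sgn x(m-1)|: 2 at a crossing, else 0
    w = np.ones(win_len) / (2.0 * win_len)
    return np.convolve(d, w, mode='valid')  # one ZCR value per window position

# A 100 Hz sine at 8 kHz crosses zero 200 times/s: 2 or 3 crossings per
# 100-sample window, so Z stays between 0.02 and 0.03 crossings/sample here.
fs = 8000
t = np.arange(fs) / fs
zcr = short_time_zcr(np.sin(2 * np.pi * 100 * t), win_len=100)
```

A stretch with high ZCR would then flag unvoiced speech and a low-ZCR stretch voiced speech, as in the discussion above.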
2.4 Pitch period estimation
One of the most important parameters in speech analysis, synthesis, and coding
applications is the fundamental frequency, or pitch, of voiced speech.
Pitch frequency is directly related to the speaker and is one of the characteristics
that make a person's voice unique.
Voicing is generated when the airflow from the lungs is periodically interrupted by
movement of the vocal cords. The time between successive vocal cord openings is
called the fundamental period, or pitch period.
For men, the possible pitch frequency range is usually found somewhere between 50
and 250 Hz, while for women the range usually falls between 120 and 500 Hz. In terms
of period, the range for males is 4 to 20 ms, while for females it is 2 to 8 ms.
The pitch period must be estimated at every frame. By comparing a frame with past
samples, it is possible to identify the period in which the signal repeats itself, resulting
in an estimate of the actual pitch period. Note that the estimation procedure makes
sense only for voiced frames; meaningless results are obtained for unvoiced frames due
to their random nature.
2.4.1 Autocorrelation method
The autocorrelation reflects the similarity between the frame s[n], n = m−N+1 to m,
and the time-shifted version s[n−l], where l is a positive integer representing a time lag.
The range of lag is selected so that it covers a wide range of pitch period values. For
instance, for l = 20 to 147 (2.5 to 18.4 ms at 8 kHz sampling rate), the possible pitch
frequency values range from 54.4 to 400 Hz. This range of l is applicable for most
speakers and can be encoded using 7 bits, since there are 2^7 = 128 values of pitch
period.
By calculating the autocorrelation values for the entire range of lag, it is possible to find
the value of lag associated with the highest autocorrelation, which represents the pitch
period estimate, since, in theory, autocorrelation is maximized when the lag is equal
to the pitch period.
The method is summarized with the following pseudocode:
PITCH (m, N)
1. peak ← 0
2. for l ← 20 to 150
3.   autocorrelation ← 0
4.   for n ← m−N+1 to m
5.     autocorrelation ← autocorrelation + s[n] s[n−l]
6.   if autocorrelation > peak
7.     peak ← autocorrelation
8.     lag ← l
9. return lag
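A direct Python transcription of this pseudocode; the frame position m, length N, and the synthetic test signal below are illustrative assumptions:

```python
import numpy as np

def pitch_autocorr(s, m, N, lag_min=20, lag_max=150):
    """Return the lag l maximizing the frame autocorrelation
    R(l) = sum_{n=m-N+1}^{m} s[n] * s[n-l], as in PITCH(m, N)."""
    s = np.asarray(s, dtype=float)
    frame = s[m - N + 1 : m + 1]
    peak, lag = -np.inf, lag_min
    for l in range(lag_min, lag_max + 1):
        shifted = s[m - N + 1 - l : m + 1 - l]   # same frame delayed by l samples
        r = float(np.dot(frame, shifted))         # autocorrelation at lag l
        if r > peak:
            peak, lag = r, l
    return lag

# Synthetic "voiced" signal with an exact 80-sample (100 Hz at 8 kHz) period.
n = np.arange(1600)
s = np.sin(2 * np.pi * n / 80) + 0.5 * np.sin(4 * np.pi * n / 80)
print(pitch_autocorr(s, m=1200, N=320))   # → 80
```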
It is important to mention that the speech signal is often low-pass filtered before being
used as input for pitch period estimation. Since the fundamental frequency associated
with voicing is located in the low-frequency region (below 500 Hz), low-pass filtering
eliminates the interfering high-frequency components as well as out-of-band noise,
leading to a more accurate estimate.
MATLAB Example
[s, Fs, bits] = wavread('sample6');
autoc = xcorr(s, 'unbiased');   % autocorrelation of the signal
plot(autoc)
x = s(1000:1320);               % voiced portion used for pitch estimation
plot(x)
Figure 2.3 A voiced portion of a speech waveform used in pitch period estimation
Figure 2.4 Autocorrelation values obtained from the waveform of figure 2.3
2.4.2 Average magnitude difference function
The average magnitude difference function is defined by

MDF[l, m] = Σ_{n=m−N+1}^{m} |s[n] − s[n−l]|

For short segments of voiced speech it is reasonable to expect that s[n] − s[n−l] is
small for l = 0, ±T, ±2T, … with T being the signal's period.
Thus, by computing the magnitude difference function for the lag range of interest, one
can estimate the period by locating the lag value associated with the minimum
magnitude difference.
Note that no products are needed for the implementation of the present method. The
following pseudocode summarizes the procedure:
PITCH_MD (m, N)
1. min ← ∞
2. for l ← 20 to 150
3.   mdf ← 0
4.   for n ← m−N+1 to m
5.     mdf ← mdf + |s[n] − s[n−l]|
6.   if mdf < min
7.     min ← mdf
8.     lag ← l
9. return lag
MATLAB Example
[s, Fs, bits] = wavread('sample6');
x = s(1000:1320);
for k = 1:240
    amdf(k) = 0;
    for n = 1:240-k+1
        amdf(k) = amdf(k) + abs(x(n) - x(n+k-1));
    end
    amdf(k) = amdf(k)/(240-k+1);   % average magnitude difference at lag k-1
end
plot(amdf)
Figure 2.5 Magnitude difference values obtained from the waveform of figure 2.3
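The same estimate can be sketched in Python using the magnitude difference criterion in place of the autocorrelation; frame position, length, and the test sine are illustrative assumptions:

```python
import numpy as np

def pitch_amdf(s, m, N, lag_min=20, lag_max=150):
    """Return the lag l minimizing MDF(l) = sum |s[n] - s[n-l]| over the
    frame s[m-N+1..m], as in PITCH_MD(m, N); no products are needed."""
    s = np.asarray(s, dtype=float)
    frame = s[m - N + 1 : m + 1]
    best, lag = np.inf, lag_min
    for l in range(lag_min, lag_max + 1):
        shifted = s[m - N + 1 - l : m + 1 - l]
        mdf = float(np.sum(np.abs(frame - shifted)))  # dips near period multiples
        if mdf < best:
            best, lag = mdf, l
    return lag

n = np.arange(1600)
s = np.sin(2 * np.pi * n / 80)        # exact 80-sample period
print(pitch_amdf(s, m=1200, N=320))   # → 80
```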
CHAPTER 3
SPEECH REPRESENTATION IN FREQUENCY DOMAIN
3.1 Introduction
As discussed in chapter 1, the vibration of the vocal cords in voicing produces sound at
a sequence of frequencies, the natural harmonics, each of which is a multiple of the
fundamental frequency. Our ears judge the pitch of the sound from the
fundamental frequency. The smallest element of speech sound that indicates a
difference in meaning is a phoneme. The formant frequencies for each phoneme are
quite distinct, but for a given phoneme they usually have similar values regardless of
who is speaking.
The fundamental frequency will vary depending on the person speaking, mood, and
emphasis, but it is the relationship of the formant frequencies that makes each voiced
sound easily recognizable.
In this chapter we will concentrate on the formant analysis of the speech signal, and the
extraction of the formant frequencies of different speech sounds.
3.3.1 Spectrum scanning and peak-picking method
One approach to real-time automatic formant tracking is simply the detection and
measurement of prominences in the short-time amplitude spectrum. At least two
methods of this type have been designed and implemented. One is based upon locating
points of zero slope, and the other on the detection of local spectral maxima by
magnitude comparison.
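A minimal sketch of the second variant, detecting local maxima of the short-time amplitude spectrum by magnitude comparison. The FFT size, Hamming window, and the two-resonance test frame are assumptions for illustration, not the implemented system:

```python
import numpy as np

def spectral_peaks(frame, fs, nfft=1024):
    """Local maxima of the short-time amplitude spectrum, found by
    comparing each bin's magnitude with its two neighbors."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    # Return (magnitude, frequency in Hz) pairs, strongest first.
    return sorted(((mag[k], k * fs / nfft) for k in peaks), reverse=True)

# Two "formant-like" components at 500 Hz and 1500 Hz, sampled at 8 kHz.
fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 500 * t) + 0.8 * np.sin(2 * np.pi * 1500 * t)
top = spectral_peaks(frame, fs)[:2]
print(sorted(round(f) for _, f in top))   # → [500, 1500]
```

A real formant tracker would add smoothing across frames and continuity constraints, but the peak-picking core is as above.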
CHAPTER 4
SPEECH CODING
4.1 Introduction
In general, speech coding is a procedure to represent a digitized speech signal using as
few bits as possible while maintaining a reasonable level of speech quality. A less
popular name with the same meaning is speech compression.
Speech coding has matured to the point where it now constitutes an important
application area of signal processing. Due to the increasing demand for speech
communication, speech coding technology has received augmenting levels of interest
from the research, standardization, and business communities. Advances in
microelectronics and the vast availability of low-cost programmable processors and
dedicated chips have enabled rapid technology transfer from research to product
development; this encourages the research community to investigate alternative
schemes for speech coding, with the objective of overcoming deficiencies and
limitations. The standardization community pursues the establishment of standard
speech coding methods for various applications that will be widely accepted and
implemented by the industry. The business communities capitalize on the ever-
increasing demand and opportunities in the consumer, corporate, and network
environments for speech processing products.
Speech coding is performed using numerous steps or operations specified as an
algorithm. An algorithm is any well-defined computational procedure that takes some
value, or set of values, as input and produces some value, or set of values, as output. An
algorithm is thus a sequence of computational steps that transform the input into the
output. Many signal processing problems, including speech coding, can be formulated
as a well-specified computational problem; hence, a particular coding scheme can be
defined as an algorithm. In general, an algorithm is specified as a set of instructions
describing a task. With these instructions, a computer or processor can execute them so
as to complete the coding task. The instructions can also be translated to the structure of
a digital circuit, carrying out the computation directly at the hardware level.
2. Parametric coders
Within the framework of parametric coders, the speech signal is assumed to be
generated from a model, which is controlled by some parameters. During encoding,
parameters of the model are estimated from the input speech signal, with the parameters
transmitted as the encoded bit-stream. This type of coder makes no attempt to preserve
the original shape of the waveform, and hence SNR is a useless quality measure.
Perceptual quality of the decoded speech is directly related to the accuracy and
sophistication of the underlying model. Due to this limitation, the coder is signal
specific, having poor performance for nonspeech signals. There are several proposed
models in the literature. The most successful, however, is based on linear prediction. In
this approach, the human speech production mechanism is summarized using a time-
varying filter, with the coefficients of the filter found using the linear prediction
analysis procedure.
3. Hybrid coders
As its name implies, a hybrid coder combines the strengths of a waveform coder with
those of a parametric coder. Like a parametric coder, it relies on a speech production
model; during encoding, parameters of the model are estimated. Additional parameters of
the model are optimized in such a way that decoded speech is as close as possible to the
original waveform, with the closeness often measured by a perceptually weighted error
signal. As in waveform coders, an attempt is made to match the original signal with the
decoded signal in the time domain.
This class dominates the medium bit-rate coders, with the code-excited linear
prediction (CELP) algorithm and its variants the most outstanding representatives.
From a technical perspective, the difference between a hybrid coder and a parametric
coder is that the former attempts to quantize or represent the excitation signal to the
speech production model, which is transmitted as part of the encoder bit-stream. The
latter, however, achieves low bit-rate by discarding all detail information of the
excitation signals; only coarse parameters are extracted.
A hybrid coder tends to behave like a waveform coder for high bit-rate, and like a
parametric coder at low bit-rate, with fair to good quality for medium bit-rate.
Multimode coders were invented to take advantage of the dynamic nature of the speech
signal, and to adapt to the time-varying network conditions. In this configuration, one
of several distinct coding modes is selected, with the selection done by source control,
where the switching obeys some external commands in response to network needs or
channel conditions.
According to bit rate, speech coders are classified as:
1. High: >15 kbps
2. Medium: 5 to 15 kbps
3. Low: 2 to 5 kbps
4. Very low: <2 kbps
Since LPC offers a good quality vs. bit rate trade-off, it is the most commonly used
coding technique in various applications.
Applications
• FS1015 LPC (1984): secure communication in military applications
• TIA IS54 VSELP (1989): for TDMA digital cellular
• ETSI AMR ACELP (1999): for UMTS (Universal Mobile Telecommunication
  System), within 3GPP (3rd Generation Partnership Project)
• VoIP: Voice over IP
So our focus will be linear predictive coding (LPC).
4.4 Linear Predictive Coding (LPC)
A. Physical Model:
B. Mathematical Model:
LPC analyzes the speech signal by estimating the formants, removing their effects from
the speech signal, and estimating the intensity and frequency of the remaining buzz.
The process of removing the formants is called inverse filtering, and the remaining
signal is called the residue. The numbers which describe the formants and the residue
can be stored or transmitted somewhere else.
LPC synthesizes the speech signal by reversing the process: use the residue to create a
source signal, use the formants to create a filter (which represents the tube), and run the
source through the filter, resulting in speech. Because speech signals vary with time,
this process is done on short chunks of the speech signal, which are called frames.
Usually 30 to 50 frames per second give intelligible speech with good compression.
Figure (4.3) Block diagram of simplified model for speech production.
(Model elements: air → innovations; vocal cord vibration → voiced; vocal cord
vibration period → pitch period; fricatives and plosives → unvoiced; air volume → gain)
4.4.2 The LPC filter
H(z) = S(z)/U(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k})        (4.1)
• The digital speech signal is divided into frames of size 20 ms. There are
50 frames/second.
It may seem surprising that the signal can be characterized by such a simple linear
predictor. It turns out that, in order for this to work, the tube must not have any side
branches.
(In mathematical terms, side branches introduce zeros, which require much more
complex equations.)
For ordinary vowels, the vocal tract is well represented by a single tube. However, for
nasal sounds, the nose cavity forms a side branch. Theoretically, therefore, nasal sounds
require a different and more complicated algorithm. In practice, this difference is partly
ignored and partly dealt with during the encoding of the residue.
If the predictor coefficients are accurate, and everything else works right, the speech
signal can be inverse filtered by the predictor, and the result will be the pure source
(buzz). For such a signal, it's fairly easy to extract the frequency and amplitude and
encode them.
However, some consonants are produced with turbulent airflow, resulting in a hissy
sound (fricatives and stop consonants). Fortunately, the predictor equation doesn't care
if the sound source is periodic (buzz) or chaotic (hiss).
This means that for each frame, the LPC encoder must decide if the sound source is
buzz or hiss; if buzz, estimate the frequency; in either case, estimate the intensity; and
encode the information so that the decoder can undo all these steps. This is how LPC-
10e, the algorithm described in federal standard 1015, works: it uses one number to
represent the frequency of the buzz, and the number 0 is understood to represent hiss.
LPC-10e provides intelligible speech transmission at 2400 bits per second.
Another problem is that, inevitably, any inaccuracy in the estimation of the formants
means that more speech information gets left in the residue. The aspects of nasal
sounds that don't match the LPC model (as discussed above), for example, will end up
in the residue.
There are other aspects of the speech sound that don't match the LPC model; side
branches introduced by the tongue positions of some consonants, and tracheal (lung)
resonances are some examples.
Therefore, the residue contains important information about how the speech should
sound, and LPC synthesis without this information will result in poor quality speech.
For the best quality results, we could just send the residue signal, and the LPC synthesis
would sound great. Unfortunately, the whole idea of this technique is to compress the
speech signal, and the residue signal takes just as many bits as the original speech
signal, so this would not provide any compression.
Various attempts have been made to encode the residue signal in an efficient way,
providing better quality speech than LPC-10e without increasing the bit rate too much.
The most successful methods use a codebook, a table of typical residue signals, which
is set up by the system designers. In operation, the analyzer compares the residue to all
the entries in the codebook, chooses the entry which is the closest match, and just sends
the code for that entry. The synthesizer receives this code, retrieves the corresponding
residue from the codebook, and uses that to excite the formant filter. Schemes of this
kind are called Code Excited Linear Prediction (CELP).
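The codebook search can be sketched as follows. The tiny codebook and vectors here are invented for illustration only; real CELP codebooks hold far more and far longer entries, and the search uses a perceptually weighted error rather than a plain squared error:

```python
import numpy as np

# Toy codebook of "typical residue" vectors (illustrative values only).
codebook = np.array([
    [1.0, 0.0, 0.0, 0.0],    # impulse-like entry
    [0.5, 0.5, 0.5, 0.5],    # flat entry
    [1.0, -1.0, 1.0, -1.0],  # alternating entry
])

def encode_residue(residue):
    """Analyzer side: send only the index of the closest codebook entry."""
    errors = np.sum((codebook - residue) ** 2, axis=1)  # squared error per entry
    return int(np.argmin(errors))

def decode_residue(index):
    """Synthesizer side: look the residue back up to excite the formant filter."""
    return codebook[index]

idx = encode_residue(np.array([0.9, -0.8, 1.1, -1.2]))
print(idx)   # → 2: the alternating entry is the closest match
```

Only the index travels over the channel, which is what makes the scheme a compression method.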
This system is excited by an impulse train for voiced speech or a random noise
sequence for unvoiced speech. Thus, the parameters of this model are:
Voiced/unvoiced classification, pitch period for voiced speech, gain parameter G, and
the coefficients {ak} of the digital filter. These parameters, of course, all vary slowly
with time.
The pitch period and voiced/unvoiced classification can be estimated using one of the
many methods. This simplified all-pole model is a natural representation of non-nasal
voiced sounds, but for nasals and fricative sounds, the detailed acoustic theory calls for
both poles and zeros in the vocal tract transfer function. We shall see, however, that if
the order p is high enough, the all-pole model provides a good representation for almost
all the sounds of speech.
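These parameters are enough to run the synthesis side of the model directly. A sketch of driving the all-pole filter with an impulse-train excitation; the first-order coefficient, gain, and 80-sample pitch period are illustrative assumptions, not values estimated from real speech:

```python
import numpy as np

def lpc_synthesize(a, G, excitation):
    """All-pole synthesis s(n) = sum_k a_k s(n-k) + G u(n), cf. Eq. (4.2)."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(s)):
        for k in range(1, min(p, n) + 1):
            s[n] += a[k - 1] * s[n - k]   # feedback through past outputs
        s[n] += G * excitation[n]          # scaled excitation input
    return s

# Voiced excitation: impulse train at an 80-sample pitch period;
# unvoiced excitation would be a random noise sequence instead.
u_voiced = np.zeros(400)
u_voiced[::80] = 1.0

a = [0.9]   # illustrative (stable) coefficient set, p = 1
voiced = lpc_synthesize(a, G=1.0, excitation=u_voiced)
```

Each impulse launches a decaying response of the filter, so the output repeats at the pitch period, as the model predicts for voiced speech.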
The major advantage of this model is that the gain parameter, G, and the filter
coefficients {ak} can be estimated in a very straightforward and computationally
efficient manner by the method of linear predictive analysis.
For the system of Fig. 4.3, the speech samples s(n) are related to the excitation u(n) by
the simple difference equation
s(n) = Σ_{k=1}^{p} a_k s(n−k) + G u(n)        (4.2)
A linear predictor with prediction coefficients α_k is defined as a system whose output is

s̃(n) = Σ_{k=1}^{p} α_k s(n−k)        (4.3)

Such systems are used to reduce the variance of the difference signal in differential
quantization schemes. The system function of a pth-order linear predictor is the
polynomial

P(z) = Σ_{k=1}^{p} α_k z^{−k}        (4.4)
The prediction error is defined as

e(n) = s(n) − s̃(n) = s(n) − Σ_{k=1}^{p} α_k s(n−k)        (4.5)

From Eq. (4.5) it can be seen that the prediction error sequence is the output of a
system whose transfer function is

A(z) = 1 − Σ_{k=1}^{p} α_k z^{−k}        (4.6)
It can be seen by comparing Eqs. (4.2) and (4.5) that if the speech signal obeys the
model of Eq. (4.2) exactly, and if α k = ak , then e(n) = Gu(n).
Thus, the prediction error filter, A(z), will be an inverse filter for the system, H(z), of
Eq. (4.1), i.e.,
H(z) = G / A(z)        (4.7)
The basic problem of linear prediction analysis is to determine a set of predictor
coefficients (ak) directly from the speech signal in such a manner as to obtain a good
estimate of the spectral properties of the speech signal through the use of Eq. (4.7).
Because of the time-varying nature of the speech signal the predictor coefficients must
be estimated from short segments of the speech signal. The basic approach is to find a
set of predictor coefficients that will minimize the mean-squared prediction error over a
short segment of the speech waveform. The resulting parameters are then assumed to be
the parameters of the system function, H(z), in the model for speech production.
That this approach will lead to useful results may not be immediately obvious, but it
can be justified in several ways. First, recall that if α k = ak , then e(n) = Gu(n). For
voiced speech this means that e(n) would consist of a train of impulses; i.e., e(n) would
be small most of the time. Thus, finding ak’s that minimize prediction error seems
consistent with this observation. A second motivation for this approach follows from
the fact that if a signal is generated by Eq. (4.2) with non-time-varying coefficients and
excited either by a single impulse or by a stationary white noise input, then it can be
shown that the predictor coefficients that result from minimizing the mean squared
prediction error (over all time) are identical to the coefficients of Eq. (4.2). A third very
pragmatic justification for using the minimum mean-squared prediction error as a basis
for estimating the model parameters is that this approach leads to a set of linear
equations that can be efficiently solved to obtain the predictor parameters. More
importantly the resulting parameters comprise a very useful and accurate representation
of the speech signal as we shall see in this chapter.
The short-time average prediction error is defined as

E_n = Σ_m e_n²(m)        (4.8)

    = Σ_m (s_n(m) − s̃_n(m))²        (4.9)

    = Σ_m [ s_n(m) − Σ_{k=1}^{p} α_k s_n(m−k) ]²        (4.10)
where s_n(m) is a segment of speech that has been selected in the vicinity of sample n,
i.e.,

s_n(m) = s(m + n)        (4.11)
The range of summation in Eqs. (4.8)-(4.10) is temporarily left unspecified, but since
we wish to develop a short-time analysis technique, the sum will always be over a finite
interval. Also note that to obtain an average we should divide by the length of the
speech segment. However, this constant is irrelevant to the set of linear equations that
we will obtain and therefore is omitted. We can find the values of ak that minimize En
in Eq. (4.10) by setting ∂E n / ∂α i = 0 , i= 1, 2… p, thereby obtaining the equations
Σ_m s_n(m−i) s_n(m) = Σ_{k=1}^{p} α̂_k Σ_m s_n(m−i) s_n(m−k),    1 ≤ i ≤ p        (4.12)
where α̂_k are the values of α_k that minimize E_n. (Since the minimizer is unique, we
will drop the caret and use α_k to denote the values that minimize E_n.) If we define

φ_n(i, k) = Σ_m s_n(m−i) s_n(m−k)        (4.13)
then Eq. (4.12) can be written more compactly as

Σ_{k=1}^{p} α_k φ_n(i, k) = φ_n(i, 0),    i = 1, 2, ..., p        (4.14)
This set of p equations in p unknowns can be solved in an efficient manner for the
unknown predictor coefficients {α_k} that minimize the average squared prediction
error for the segment s_n(m). Using Eqs. (4.10) and (4.12), the minimum mean-squared
prediction error can be shown to be

E_n = Σ_m s_n²(m) − Σ_{k=1}^{p} α_k Σ_m s_n(m) s_n(m−k)        (4.15)
Thus the total minimum error consists of a fixed component, and a component which
depends on the predictor coefficients.
To solve for the optimum predictor coefficients, we must first compute the
quantities φn (i , k ) for 1 ≤ i ≤ p and 0 ≤ k ≤ p . Once this is done we only have to solve
Eq. (4.14) to obtain the ak’s. Thus, in principle, linear prediction analysis is very
straightforward. However, the details of the computation of φn (i , k ) and the
subsequent solution of the equations are somewhat intricate and further discussion is
required.
So far we have not explicitly indicated the limits on the sums in Eqs. (4.8)-(4.10) and in
Eq. (4.12); however it should be emphasized that the limits on the sum in Eq. (4.12) are
identical to the limits assumed for the mean squared prediction error in Eqs. (4.8)-
(4.10). As we have stated, if we wish to develop a short-time analysis procedure, the
limits must be over a finite interval. There are two basic approaches to this question,
and we shall see below that two methods for linear predictive analysis emerge out of a
consideration of the limits of summation and the definition of the waveform segment
sn(m).
The effect of this assumption on the question of limits of summation for the expressions
for En can be seen by considering Eq. (4.5). Clearly, if sn(m) is nonzero only
for 0 ≤ m ≤ N − 1 , then the corresponding prediction error, en(m), for a pth order
predictor will be nonzero over the interval 0 ≤ m ≤ N − 1 + p . Thus, for this case En is
properly expressed as
E_n = Σ_{m=0}^{N+p−1} e_n²(m)        (4.18)
Alternatively, we could have simply indicated that the sum should be over all nonzero
values by summing from −∞ to +∞.
Returning to Eq. (4.5), it can be seen that the prediction error is likely to be large at the
beginning of the interval (specifically 0 ≤ m ≤ p − 1) because we are trying to predict
the signal from samples that have arbitrarily been set to zero. Likewise the error can be
large at the end of the interval (specifically N ≤ m ≤ N + p − 1) because we are trying
to predict zero from samples that are nonzero. For this reason, a window which tapers
the segment s_n(m) to zero is generally used for w(m) in Eq. (4.17).
The limits on the expression for φ_n(i, k) in Eq. (4.13) are identical to those of Eq.
(4.18). However, because s_n(m) is identically zero outside the interval 0 ≤ m ≤ N − 1,
it is simple to show that

φ_n(i, k) = Σ_{m=0}^{N+p−1} s_n(m−i) s_n(m−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p        (4.19a)

can be expressed as

φ_n(i, k) = Σ_{m=0}^{N−1−(i−k)} s_n(m) s_n(m+i−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p        (4.19b)
Furthermore it can be seen that in this case φ_n(i, k) is identical to the short-time
autocorrelation function evaluated for (i − k). That is,

φ_n(i, k) = R_n(i − k)        (4.20)

where

R_n(k) = Σ_{m=0}^{N−1−k} s_n(m) s_n(m+k)        (4.21)

Since R_n(k) is an even function of k, Eq. (4.14) can therefore be expressed as

Σ_{k=1}^{p} α_k R_n(|i − k|) = R_n(i),    1 ≤ i ≤ p        (4.23)
Similarly, the minimum mean squared prediction error of Eq. (4.16) takes the
form

E_n = R_n(0) − Σ_{k=1}^{p} α_k R_n(k)        (4.24)
The set of equations given by Eq. (4.23) can be expressed in matrix form. The p×p
matrix of autocorrelation values is a Toeplitz matrix; i.e., it is symmetric and
all the elements along a given diagonal are equal. This special property will be
exploited in Section 4.3 to obtain an efficient algorithm for the solution of Eq. (4.23).
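The efficient algorithm in question is the Levinson-Durbin recursion, which solves the Toeplitz system in O(p²) operations instead of the O(p³) of general elimination. A sketch; the AR(1) autocorrelation used as a check is an illustrative assumption:

```python
import numpy as np

def levinson_durbin(R, p):
    """Solve sum_k alpha_k R(|i-k|) = R(i), i = 1..p (Eq. 4.23),
    exploiting the Toeplitz structure of the autocorrelation matrix."""
    a = np.zeros(p)
    E = R[0]                     # order-0 prediction error
    for i in range(1, p + 1):
        prev = a[:i - 1]
        k = (R[i] - np.dot(prev, R[i - 1:0:-1])) / E   # reflection coefficient
        a = np.concatenate([prev - k * prev[::-1], [k], np.zeros(p - i)])
        E *= 1.0 - k * k         # error update, consistent with Eq. (4.24)
    return a, E

# Check: R(k) = 0.5**k is the autocorrelation of s(n) = 0.5 s(n-1) + u(n),
# so the optimum predictor should come out as alpha = [0.5, 0, 0].
R = np.array([1.0, 0.5, 0.25, 0.125])
a, E = levinson_durbin(R, 3)
```

The intermediate reflection coefficients k also guarantee a stable synthesis filter whenever |k| < 1 at every order, which is one reason this recursion is preferred in speech coders.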
The second basic approach is to fix the interval over which the mean-squared error is
computed, i.e.,

E_n = Σ_{m=0}^{N−1} e_n²(m)        (4.26)

Then φ_n(i, k) becomes

φ_n(i, k) = Σ_{m=0}^{N−1} s_n(m−i) s_n(m−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p        (4.27)

In this case, if we change the index of summation we can express φ_n(i, k) as either

φ_n(i, k) = Σ_{m=−i}^{N−i−1} s_n(m) s_n(m+i−k),    1 ≤ i ≤ p, 0 ≤ k ≤ p        (4.28a)

or

φ_n(i, k) = Σ_{m=−k}^{N−k−1} s_n(m) s_n(m+k−i),    1 ≤ i ≤ p, 0 ≤ k ≤ p        (4.28b)
Although the equations look very similar to Eq. (4.19b), we see that the limits of
summation are not the same. Equations (4.28) call for values of s_n(m) outside the
interval 0 ≤ m ≤ N − 1. Indeed, to evaluate φ_n(i, k) for all of the required values of i
and k requires that we use values of s_n(m) in the interval −p ≤ m ≤ N − 1. If we are to
be consistent with the limits on E_n in Eq. (4.26) then we have no choice but to supply the
required values. In this case it does not make sense to taper the segment of speech to
zero at the ends as in the autocorrelation method, since the necessary values are made
available from outside the interval 0 ≤ m ≤ N − 1. Clearly, this approach is very
similar to what was called the modified autocorrelation function in Chapter 2; it leads
to a function which is not a true autocorrelation function but, rather, the
cross-correlation between two very similar, but not identical, finite-length segments of
the speech wave. Although the differences between Eq. (4.28) and Eq. (4.19b) appear
to be minor computational details, the set of equations

Σ_{k=1}^{p} α_k φ_n(i, k) = φ_n(i, 0),    i = 1, 2, ..., p        (4.29a)

has significantly different properties that strongly affect the method of solution and
the properties of the resulting optimum predictor.
In matrix form these equations become a system whose matrix is symmetric but not
Toeplitz.
CHAPTER 5
APPLICATIONS
• The range of fundamental frequency (f0) assumed here is between 150 and 250 Hz
for males and between 250 and 400 Hz for females.
Then we extract the part of the frame containing the highest energy content (the
part around the three formant frequencies) and neglect the rest of the frame; we
called this part "info".
By that we reduce the amount of signal which we send.
We then try to reconstruct the voice signal using the extracted part.
At the receiver, we reconstruct the voice signal using the smallest part of the
frame that gives a moderately good quality that allows for voice recognition.
We try to create a synthetic frame using the pitch and the first three formant
frequencies.
By using this synthetic frame, we create a synthesized voice signal.
Figure (5.4) the part of the frame containing the highest energy content
Figure (5.5) Info: the previous figure in the time domain
- First synthetic frame: reconstructed by convolving info with deltas at the pitch period.
- Second synthetic frame: reconstructed by convolving info with triangles whose
vertices are at the pitch period.
The synthesized signal is built and has good quality.
• Note that when the number of parameters increases, the quality increases too.
H(z) = S(z)/U(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k})
We can transmit this glottal pulse and reconstruct the signal at the receiver.
At the receiver, we reconstruct the voice signal using the LPC parameters and the
glottal pulse for voice recognition.
However, we need to reduce the data which we send, so we use the next approach.
- First synthetic frame: reconstructed from the part of the glottal pulse containing the
highest energy content.
Figure (5.14) part of the glottal pulse containing the highest energy content
- Second approach: generate triangles at the pitch period.
- Third approach: generate Hamming windows at the pitch period.
5.2 Speaker identification using LPC
Speaker recognition is the task of recognizing people from their voices. Such systems
extract features from speech, model them and use them to identify the person from
his/her voice.
In this application we first inspect which is more effective in speaker identification:
the LPC parameters or the glottal pulse.
Part 1
a. We take two voice samples from two different speakers, both samples for the
same phoneme, preferably a male sample and a female sample to emphasize the
difference in perception.
b. We pass each sample through an LPC filter getting the LPC parameters, and the
glottal pulse for both speakers.
c. The next step is to swap the glottal pulse of the two speakers and reconstruct the
voice signal of each speaker using his LPC parameters and the glottal pulse of
the other speaker.
d. After reconstructing the new voice signals, we find that the glottal pulse is the effective parameter in voice recognition.
e. The next figure shows that the LPC parameters of both speakers have very close values, which supports our conclusion.
f. While comparing the LPC parameters did not show a big difference between the two speakers, the comparison between their glottal pulses showed a much bigger difference, confirming our conclusion that the glottal pulse has a much greater weight in the identification of the speaker. This is shown in the next figure.
Part 2
The second portion of our application handles the identification part after
concluding that the glottal pulse is the effective parameter.
a. First we construct a code book containing voice samples from different
speakers but for the same phoneme.
b. Secondly we take a voice sample from one of the speakers for the same phoneme used in the code book construction. We consider this signal as our input signal, for which the identification of the speaker needs to be made.
c. The identification process is done using the distortion measure
technique.
d. Distortion measure = Σ(Sn − Si)²
where
Sn is one of the samples saved in the code book, n = 1, 2, …, N;
N is the number of samples saved in the code book (the number of speakers);
Si is the input signal.
e. The pair of signals with the least distortion identifies the speaker: the input is attributed to the speaker of the code book signal Sn that minimizes the measure.
f. We can also use an input signal from a speaker not found in the code book; in this case the program will calculate the distortion measure between this signal and the signals saved in the code book and choose the code book speaker with the least distortion measure.
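The identification steps above can be sketched as follows. The code book contents and the input vector are toy values invented for the demo, and `identify` is a hypothetical helper, not part of the project:

```python
import numpy as np

def identify(codebook, s_i):
    """Return the index n minimizing sum((S_n - S_i)^2), plus all distortions."""
    distortions = [float(np.sum((s_n - s_i) ** 2)) for s_n in codebook]
    return int(np.argmin(distortions)), distortions

# Toy "samples" standing in for the stored phoneme recordings of N = 3 speakers.
codebook = [np.array([1.0, 2.0, 3.0]),   # speaker 0
            np.array([4.0, 0.0, 1.0]),   # speaker 1
            np.array([2.0, 2.0, 2.0])]   # speaker 2
s_i = np.array([3.9, 0.2, 0.8])          # input signal to be identified
speaker, d = identify(codebook, s_i)
print(speaker)                           # → 1: the least-distortion entry
```

As in step f, an unknown speaker still gets mapped to the nearest code book entry, so a rejection threshold on the minimum distortion would be needed to detect out-of-code-book speakers.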
5.3 Introduction to VOIP
VoIP (voice over IP - that is, voice delivered using the Internet Protocol) is a term used
in IP telephony for a set of facilities for managing the delivery of voice information
using the Internet Protocol (IP). In general, this means sending voice information in
digital form in discrete packets rather than in the traditional circuit-committed protocols
of the public switched telephone network (PSTN). A major advantage of VoIP and
Internet telephony is that it avoids the tolls charged by ordinary telephone service.
VoIP, now used somewhat generically, derives from the VoIP Forum, an effort by major equipment providers to promote the use of ITU-T H.323, the standard for sending voice (audio) and video using IP on the public Internet and within an intranet. The Forum also promotes the use of directory service standards so that users can locate other users, and the use of touch-tone signals for automatic call distribution and voice mail.
In addition to IP, VoIP uses the Real-time Transport Protocol (RTP) to help ensure that packets get
delivered in a timely way. Using public networks, it is currently difficult to guarantee
Quality of Service (QoS). Better service is possible with private networks managed by
an enterprise or by an Internet telephony service provider (ITSP).
5.3.2 System architecture
5.3.3 Coding technique in VOIP systems
Codecs are software drivers that are used to encode the speech in a compact enough
form that they can be sent in real time across the Internet using the bandwidth available.
Codecs are not something that VOIP users normally need to worry about, as the VOIP
clients at each end of the connection negotiate between them which one to use.
VOIP software or hardware may give you the option to specify the codecs you prefer to
use. This allows you to make a choice between voice quality and network bandwidth
usage, which might be necessary if you want to allow multiple simultaneous calls to be
held using an ordinary broadband connection. Your selection is unlikely to make any
noticeable difference when talking to PSTN users, because the lowest bandwidth part
of the connection will always limit the quality achievable, but VOIP-to-VOIP calls
using a broadband Internet connection are capable of delivering much better quality
than the plain old telephone system.
The bit rate is an approximate indication of voice quality or fidelity; however, it is only approximate. Codecs that use pulse code modulation all give high fidelity, and you will detect little or no difference between any of them. The G.728 codec will give much better quality than the only nominally lower-rate GSM codec, because the algorithm it uses is much more sophisticated. However, the GSM codec uses less computing power, and so will run on simpler devices.
The bandwidth gives an indication of how much of the capacity of your broadband Internet connection will be consumed by each VOIP call. The bandwidth usage is not directly proportional to the bit rate, and will depend on factors such as the protocol used. Each chunk of voice data is contained within a UDP packet with headers and other information. This adds a network overhead of some 15-25 kbit/s, more than doubling the bandwidth used in some cases. However, most VOIP implementations use silence detection, so that no data at all is transmitted when nothing is being said.
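The overhead figure can be reproduced with a back-of-the-envelope calculation. The packet interval and header sizes below (20-ms packets, 20 + 8 + 12 = 40 bytes of IP + UDP + RTP headers) are typical assumed values, not figures from the text:

```python
def voip_bandwidth_kbps(codec_kbps, packet_ms=20, header_bytes=40):
    """Payload bit rate plus per-packet header overhead, in kbps."""
    packets_per_second = 1000 / packet_ms
    overhead_kbps = packets_per_second * header_bytes * 8 / 1000
    return codec_kbps + overhead_kbps

for codec, rate in [("G.711", 64), ("G.728", 16), ("GSM", 13)]:
    print(codec, rate, "kbps payload ->", voip_bandwidth_kbps(rate), "kbps on the wire")
# Under these assumptions the overhead is 16 kbps per call, which indeed more
# than doubles the bandwidth of the 13-kbps GSM codec.
```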
Insufficient bandwidth can result in interruptions to the audio if VOIP uses the same
Internet connection as other users who may be downloading files or listening to music.
For this reason, it is desirable to enable the Quality of Service "QoS" option in the
TCP/IP Properties of any computer running a software VOIP client, and to use a router
with QoS support for your Internet connection. This will ensure that your VOIP traffic
will be guaranteed a slice of the available bandwidth so that call quality does not suffer
due to other heavy Internet usage.
The following characteristics are recommended for the conversion of 64-kbps A-law or µ-law PCM channels to or from variable-rate embedded ADPCM channels. The recommendation defines the transcoding law when the source signal is a pulse code modulation signal at a rate of 64 kbps developed from voice-frequency analog signals as specified in ITU-T G.711. Figure 5.23 shows a simplified block diagram of the encoder and the decoder.
Figure 5.23 shows a simplified block diagram of the encoder and the decoder.
Embedded ADPCM Algorithms
Embedded ADPCM algorithms are variable-bit-rate coding algorithms with the capability of bit dropping outside the encoder and decoder blocks. They consist of a series of algorithms such that the decision levels of the lower-rate quantizers are subsets of those of the quantizer at the highest rate. This allows bit reductions at any point in the network without the need for coordination between the transmitter and the receiver. In contrast, the decision levels of conventional ADPCM algorithms, such as those in Recommendation G.726, are not subsets of one another; therefore, the transmitter must inform the receiver of the coding rate and the encoding algorithm.
Embedded algorithms can accommodate the unpredictable and bursty characteristics of traffic patterns that require congestion relief. This might be the case in IP-like networks, or in ATM networks with early packet discard. Because congestion relief may occur after the encoding is performed, embedded coding is different from variable-rate coding, where the encoder and decoder must use the same number of bits in each sample. In both cases, however, the decoder must be told the number of bits to use in each sample.
Embedded algorithms produce code words that contain enhancement bits and core bits. The feed-forward (FF) path utilizes enhancement and core bits, while the feedback (FB) path uses core bits only. The inverse quantizer and the predictor of both the encoder and the decoder use the core bits. With this structure, enhancement bits can be discarded or dropped during network congestion. However, the number of core bits in the FB paths of both the encoder and decoder must remain the same to avoid mistracking.
The four embedded ADPCM rates are 40, 32, 24, and 16 kbps, where the decision levels for the 32-, 24-, and 16-kbps quantizers are subsets of those for the 40-kbps quantizer. Embedded ADPCM algorithms are referred to by (x, y) pairs, where x refers to the FF (enhancement and core) ADPCM bits and y refers to the FB (core) ADPCM bits. For example, if y is set to 2 bits, (3, 2) represents the 24-kbps embedded algorithm and (2, 2) the 16-kbps algorithm. The bit rate is never less than 16 kbps because the minimum number of core bits is 2. Simplified block diagrams of both the embedded ADPCM encoder and decoder are shown in Figure 5.23.
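At 8000 samples per second, each feed-forward bit per sample contributes 8 kbps, so the (x, y) pairs map to the four embedded rates as a simple product. A one-line sanity check (the guard for the 2-core-bit minimum is an assumption added for the demo):

```python
SAMPLES_PER_SECOND = 8000

def embedded_adpcm_rate_kbps(x, y):
    """Bit rate of an (x, y) embedded ADPCM pair: x FF bits per 8-kHz sample."""
    assert y >= 2 and x >= y, "at least 2 core bits, and core bits are a subset of FF bits"
    return SAMPLES_PER_SECOND * x // 1000

rates = [embedded_adpcm_rate_kbps(x, 2) for x in (5, 4, 3, 2)]
print(rates)   # → [40, 32, 24, 16]
```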
G.729 and G.723.1 can be differentiated in the manner in which they form the excitation signal (e.g., ACELP) and partition the excitation space (the algebraic codebook), although both assume that all pulses have the same amplitude and that the sign information will be transmitted. The two vocoders also show major differences in terms of delay.
Differentiations
G.729 has excitation frames of 5 ms and allows four pulses to be selected. The 40-sample frame is partitioned into four subsets. The first three subsets have eight possible locations for pulses; the fourth has sixteen. One pulse must be chosen from each subset. This is a four-pulse ACELP excitation codebook method (see figure 5.2).
G.723.1 has excitation frames of 7.5 ms, and also uses a four-pulse ACELP excitation codebook for the 5.3-kbps mode. For the higher-rate mode, a multipulse maximum likelihood quantizer (MP-MLQ) is employed. Here the frame positions are grouped into even-numbered and odd-numbered subsets. A sequential multipulse search is used for a fixed number of pulses from the even subset (either five or six, depending on whether the frame itself is odd- or even-numbered); a similar search is repeated for the odd-numbered subset. Then the set resulting in the lowest total distortion is selected for the excitation (1).
At the decoder stage, the linear prediction coding (LPC) information and the adaptive and fixed codebook information are demultiplexed and then used to reconstruct the output signal. An adaptive postfilter is used. In the case of the G.723.1 vocoder, the long-term (LT) postfilter is applied to the excitation signal before it is passed through the LPC synthesis filter and the short-term (ST) postfilter.
G.723.1
G.723.1 specifies a coded representation that can be used for compressing the speech or other audio signal component of multimedia services at a very low bit rate. In the design of this coder, the principal application considered by the Study Group was very low bit-rate visual telephony as part of the overall H.324 family of standards.
This coder has two bit rates associated with it, 5.3 and 6.3 kbps. The higher bit rate gives greater quality. The lower bit rate gives good quality and provides system designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder. It is possible to switch between the two rates at any 30-ms frame boundary. An option for variable-rate operation using discontinuous transmission and noise fill during nonspeech intervals is also possible.
The G.723.1 coder was optimized to represent speech with high quality at the stated rates, using a limited amount of complexity. Music and other audio signals are not represented as faithfully as speech, but can be compressed and decompressed using this coder.
The G.723.1 coder encodes speech or other audio signals in 30-ms frames. In addition, there is a look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. Any additional delay in the implementation and operation of this coder is due to the following:
1- The actual time spent processing the data in the encoder and decoder.
2- The transmission time on the communication link.
3- Additional buffering delay for the multiplexing protocol.
Encoder / Decoder
The G.723.1 coder is designed to operate with a digital signal obtained by first performing telephone-bandwidth filtering (Recommendation G.712) of the analog input, then sampling at 8000 Hz, and then converting to 16-bit linear PCM for the input to the encoder.
The output of the decoder is converted back to analog by similar means.
Other input/output characteristics, such as those specified by Recommendation G.711 for 64-kbps PCM data, should be converted to 16-bit linear PCM before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of 240 samples each; that is equal to 30 ms at an 8-kHz sampling rate. Each block is first high-pass filtered to remove the DC component and then divided into four subframes of 60 samples each. For every subframe, a tenth-order linear prediction coder filter is computed using the unprocessed input signal. The LPC filter for the last subframe is quantized using a predictive split vector quantizer (PSVQ). The quantized LPC coefficients are used to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal.
For every two subframes (120 samples), the open-loop pitch period Lo is computed using the weighted speech signal. This pitch estimation is performed on blocks of 120 samples. The pitch period is searched in the range from 18 to 142 samples.
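The 18-142 sample search range corresponds to a fundamental-frequency range that can be checked with a quick calculation at the 8-kHz sampling rate:

```python
fs = 8000                      # sampling rate, Hz
lo, hi = 18, 142               # pitch-period search range, samples
f_max, f_min = fs / lo, fs / hi
print(f"F0 range: {f_min:.1f} Hz to {f_max:.1f} Hz")   # → 56.3 Hz to 444.4 Hz
```

This span comfortably covers typical male and female voices.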
From this point, the speech is processed on a basis of 60 samples per subframe. Using the estimated pitch period computed previously, a harmonic noise shaping filter is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise shaping filter is used to create an impulse response. The impulse response is then used for further computations.
Using the open-loop pitch period estimate Lo and the impulse response, a closed-loop pitch predictor is computed. A fifth-order pitch predictor is used. The pitch period is
computed as a small differential value around the open-loop pitch estimate. The contribution of the pitch predictor is then subtracted from the initial target vector. Both the pitch period and the differential values are transmitted to the decoder.
Finally, the nonperiodic component of the excitation is approximated. For the high bit rate, multipulse maximum likelihood quantization (MP-MLQ) excitation is used, and for the low bit rate, an algebraic code excitation is used.
The block diagram of the encoder is shown in figure 5.25. It contains the following blocks:
• Framer
• High-pass filter
• LPC analysis
• Line spectral pair (LSP) quantizer
• LSP interpolation
• Formant perceptual weighting filter
• Pitch estimation
• Subframe processing
• Harmonic noise shaping
• Impulse response calculator
• Zero-input response and ringing subtraction
• Pitch predictor
• High-rate excitation (MP-MLQ)
• Excitation decoder
• Pitch information decoding
References
Topic 1: Audiology
[1] Architectural Acoustics, McGraw-Hill Inc., New York, 1988, p. 18.
[2] U.S. Dept. of Commerce / National Bureau of Standards, Handbook 119, July 1976: Quieting: A Practical Guide to Noise Control, p. 61.
[3] Kuttruff, H. (1991) Room Acoustics, Elsevier Science Publishers, Essex.
[4] http://www.STCratings.com
Topic 4: Speech Technology
[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ.
[2] C. J. Weinstein, "A Linear Predictive Vocoder with Voice Excitation", Proc. EASCON.
[3] Daniel Minoli and Emma Minoli, Delivering Voice over IP Networks, Wiley Computer Publishing, John Wiley & Sons, Inc.
[4] http://www.data-compression.com/speech.html
[5] http://www.otolith.com/otolith/olt/lpc.html