
New Speech Coding Techniques

Mr. L.Ramesh
AP/ECE
Introduction

Efficient speech-coding techniques


Advantages for VoIP
Digital streams of ones and zeros
The lower the bandwidth, the lower the
quality
RTP payload types
Processing power
The better quality (for a given bandwidth)
uses a more complex algorithm
A balance between quality and cost
Voice Quality

Bandwidth is easily quantified


Voice quality is subjective
MOS, Mean Opinion Score
ITU-T Recommendation P.800
 Excellent – 5
 Good – 4
 Fair – 3
 Poor – 2
 Bad – 1
A minimum of 30 people
Listen to voice samples or in conversations
 P.800 recommendations
 The selection of participants
 The test environment
 Explanations to listeners
 Analysis of results
 Toll quality
 A MOS of 4.0 or higher
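The MOS calculation above can be sketched as a plain average of listener ratings. The ratings below are made-up illustration data, not results from any real P.800 test, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical ratings from 30 listeners on the P.800 scale
# (5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad).
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5] * 3


def mean_opinion_score(scores):
    """Average the individual opinion scores into a single MOS."""
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be integers from 1 to 5")
    return mean(scores)


mos = mean_opinion_score(ratings)
is_toll_quality = mos >= 4.0  # toll quality: a MOS of 4.0 or higher
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality}")
```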
About Speech

Speech
Air pushed from the lungs past the vocal
cords and along the vocal tract
The basic vibrations – vocal cords
The sound is altered by the disposition of the
vocal tract (tongue and mouth)
Model the vocal tract as a filter
The shape changes relatively slowly
The vibrations at the vocal cords
The excitation signal
Speech sounds
Voiced sound
 The vocal cords vibrate open and close
 Quasi-periodic pulses of air
 The rate of the opening and closing – the pitch
Unvoiced sounds
 Forcing air at high velocities through a constriction
 Noise-like turbulence
 Show little long-term periodicity
 Short-term correlations still present
Plosive sounds
 A complete closure in the vocal tract
 Air pressure is built up and released suddenly
Voice Sampling
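Discrete-Time LTI Systems: The Convolution Sum

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

[Figure: stem plots of h[n] (amplitude 1 at n = 0, 1, 2), x[n] (amplitudes 0.5 and 2 at n = 0, 1), and the output y[n] (amplitudes 0.5, 2 and 2.5 over n = 0..3)]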
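A minimal sketch of the convolution sum for finite-length sequences. The sample values are an assumption chosen to match the amplitudes that survive from the figure (h[n] = 1 for n = 0, 1, 2; x[0] = 0.5, x[1] = 2); the ordering of the x[n] samples is a guess.

```python
def convolve(x, h):
    """Direct evaluation of y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):  # h is zero outside its support
                y[n] += x[k] * h[n - k]
    return y


x = [0.5, 2.0]       # assumed input samples at n = 0, 1
h = [1.0, 1.0, 1.0]  # assumed impulse response at n = 0, 1, 2
y = convolve(x, h)
print(y)
```

Each output sample is the sum of the shifted, scaled copies of h[n] that overlap at that index, which is why the output is longer than either input.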
 Nyquist sampling theorem
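$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$

[Figure: spectrum X_c(jΩ), band-limited to (-Ω_N, Ω_N); spectrum of the sampled signal with copies centered at 0 and ±Ω_s, the copy at Ω_s starting at (Ω_s - Ω_N)]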
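A small sketch of what goes wrong when the spectral copies overlap, that is, when the sampling rate is not more than twice the highest signal frequency. The 8 kHz rate matches the telephony rate used later in these slides; the two tone frequencies are arbitrary illustration values.

```python
import math

FS_HZ = 8000.0  # sampling rate, as used for telephony speech


def sample_cosine(freq_hz, num_samples, fs_hz=FS_HZ):
    """Return samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(num_samples)]


def satisfies_nyquist(max_freq_hz, fs_hz=FS_HZ):
    """True when the sampling rate exceeds twice the highest frequency."""
    return fs_hz > 2 * max_freq_hz


tone_ok = sample_cosine(3000.0, 16)       # below fs/2: representable
tone_aliased = sample_cosine(5000.0, 16)  # above fs/2: aliases to 3000 Hz

# The two sample sequences are indistinguishable after sampling.
max_diff = max(abs(a - b) for a, b in zip(tone_ok, tone_aliased))
print(satisfies_nyquist(3000.0), satisfies_nyquist(5000.0), max_diff)
```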
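Quantization (Scalar Quantization)

Assume $|x[n]| \le A$
Divide the range $[-A, A]$ into L quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$
$J_k : [m_{k-1}, m_k]$, with $m_0 = -A$ and $m_L = A$
$L = 2^R$
Each quantization level $J_k$ is represented by a value $v_k$
$S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$

[Figure: the axis from m_0 = -A to m_L = A split at m_1, m_2, ..., m_k, m_{k+1}, ..., m_{L-1}; the interval J_{k+1} between m_k and m_{k+1} is represented by v_{k+1}]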
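A minimal sketch of a uniform scalar quantizer with L = 2^R levels on [-A, A], reading R as the number of bits per sample. Taking each representative value as the midpoint of its interval is an assumption; the slide only says that each interval has some representative value.

```python
def uniform_quantize(x, amplitude, bits):
    """Map x to (interval index, representative value) on [-amplitude, amplitude]."""
    levels = 2 ** bits                      # L = 2^R
    step = 2 * amplitude / levels           # equal-width intervals
    x = max(-amplitude, min(amplitude, x))  # enforce |x| <= A
    index = min(int((x + amplitude) / step), levels - 1)  # 0-based interval index
    value = -amplitude + (index + 0.5) * step             # assumed: interval midpoint
    return index, value


index, value = uniform_quantize(0.3, amplitude=1.0, bits=3)
print(index, value)
```

With midpoint representatives the quantization error never exceeds half a step, regardless of the signal level, which is what motivates non-uniform spacing for low-level speech.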
Non-Uniform Quantization

[Figure: decision boundaries m_0 = -A, m_1, m_2, ..., 0, ..., m_L = A, spaced non-uniformly]

Concept: small quantization levels for small x, large quantization levels for large x
Goal: constant $SNR_Q$ for all x

Companding

x[n] -> F(x) (Compressor) -> Uniform Quantization -> ...1101... -> Uniform Decoder -> F^{-1}(x) (Expandor) -> x̂[n]

Compressor + Expandor = Compandor
F(x) specifies the non-uniform quantization characteristics
Non-Uniform Quantization

μ-law:
$$F(x) = \frac{\log(1 + \mu x)}{\log(1 + \mu)}, \quad 0 \le x \le 1$$

A-law:
$$F(x) = \begin{cases} \dfrac{A x}{1 + \ln A}, & 0 \le x \le \dfrac{1}{A} \\[2ex] \dfrac{1 + \ln(A x)}{1 + \ln A}, & \dfrac{1}{A} \le x \le 1 \end{cases}$$

Typical values in practice: μ = 255, A = 87.6
Types of Speech Codecs
Waveform codecs, source codecs (also known as vocoders), and hybrid codecs.
Speech Source Model and Source Coding

[Block diagram: a random sequence generator (unvoiced) and a periodic pulse train generator with period N (voiced) feed a v/u switch; the selected signal is scaled by the gain G to form the excitation u[n], which drives the vocal tract model G(z), G(ω), g[n] to produce x[n]]

$$G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$

Excitation parameters (define the excitation signal u[n]):
 v/u: voiced/unvoiced
 N: pitch for voiced
 G: signal gain
Vocal tract parameters:
 {a_k}: LPC coefficients, capturing the formant structure of speech signals
A good approximation, though not precise enough
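A minimal sketch of the source model: an excitation u[n] (pulse train for voiced, noise for unvoiced) scaled by a gain and passed through the all-pole filter G(z) = 1 / (1 - sum a_k z^-k), which corresponds to the recursion x[n] = u[n] + sum a_k x[n-k]. The coefficient and pitch values are arbitrary illustration values, and the use of Gaussian noise for the unvoiced case is an assumption.

```python
import random


def excitation(voiced, pitch_period, gain, num_samples, seed=0):
    """Build u[n]: a periodic pulse train (voiced) or random noise (unvoiced)."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def vocal_tract(u, a):
    """All-pole filter: x[n] = u[n] + sum_{k=1..P} a[k-1] * x[n-k]."""
    x = []
    for n, u_n in enumerate(u):
        acc = u_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * x[n - k]
        x.append(acc)
    return x


a = [0.5]  # hypothetical single LPC coefficient (P = 1)
voiced_speech = vocal_tract(excitation(True, 4, 1.0, 8), a)
unvoiced_speech = vocal_tract(excitation(False, 4, 1.0, 8), a)
print(voiced_speech)
```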
LPC Vocoder (Voice Coder)

Transmitter: x[n] -> LPC Analysis -> {a_k}, N, G, v/u -> Encoder -> ...11011...
 N by pitch detection
 v/u by voicing detection
Receiver: ...11011... -> Decoder -> {a_k}, N, G, v/u -> Excitation (Ex) -> vocal tract filter g[n], G(z) -> x[n]

{a_k} can be non-uniformly or vector quantized to reduce the bit rate further
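The slides do not say how the LPC analysis step finds {a_k}. One common approach, assumed here, is the autocorrelation method solved with the Levinson-Durbin recursion; the sketch below covers only that step and leaves out pitch and voicing detection. The test signal is a synthetic decaying exponential, not speech.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a_k in x[n] ~ sum_k a_k * x[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] hold the coefficients
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err        # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1 - k * k)
    return a[1:], err


# Synthetic frame: impulse response of 1 / (1 - 0.9 z^-1).
frame = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), order=2)
print(coeffs, residual_energy)
```

For this frame the analysis recovers a first coefficient close to 0.9 and a second close to zero, matching the filter that generated it.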
G.711

The most commonplace codec
 Used in the circuit-switched telephone network
PCM, Pulse-Code Modulation
If uniform quantization
 12 bits * 8 k samples/sec = 96 kbps
Non-uniform quantization
 64 kbps DS0 rate
 μ-law
  North America
 A-law
  Other countries, a little friendlier to lower signal levels
An MOS of about 4.3
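The bit-rate arithmetic for PCM can be checked directly. G.711 uses 8 bits per companded sample at 8000 samples per second, which gives the 64 kbps DS0 rate; the function name below is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # telephony sampling rate


def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bits per second for sample-by-sample PCM."""
    return bits_per_sample * sample_rate_hz


uniform_rate = pcm_bit_rate(12)   # uniform quantization of comparable quality
companded_rate = pcm_bit_rate(8)  # G.711 mu-law or A-law
print(uniform_rate, companded_rate)
```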
ADPCM (Adaptive Differential PCM)

DPCM and ADPCM
ADPCM: adaptive prediction in DPCM, with adaptive quantization
Adaptive Quantization
 Quantization level Δ varies with local signal level
 $\Delta[n] = a\,\sigma_x[n]$
 $\sigma_x[n]$: locally estimated standard deviation of x[n]
G.721: ADPCM-coded speech at 32 kbps
G.726 (A-law or μ-law)
 16, 24, 32, 40 kbps
 MOS 4.0 at 32 kbps
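A sketch of the adaptive step-size idea only: the step is a constant times a locally estimated standard deviation of the signal. The window length, the scale factor and the minimum step are illustration values, and the estimate here looks at the input samples themselves; standardized ADPCM coders adapt from previously decoded values instead, which this sketch does not attempt.

```python
import math


def local_std(x, n, window):
    """Standard deviation of the last `window` samples up to and including x[n]."""
    segment = x[max(0, n - window + 1):n + 1]
    avg = sum(segment) / len(segment)
    return math.sqrt(sum((s - avg) ** 2 for s in segment) / len(segment))


def adaptive_steps(x, scale=0.5, window=16, min_step=1e-4):
    """Step size per sample: scale * local standard deviation, floored at min_step."""
    return [max(min_step, scale * local_std(x, n, window)) for n in range(len(x))]


# Quiet passage followed by a loud one (synthetic alternating samples).
signal = [0.01 * (-1) ** n for n in range(32)] + [1.0 * (-1) ** n for n in range(32)]
steps = adaptive_steps(signal)

# G.726 rates follow from 2, 3, 4 or 5 bits per sample at 8000 samples/s.
g726_bit_rates = [bits * 8000 for bits in (2, 3, 4, 5)]
print(steps[31], steps[63], g726_bit_rates)
```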
Analysis-by-Synthesis (AbS) Codecs
 Hybrid codec
Fill the gap between waveform and source
codecs
The most successful and commonly used
 Time-domain AbS codecs
 Not a simple two-state, voiced/unvoiced
 Different excitation signals are attempted
 Closest to the original waveform is selected
 MPE, Multi-Pulse Excited
 RPE, Regular-Pulse Excited
 CELP, Code-Excited Linear Predictive
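The "try different excitations, keep the closest" loop can be sketched as below. The candidate excitations, the filter coefficient and the plain squared-error measure are assumptions for illustration; real coders use much larger candidate sets and a perceptually weighted error.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter: x[n] = u[n] + sum_k a[k-1] * x[n-k]."""
    x = []
    for n, u_n in enumerate(excitation):
        acc = u_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * x[n - k]
        x.append(acc)
    return x


def abs_search(target, candidates, a):
    """Return (index, gain, error) of the candidate whose synthesis best fits target."""
    best = None
    for i, cand in enumerate(candidates):
        y = synthesize(cand, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (i, gain, err)
    return best


a = [0.5]  # hypothetical filter coefficient
candidates = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
target = [2 * v for v in synthesize(candidates[2], a)]
best_index, best_gain, best_error = abs_search(target, candidates, a)
print(best_index, best_gain, best_error)
```

Only the winning index and gain need to be sent, since the decoder can rebuild the same waveform from the same candidate set and filter.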
G.728 LD-CELP
 CELP codecs
 A filter; its characteristics change over time
 A codebook of acoustic vectors
 A vector = a set of elements representing various characteristics of the excitation
 Transmit
 Filter coefficients, gain, a pointer to the vector
chosen
 Low Delay CELP
 Backward-adaptive coder
 Use previous samples to determine filter coefficients
 Operates on five samples at a time
 Delay < 1 ms
 Only the pointer is transmitted
 1024 vectors in the code book
 10-bit pointer (index)
 16 kbps
 LD-CELP encoder
 Minimize a frequency-weighted mean-square error
 LD-CELP decoder

 An MOS score of about 3.9


 One-quarter of G.711 bandwidth
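The G.728 figures above are mutually consistent, which a few lines of arithmetic confirm. The 64 kbps comparison value is the G.711 DS0 rate.

```python
SAMPLE_RATE_HZ = 8000
CODEBOOK_SIZE = 1024
SAMPLES_PER_VECTOR = 5

index_bits = (CODEBOOK_SIZE - 1).bit_length()          # bits to address 1024 vectors
vector_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ  # duration of one vector
bit_rate = index_bits * SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
fraction_of_g711 = bit_rate / 64000

print(index_bits, vector_ms, bit_rate, fraction_of_g711)
```

Because only five samples (0.625 ms) are buffered before coding, the algorithmic delay stays under 1 ms.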
G.723.1 ACELP
 6.3 or 5.3 kbps
 Both mandatory
 Can change from one to another during a conversation
 The coder
 A band-limited input speech signal
 Sampled at 8 kHz, 16-bit uniform PCM quantization
 Operate on blocks of 240 samples at a time
 A look-ahead of 7.5 ms
 A total algorithmic delay of 37.5 ms + other delays
 A high-pass filter to remove any DC component
 G.723.1 Annex A
 Silence Insertion Descriptor (SID) frames of size four octets
 The two LSBs of the first octet indicate the frame type:
  Bits  Frame type  Octets/frame
  00    6.3 kbps    24
  01    5.3 kbps    20
  10    SID frame   4
 An MOS of about 3.8
 At least 37.5 ms delay
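A sketch of how a receiver could read the frame type from the two least significant bits of the first octet, together with the frame timing arithmetic. The function and table names are hypothetical; the bit value 11 is not listed in the slides, so it is rejected here rather than interpreted.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = 240
LOOK_AHEAD_MS = 7.5

# Two LSBs of the first octet -> (frame type, frame size in octets)
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}


def frame_info(first_octet):
    """Return (frame type, size in octets) from the first octet of a frame."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type code not covered by this sketch")
    return FRAME_TYPES[code]


frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ  # 30 ms of speech per frame
algorithmic_delay_ms = frame_ms + LOOK_AHEAD_MS

# Rate of the octet-aligned frames, slightly above the nominal codec rates.
packed_rates_bps = {name: size * 8 * 1000 / frame_ms
                    for name, size in FRAME_TYPES.values() if name != "SID"}
print(frame_info(0x01), algorithmic_delay_ms, packed_rates_bps)
```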
G.729
 8 kbps
 Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
 5 ms look-ahead
 Algorithmic delay of 15 ms
 An 80-bit frame for 10 ms of speech
 A complex codec
 G.729A (Annex A), a number of simplifications
 Same frame structure
 Encoder and decoder interoperate with G.729
 Slightly lower quality
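The G.729 frame figures can be cross-checked the same way as for the other frame-based codecs. This is plain arithmetic on the numbers in the slide; the helper name is hypothetical.

```python
def frame_codec_figures(bits_per_frame, frame_ms, look_ahead_ms, sample_rate_hz=8000):
    """Return (samples per frame, bit rate in bps, algorithmic delay in ms)."""
    samples_per_frame = int(sample_rate_hz * frame_ms / 1000)
    bit_rate_bps = int(bits_per_frame * 1000 / frame_ms)
    delay_ms = frame_ms + look_ahead_ms
    return samples_per_frame, bit_rate_bps, delay_ms


g729_samples, g729_rate, g729_delay = frame_codec_figures(
    bits_per_frame=80, frame_ms=10, look_ahead_ms=5)
print(g729_samples, g729_rate, g729_delay)
```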
