Professional Documents
Culture Documents
MobileASR Tan
MobileASR Tan
1
Introduction NSR DSR ESR Applications
2
Introduction NSR DSR ESR Applications
Outline
1 Introduction
5 Applications
3
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
4
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
Mobile technology
The prevalence of mobile devices: being used as digital
assistants, for communication or simply for fun.
Mobile phones: 3.5 billion by 2010
PDAs, MP3 players, GPS devices, digital cameras
5
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
Mobile technology
6
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
Technical challenges
Imperfection of networks
Data compression
Transmission impairments
8
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
Transmission impairments:
Landline Wireless
Circuit-switched reliable bit error
Packet-switched packet loss packet loss
Both bit error and packet loss tend to be burst-like, difficult
to recover from
11
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
12
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
y (t ) O W'
Grammar Text
Front-End Back-End Corpora
14
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
ASR components
Feature extraction
Mel-frequency cepstral coefficients (MFCC)
Signal processing for robustness
ASR decoding
Calculation of observation likelihood (based on acoustic
models with millions of parameters)
Search (in an HMM network formed by language model,
lexicon and sub-phonetic units)
15
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
16
Introduction NSR DSR ESR Applications Devices and networks Automatic speech recognition
Grammar Text
Front-End Back-End Corpora
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
18
Introduction NSR DSR ESR Applications Speech coding Transmission errors
Pros:
Ubiquitous presence of codec on mobile devices
Plug and play, without touching the massive clients
The only possibility for devices like a telephone
Cons:
Network dependency and error-prone channels
Inter-frame dependency in coding
Distortion introduced by low bit-rate coding
Linear prediction coding (LPC) vs. MFCCs
19
Introduction NSR DSR ESR Applications Speech coding Transmission errors
(a)
(b)
20
Introduction NSR DSR ESR Applications Speech coding Transmission errors
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
21
Introduction NSR DSR ESR Applications Speech coding Transmission errors
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
25
Introduction NSR DSR ESR Applications Speech coding Transmission errors
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
27
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Pros:
The absence of coding and transcoding problems
Robustness against comm. channel & acoustic noise
Thin client, easy to update, no limits in ASR complexity
Sever-side playback, semi-automatic transcription
Speech data collection for AM/LM adaptation (like search engines)
Cons:
Front-end must be implemented in the device
(Not an issue if the application requires a client-side installation anyway.)
W'
Error Source
Applications ASR Decoder
Concealment Decoding
ASR-Decoder Feature
Language Lexicon Acoustic EC Reconstruction
Model Model Server-based error concealment
29
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Ot
Log energy logE
Feature
compression
Frame-pair architecture
Frame 1 Frame 2 CRC 1-2 ... CRC 23-24
<44 bits > <44 bits> <4 bits> ... <4 bits>
< 138 octets / 1104 bits > for 12 frame-pairs
Multiframe
Sync Seq Header Frame packet
2 octets 4 octets 138 octets
< 144 octets / 1152 bits > for 240 ms
Bitrate
4.8 kbps with a payload of 4.4 kbps
31
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
DSR processing
32
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
33
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Ot
Log energy logE
ms 00 10 20 25 35 45
Acoustics 2008, Paris July 1, 2008 1
15 ms overlap
10 ms frame shift
25 ms frame length
34
the Aurora-2 database This improvement
(Hirsch and Pearce 2000). ThewillMFCCs
be become
consist of 13apparent when comparing the results between
Introduction NSR DSR ESR Applications
cepstral
MFCC Quantization ER.EC Standards Systems
coefficients, {cthe
i }i 0 scalar
12
. The log quantizer andlogthe
energy coefficient block
E, which quantizer.
is often concatenated
Standard deviation of c0
60
Square root of covariance coefficient
50
60 60
40
30
40 40
Standard deviation of c
1
20
20 Standard deviation of c2 20
10
0 0
0
1
2
5 10
3
4 10 13
25
30 35
12
5 11
6
15 9
10
20 20
7
8 7
8
15 25
20
(a) b)
9 6
10
11 20 4
5
10 30 15
Row number of covariance matrix 12
25 2
3
Column number of covariance matrix 10
13 1
5 5
35
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
36
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Source coding
37
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Quantization
38
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Quantization
39
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Histogram-based quantization
Acoustic noise may move feature vectors to a different
quantization cell in a fixed VQ codebook, introducing extra
distortion!
Histogram-based quantization
A dynamic quantization, based on signal local statistics, not on
any distance measure, nor related to any pretrained codebook.
42
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
43
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Error-robustness techniques
Error-robustness techniques
Active Passive
Parity check Checksum CRC Block Convolutional Insertion Interpolation Statistical Soft-feature decoding
44
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Error detection
17
Error detection methods
CRC (cyclic redundancy check), linear block codes
consistancy test
Data block size
35
Frame-pair
30
One-frame
25 Sub-vector1
Error Rate (%)
Sub-vector2
20
15
10
5
0
EP1 EP2 EP3 GSM EPs
45
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Channel coding:
Forward error correction (FEC) [Borgstrom et al., 2008]
media-specific FEC
media-independent FEC: e.g. (n, k) block encoding
(Reed-Solomon, BCH, Golay)
Multiple description coding (MDC): encoding a source
into 2+ substreams to be delivered on separate channels
Joint source and channel coding: UEP (unequal error
protection)
Packetization
Interleaving: to counteract burst errors at the cost of
delay [Milner and James, 2006]
46
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
47
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
EC techniques
Feature-reconstruction EC: create a substitution as close
to the original as possible.
ASR-decoder EC: modify ASR decoder to handle
degradations introduced by transmission errors - unique
to DSR
48
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Feature-reconstruction EC:
Insertion-based techniques: splicing, mean value
substitution, repetition
Interpolation-based techniques: linear, cubic
Soft-feature decoding based techniques
[Peinado et al., 2003]
Statistical-based techniques: use a priori info about
speech features [Gomez et al., 2003]
ASR-decoder EC:
Weighted Viterbi decoding [Cardenal-Lopez et al., 2004],
[Tan et al., 2007]
Uncertainty decoding [Ion and Haeb-Umbach, 2006]
49
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
⎢ A A +1 A+ 2 A + 2 N −1 A+ 2 N B⎥
⎢ S1 S1 S1 . S1 S1 S1 ⎥
⎢S A S2
A +1
S2
A+ 2
. S2
A + 2 N −1
S2
A+ 2 N
S2 ⎥
B
⎢ 2A A +1 A+ 2 A + 2 N −1 A+ 2 N B
⎥
⎢ S3 S3 S3 . S3 S3 S3 ⎥
⎢S A S4
A +1
S4
A+ 2
. S4
A + 2 N −1
S4
A+ 2 N
S4 ⎥
B
⎢ 4A A +1 A+ 2 A + 2 N −1 A+ 2 N
⎥
⎢ S5 S5 S5 . S5 S5 S5 ⎥
B
Subvector concealment
S S . S (cont.)
S
⎢ A
⎣ S6 6
A +1
6
A+ 2
6
A + 2 N −1
6
A+ 2 N B⎥
S6 ⎦
Consistency test
Consistency matrix and subvector concealment
(d ( S j t +1 (0) − S j t (0)) > T j (0)) OR (d ( S j t +1 (1) − S j t (1)) > T j (1))
S0 ⎡ 1 1 1 0 0 0 0 1
Advanced 1 ⎤
1Speech Processing, MM7, Zheng-Hua Tan, 2007 1
S1 ⎢ 1 1 1 1 1 0 0 1 1 1 ⎥⎥
⎢
S2 ⎢
⎢
1 1 1 1 1 1 1 1 1 1⎥
⎥
0 for inconsistent
C= S3 ⎢ 1 0 0 1 1 1 1 1 1 1⎥
S4 ⎢ 1 1 1 1 1 1 1 0 0 1⎥ 1 for consistent
⎢ ⎥
S5 ⎢ 1 1 1 1 1 1 1 1 1 1⎥
S6 ⎢ 1 ⎥⎦
⎣ 1 0 0 1 1 0 0 1 1
50
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
αn , n = 1...N/2
γ(t) =
1 − αN−n+1 , n = N/2 + 1...N
t t+1
αd(o (k),o (k))/Tk , Sjt consistent
γk (t) =
γk (t + p).β |p| , o t (k) substituted by o t+p (k)
51
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Uncertainty decoding
In standard form, state emission prob. (modelled by GMM) is
K
X −1
t t
bj (O ) = p(O |sj ) = wjk N(Ot ; µjk , Σjk )
k=0
t
where O is the observing vector, and sj is the state.
In uncertainty decoding, Ot is considered corrupted and the
uncorrupted, unobservable vector X is a random variable with
a distribution p(X|Ot ).
Remarks:
No requirement for modifying the client-side of DSR,
compatible with the ETSI-DSR standards
Repetition EC works pretty well with short burst length
Statistical based techniques benefit from a priori
knowledge of speech and is useful in particular when burst
length is long
ASR-decoder based techniques are unique for DSR and
can be applied in combination with other EC
Computational cost is of concern
53
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
A frame-rate perspective
Strong temporal correlation in speech features
ASR performance is intact with a frame loss rate (short
burst-length) of 50% (From [James and Milner, 2004])
55
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
13 14 15 16 17 18 19 20 21 22 23 24
(a) ETSI-DSR front-end frame-pair scheme
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
(b) One-frame scheme
1 3 5 7 9 11 13 15 17 19 21 23
(c) HFR scheme
1 3 5 7 9 11 2 4 6 8 10 12
13 15 17 19 21 23 14 16 18 20 22 24
2 4 6 8 10 12 14 16 18 20 22 24
58
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
59
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
60
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
Advanced front-end
Extended front-ends
62
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
63
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
1 Introduction
Devices and networks
Automatic speech recognition
2 Network Speech Recognition
Speech coding
Transmission errors
3 Distributed Speech Recognition
Properties of MFCCs
Quantization
Error recovery and concealment
Standards
Systems
4 Embedded Speech Recognition
5 Applications
64
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
65
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
66
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
67
server side. The choice is done at systemMFCC
Introduction NSR DSR ESR Applications
initialisation, and specific settings
Quantization ER.EC Standards Systems
can be changed at any time. The setting may be different across a group of
A configurable DSR system
end-users.
69
Introduction NSR DSR ESR Applications MFCC Quantization ER.EC Standards Systems
70
Introduction NSR DSR ESR Applications
3GPP TS 26.243
“ANSI C Code for the fixed-point distributed speech recognition extended advanced front-end.” 2004.
3GPP TR 26.943
“Recognition performance evaluations of codecs for Speech Enabled Services(SES).” 2004.
Besacier et al.
“The effect of speech and audio compression on speech recognition performance.”
in IEEE Multimedia Signal Processing Workshop, Cannes, France, October 2001.
Bryant, R.
“Data-intensive supercomputing: The case for DISC.”
CMU Technical Report CMU-CS-07-128, May 2007.
Cardenal-Lopez et al.
“Soft decoding strategies for distributed speech recognition over IP newtworks.”
in Proc. ICASSP, Montreal, Canada, 2004.
169
Introduction NSR DSR ESR Applications
Hirsch, H.G.
“The influence of speech coding on recognition performance in telecommunication networks.”
in Proc. ICSLP, Denver, USA, September 2002.
170
Introduction NSR DSR ESR Applications
Kim, H.K.
“Speech recognition over IP networks”
in Z.-H. Tan, and B. Lindberg (eds.), Automatic speech recognition on mobile devices and over
communication networks, Springer, 2008.
Kiss, I.
“A comparison of distributed and network speech recognition for mobile communication systems.”
in Proc. ICSLP, Beijing, China, October 2000.
171
Introduction NSR DSR ESR Applications
172
Introduction NSR DSR ESR Applications
173