CS8691-AI-UNIT-V-Speech Recognition
Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Conventional Speech Recognition
Machine Learning helps
• Speech recognition maps acoustic features X to a word sequence, e.g. "The weather is so nice".
• This is a structured learning problem.
• Inference: find W* = arg max_W P(W|X).
Input Representation
• Audio is represented by a vector sequence X: x1, x2, x3, ……
• Each xi is a 39-dim MFCC vector.
Input Representation - Splice
• To consider some temporal information, splice neighbouring MFCC frames: each frame is concatenated with its left and right context into one longer vector.
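Splicing can be sketched in a few lines of numpy; the ±4 frame context and the 39-dim MFCC input are illustrative assumptions:

```python
import numpy as np

def splice(frames, context=4):
    """Concatenate each frame with its +/- context neighbours.

    frames: (T, D) array of acoustic features (e.g. 39-dim MFCC).
    Returns a (T, D * (2*context + 1)) array; edges are padded by
    repeating the first/last frame.
    """
    T, D = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

x = np.random.randn(100, 39)          # 100 frames of 39-dim MFCC
spliced = splice(x, context=4)
print(spliced.shape)                  # (100, 351) = 39 * 9 frames
```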
Phoneme
• Phoneme: the basic unit of pronunciation.
• Each word corresponds to a sequence of phonemes, given by a lexicon.
• Example: "what do you think" → hh w aa t d uw y uw th ih ng k
State
• Each phoneme corresponds to a sequence of states.
• "what do you think"
  Phone: hh w aa t d uw y uw th ih ng k
  Tri-phone: …… t-d+uw, d-uw+y, uw-y+uw, y-uw+th ……
• Each tri-phone is divided into states, each with its own distribution, e.g. t-d+uw1 with P(x|"t-d+uw1"), d-uw+y3 with P(x|"d-uw+y3").
State
• Each state has a stationary distribution for acoustic features.
• Tied states: different states can share the same distribution. The states act as pointers, e.g. P(x|"d-uw+y3") and P(x|"t-d+uw1") may point to the same address.
Acoustic Model
P(X|W) = P(X|S)
W: what do you think
S: a b c d e …… (state sequence s1 s2 s3 s4 s5)
X: x1 x2 x3 x4 x5 …… (acoustic features)
• Moving from state to state is governed by transition probabilities; generating xi from si by emission probabilities.
Acoustic Model
P(X|W) = P(X|S)
W: what do you think
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
• Actually, we don't know the alignment between the states and the acoustic features.
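The missing alignment is usually recovered with the Viterbi algorithm; the sketch below assumes log-domain emission and transition scores for a small left-to-right HMM:

```python
import numpy as np

def viterbi_align(log_emit, log_trans):
    """Best state sequence for a left-to-right HMM.

    log_emit:  (T, S) array of log P(x_t | s)
    log_trans: (S, S) array of log transition probabilities
    """
    T, S = log_emit.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_emit[0]
    delta[0, 1:] = -np.inf                           # start in the first state
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 states; the first two frames favour state 0,
# the last two favour state 1.
log_emit = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.5, 0.5], [1e-10, 1.0]]))
print(viterbi_align(log_emit, log_trans))   # [0, 0, 1, 1]
```

In training, such forced alignments give each frame a state label, which is exactly what the later DNN-based acoustic models are trained to predict.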
People imagine ……
[Figure: a deep network taking acoustic features xi as input]
Low rank approximation
• The output layer weight matrix W is M × N.
• N is the size of the last hidden layer.
• M is the size of the output layer, i.e. the number of states.
• M can be large if the outputs are the states of tri-phones.
Low rank approximation
• Approximate W (M × N) by U (M × K) times V (K × N), with K < M, N.
• W ≈ U V: fewer parameters (M×K + K×N instead of M×N).
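The parameter saving can be sketched as follows; the layer sizes and rank K are illustrative, and truncated SVD is one standard way to obtain U and V:

```python
import numpy as np

M, N, K = 1000, 512, 64           # states, last hidden size, rank (assumed)

W = np.random.randn(M, N)         # original output layer weights
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_lowrank = (U[:, :K] * s[:K]) @ Vt[:K]   # rank-K approximation of W

full_params = M * N               # parameters of the full matrix
factored = M * K + K * N          # parameters of U and V together
print(full_params, factored)      # 512000 96768
```

Only U and V are stored and trained, so the output layer shrinks by roughly a factor of five in this configuration.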
How to use Deep Learning?
Way 1: Tandem
Way 1: Tandem system
• A DNN outputs P(a|xi), P(b|xi), P(c|xi), ……; the size of the output layer = the number of states.
• These DNN outputs become the input of your original speech recognition system.
• Using the last hidden layer or a bottleneck layer instead is also possible.
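A sketch of turning DNN posteriors into tandem features; the log compression and PCA decorrelation steps are common post-processing choices, not prescribed by the slide:

```python
import numpy as np

def tandem_features(posteriors, eps=1e-8):
    """Turn DNN state posteriors into features for a GMM-HMM system.

    Log-compress the posteriors (GMMs model the simplex poorly) and
    decorrelate with PCA, as is common in tandem setups.
    """
    logp = np.log(posteriors + eps)
    logp -= logp.mean(axis=0)                        # centre each dimension
    _, _, Vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ Vt.T                               # decorrelated features

post = np.random.dirichlet(np.ones(100), size=500)   # 500 frames, 100 states
feats = tandem_features(post)
print(feats.shape)                                   # (500, 100)
```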
How to use Deep Learning?
Way 2: DNN-HMM Hybrid
• The state posterior P(s|x) comes from the DNN.
• The state prior P(s) is counted from the training data.
• Together they replace the GMM emission probability: P(x|s) ∝ P(s|x) / P(s).
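A minimal sketch of the conversion, assuming the DNN gives log posteriors and the priors come from state occupation counts in the training alignments:

```python
import numpy as np

def scaled_log_likelihood(log_posterior, state_counts):
    """Turn DNN posteriors P(s|x) into HMM emission scores.

    P(x|s) = P(s|x) * P(x) / P(s); P(x) is the same for every state,
    so the usable score is log P(s|x) - log P(s), with the prior P(s)
    counted from the training-data alignments.
    """
    log_prior = np.log(state_counts / state_counts.sum())
    return log_posterior - log_prior        # broadcasts over frames

posteriors = np.array([[0.7, 0.2, 0.1]])    # one frame, three states
counts = np.array([500.0, 300.0, 200.0])    # state frames in training data
scores = scaled_log_likelihood(np.log(posteriors), counts)
print(scores.shape)                         # (1, 3)
```

These scaled scores drop into the HMM exactly where the GMM likelihoods used to be.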
Way 2: DNN-HMM Hybrid
[Figure: during training, the DNN output for the correct state is increased while the outputs for the other states are decreased]
How to use Deep Learning?
Way 3: End-to-end
Way 3: End-to-end - Character
• The network outputs characters directly from the acoustic frames, e.g. "HIS FRIEND'S".
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
Way 3: End-to-end - Word?
• Output layer: apple, bog, cat, …… — roughly 50k words in the lexicon.
• Input layer size = 200 times the acoustic feature dimension; shorter words are padded with zeros.
• Use other systems to get the word boundaries.
Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech, 2014.
Why Deep Learning?
Deeper is better?
• Deeper is better.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
What does DNN do?
[Figure: hidden-layer activations organise vowels by tongue position (front/back, high/low); vowel chart from http://www.ipachart.com/]
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech, 2014.
Speaker Adaptation
Categories of Methods
• Keep the adapted parameters close to their initialization, re-training with a little data from the target speaker.
• Transformation: insert an extra layer Wa between layer i and layer i+1, and learn only the extra layer from the adaptation data.
Transformation methods
• Add the extra layer between the input and the first layer.
• With splicing, the same extra layer Wa can be applied to each spliced frame (weight sharing), or one extra layer can span the whole spliced input.
• Wa can be a linear layer factored as U (M × K) times V (K × N); K is usually small, so few parameters must be estimated from the adaptation data.
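A numpy sketch of the extra linear layer; the feature dimension and rank K are illustrative. Initialising Wa to the identity leaves the unadapted network unchanged:

```python
import numpy as np

D, K = 351, 10        # spliced feature dimension, low-rank size (assumed)

# Full extra layer, initialised to the identity so that the network
# behaves exactly as before adaptation starts.
Wa = np.eye(D)

# Low-rank variant: Wa = I + U @ V; only D*K + K*D parameters need
# to be estimated from the small amount of target-speaker data.
U = np.zeros((D, K))
V = np.random.randn(K, D) * 0.01

def adapt(x):
    """Apply the (low-rank) extra linear layer to features x."""
    return x @ (np.eye(D) + U @ V).T

x = np.random.randn(5, D)
print(np.allclose(adapt(x), x))    # True: U is still zero
```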
Speaker-aware Training
• Summarise the data of each speaker (Speaker 1, Speaker 2, Speaker 3, ……) as a fixed-length, low-dimensional vector, and feed it to the network together with the acoustic features.
• Text transcription is not needed for extracting the vectors, so lots of mismatched data can be used.
• The same idea can also be noise-aware or device-aware.
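The speaker-aware input can be sketched as a simple concatenation; the 100-dim speaker vector (e.g. an i-vector) and frame counts are illustrative:

```python
import numpy as np

def speaker_aware_input(frames, speaker_vec):
    """Append a fixed-length speaker vector to every acoustic frame,
    so the DNN always sees who is speaking."""
    T = frames.shape[0]
    tiled = np.tile(speaker_vec, (T, 1))          # repeat for each frame
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(200, 39)   # 200 MFCC frames
ivec = np.random.randn(100)         # 100-dim speaker vector
x = speaker_aware_input(frames, ivec)
print(x.shape)                      # (200, 139)
```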
Speaker-aware Training
[Figure: training data from Speaker 1 and Speaker 2, each frame appended with its speaker vector, fed to the network]
Multitask Learning
• The multi-layer structure makes DNN suitable for multitask learning.
• Task A and Task B share the lower layers and one input feature, with separate output layers.
• Alternatively, separate input features for task A and task B can feed into shared upper layers.
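A toy forward pass with shared lower layers and two task-specific heads; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, A, B = 39, 128, 1000, 50     # input, shared hidden, task-A, task-B

W_shared = rng.standard_normal((D, H)) * 0.1   # layers shared by both tasks
W_taskA = rng.standard_normal((H, A)) * 0.1    # head for e.g. tri-phone states
W_taskB = rng.standard_normal((H, B)) * 0.1    # head for e.g. phonemes

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)          # shared representation (ReLU)
    return h @ W_taskA, h @ W_taskB            # two task-specific outputs

x = rng.standard_normal((10, D))
ya, yb = forward(x)
print(ya.shape, yb.shape)                      # (10, 1000) (10, 50)
```

Gradients from both tasks flow back into W_shared, which is why the shared layers benefit from the extra supervision.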
Multitask Learning - Multilingual
[Figure: Character Error Rate (25–45%) vs. hours of Mandarin training data (1–1000); training "With European Language" beats "Mandarin only", especially with little Mandarin data]
Other task pairs, all trained on the same acoustic features:
• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)
Deep Learning
for Acoustic Modeling
MFCC
Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN
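The chain above can be sketched end to end in numpy; frame handling is simplified (no overlap or pre-emphasis, and no delta features), and all sizes are illustrative:

```python
import numpy as np

def mfcc_pipeline(wave, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Simplified MFCC chain: DFT -> mel filter bank -> log -> DCT."""
    # Power spectrogram over non-overlapping Hann-windowed frames.
    frames = wave[:len(wave) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft))) ** 2

    # Triangular mel filter bank.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(spec @ fbank.T + 1e-10)      # log filter-bank output
    # DCT-II decorrelates the log filter-bank energies -> cepstra.
    T = logmel.shape[1]
    dct = np.cos(np.pi / T * (np.arange(T)[None, :] + 0.5)
                 * np.arange(n_ceps)[:, None])
    return logmel @ dct.T

wave = np.random.randn(16000)                    # 1 s of fake audio
print(mfcc_pipeline(wave).shape)                 # (31, 13)
```

Stopping after `logmel` gives the filter-bank output of the next slide; stopping after `spec` gives the spectrogram.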
Filter-bank Output
Waveform → DFT → spectrogram → filter bank → log → input of DNN
• Kind of standard now.
Spectrogram
Waveform → DFT → spectrogram → input of DNN
• Common today.
Waveform?
• Feed the waveform directly as the input of the DNN; if successful, no Signal & Systems needed.
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H. "Acoustic modeling with deep neural networks using raw time signal for LVCSR." Interspeech, 2014.
• Still need to take Signal & Systems ……
Convolutional
Neural Network (CNN)
CNN
• Speech can be treated as images: the spectrogram has a frequency axis and a time axis.
CNN
• The CNN takes the spectrogram as an image and outputs the probabilities of states.
CNN
• Max pooling over filter outputs (a1, a2 → max; b1, b2 → max) resembles the maxout activation.
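Maxout and max pooling can be sketched as follows; the shapes are illustrative:

```python
import numpy as np

def maxout(z, groups):
    """Maxout activation: take the max over each group of linear units."""
    T, D = z.shape
    return z.reshape(T, groups, D // groups).max(axis=2)

def max_pool(fmap, size=2):
    """2x2 max pooling over a (H, W) feature map."""
    H, W = fmap.shape
    return fmap[:H // size * size, :W // size * size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

z = np.array([[1.0, 3.0, 2.0, 0.0]])
print(maxout(z, groups=2))            # [[3. 2.]]

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool(fmap))                 # [[ 5.  7.], [13. 15.]]
```

In both cases only the largest activation in each group survives, which is what makes pooling in a CNN look like a maxout unit.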
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition." Interspeech, 2014.
Applications in
Acoustic Signal Processing
DNN for Speech Enhancement
• A DNN maps noisy speech to clean speech.
• It can also separate mixed speakers, e.g. female and male voices.
Concluding Remarks
More Research related to Speech
• Spoken content retrieval: find the lecture recordings related to "deep learning".
• Speech summarization: produce a summary of a recording.
• Speech recognition (e.g. recognising "Hello") provides the core techniques for both.