
Speech Recognition

Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Conventional
Speech Recognition

Machine Learning helps
audio X → Speech Recognition → "The weather is really nice" (天氣真好)

This is a structured learning problem.

Inference: W* = arg max_W P(W|X) = arg max_W P(X|W) P(W)
Input Representation
• Audio is represented by a vector sequence

X: x1 x2 x3 ……  (each xt is a 39-dim MFCC vector)
Input Representation - Splice
• To take temporal information into account, several neighboring MFCC frames are spliced (concatenated) into one longer input vector.
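A minimal numpy sketch of the splicing operation (the ±4-frame context width is an assumed value; real systems vary):

```python
import numpy as np

def splice(frames, context=4):
    """A sketch of splicing: concatenate each frame with its `context`
    left and right neighbors (edges padded by repeating the border frame)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    T = frames.shape[0]
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# e.g. 39-dim MFCC frames with +/-4 frames of context -> 351-dim vectors
mfcc = np.random.randn(100, 39)
spliced = splice(mfcc)   # shape (100, 351)
```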
Phoneme
• Phoneme: the basic unit. Each word corresponds to a sequence of phonemes, given by a lexicon.

Lexicon: what do you think → hh w aa t  d uw  y uw  th ih ng k

Different words can correspond to the same phonemes.
State
• Each phoneme corresponds to a sequence of states

what do you think
Phone:     hh w aa t  d uw  y uw  th ih ng k
Tri-phone: …… t-d+uw  d-uw+y  uw-y+uw  y-uw+th ……
State:     …… t-d+uw1 t-d+uw2 t-d+uw3  d-uw+y1 d-uw+y2 d-uw+y3 ……
State
• Each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM)

t-d+uw1 → P(x|"t-d+uw1")
d-uw+y3 → P(x|"d-uw+y3")
State
• Each state has a stationary distribution for acoustic features
• Tied-state: different states can share the same distribution; P(x|"d-uw+y3") and P(x|"t-d+uw1") can be pointers to the same address (the same GMM).
Acoustic Model

P(X|W) = P(X|S)

W: what do you think
S: a b c d e ……  (state sequence s1 s2 s3 s4 s5 …… given by the lexicon)
X: x1 x2 x3 x4 x5 ……

P(X|S) = Π_t P(st|st-1) P(xt|st)
         transition: P(st|st-1); emission: P(xt|st)
Acoustic Model

P(X|W) = P(X|S)

W: what do you think
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……

Actually, we don't know the alignment between the states and the acoustic features. In practice, the best alignment is found by dynamic programming: P(X|W) ≈ max_alignment P(X|S) (Viterbi algorithm).
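As a sketch of how the alignment is found, here is a bare-bones Viterbi forced alignment for a left-to-right HMM (transition scores are omitted for brevity; `log_emit` is assumed to hold per-frame log emission scores):

```python
import numpy as np

def viterbi_align(log_emit):
    """A minimal sketch of forced alignment for a left-to-right HMM.
    log_emit[t, s] = log P(x_t | state s); each step may stay in the
    same state or advance by one (transition scores omitted)."""
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]          # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log_emit[t, s]
    path = [S - 1]                        # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                     # state index for every frame
```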
How to use Deep Learning?

People imagine ……

…… acoustic features …… → DNN → "Hello"

This cannot be true! A DNN can only take fixed-length vectors as input.
What DNN can do is ……

P(a|xi) P(b|xi) P(c|xi) ……

• DNN input: one acoustic feature xi
• DNN output: probability of each state
  (size of the output layer = number of states)
Low rank approximation

Output layer weight matrix W: M × N
• N: size of the last hidden layer
• M: size of the output layer (number of states)

M can be large if the outputs are the states of tri-phones.
Low rank approximation
W (M × N) ≈ U (M × K) V (K × N), with K < M, N → fewer parameters

The output layer W is replaced by two smaller layers: a linear bottleneck V (K × N) followed by U (M × K).
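A minimal numpy sketch of the factorization (sizes are hypothetical): the trained W is approximated by a truncated SVD, cutting the parameter count from M·N to K·(M+N):

```python
import numpy as np

# Hypothetical sizes: N = 2048 hidden units, M = 9000 tri-phone states
N, M, K = 2048, 9000, 256
W = np.random.randn(M, N)          # stands in for trained output-layer weights

# Truncated SVD gives the best rank-K approximation of W
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :K] * s[:K]              # M x K
V = Vt[:K, :]                      # K x N

print(W.size, U.size + V.size)     # M*N vs K*(M+N) parameters
```

With these sizes the parameter count drops from about 18.4M to about 2.8M, roughly a 6.5× reduction.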
How we use deep learning
• There are three ways to use a DNN for acoustic modeling, in increasing order of effort for exploiting deep learning:
  – Way 1. Tandem
  – Way 2. DNN-HMM hybrid
  – Way 3. End-to-end
How to use Deep Learning?

Way 1: Tandem

Way 1: Tandem system
P(a|xi) P(b|xi) P(c|xi) ……

The DNN output for each frame xi (size of the output layer = no. of states) is used as the input of your original speech recognition system.

The last hidden layer or a bottleneck layer can also be used instead.
How to use Deep Learning?

Way 2: DNN-HMM hybrid

Way 2: DNN-HMM Hybrid

The HMM needs the emission probability P(x|state), but the DNN outputs the posterior P(state|x). Convert it with Bayes' rule:

P(x|state) = P(state|x) P(x) / P(state)

• P(state|x): from the DNN
• P(state): count from the training data
• P(x): can be dropped (it is the same for every state during decoding)
Way 2: DNN-HMM Hybrid

The assembled system combines:
• transition probabilities: from the original HMM
• state priors P(state): counted from the training data
• state posteriors P(state|x): from the DNN

This assembled vehicle works …….
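A minimal sketch of the conversion used during decoding (shapes are hypothetical): the DNN posteriors are divided by the state priors in the log domain:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Convert DNN state posteriors P(s|x) into scaled likelihoods
    P(x|s) proportional to P(s|x) / P(s), for HMM decoding (log domain)."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# Hypothetical example: 5 frames, 3 states
posteriors = np.random.dirichlet(np.ones(3), size=5)   # rows sum to 1
priors = np.array([0.5, 0.3, 0.2])                     # counted from training alignments
log_lik = scaled_likelihoods(posteriors, priors)       # fed to the HMM decoder
```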
Way 2: DNN-HMM Hybrid

• Sequential Training: train at the sequence level rather than frame by frame; increase the score of the correct word sequence and decrease the scores of competing word sequences.
How to use Deep Learning?

Way 3: End-to-end

Way 3: End-to-end - Character

• Input: acoustic features (spectrograms)
• Output: characters (and space) + null (~)
• No phonemes and no lexicon (no OOV problem)

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.
Way 3: End-to-end - Character

Frame-level output: the characters of "HIS FRIEND'S" interleaved with long runs of null (~) symbols; removing the nulls and merging repeated characters recovers the text.

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
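A minimal sketch of the post-processing implied by the figure (greedy decoding): merge repeated symbols, then drop the nulls:

```python
def ctc_collapse(frames, null="~"):
    """Greedy CTC-style decoding: merge repeated symbols, drop nulls.
    The null symbol separates genuine repeated characters."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != null:
            out.append(sym)
        prev = sym
    return "".join(out)

# e.g. ctc_collapse(list("~~HH~III~S~")) == "HIS"
```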

Way 3: End-to-end – Word?

apple bog cat ……

• Output layer: ~50k words in the lexicon
• Input layer: size = 200 times one acoustic feature, padded with zeros
• Use other systems to get the word boundaries

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech 2014.
Why Deep Learning?

CS8691-AI-UNIT-V-Speech Recognition
29of 68
Deeper is better?

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

• Word error rate (WER)

[Table from the paper comparing networks with multiple hidden layers against a single wide hidden layer: deeper is better.]
For a fixed number of parameters, a deep model is clearly better than a shallow one.
What does DNN do?

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP, 2012.

• Speaker normalization is automatically done in the DNN

[Visualization: input acoustic features (MFCC) vs. the output of the 1st hidden layer.]
[Visualization: input acoustic features (MFCC) vs. the output of the 8th hidden layer: by the 8th layer, speaker differences are largely normalized away.]
What does DNN do?

• In ordinary acoustic models, all the states are modeled independently
  – Not an effective way to model the human voice

The sound of a vowel is only controlled by a few factors, e.g., tongue position (front/back, high/low). http://www.ipachart.com/
What does DNN do?

Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech 2014.

[Figure: the output of a hidden layer reduced to two dimensions: the vowels /i/, /u/, /e/, /o/, /a/ arrange themselves along the front/back and high/low axes of the vowel chart.]

• The lower layers detect the manner of articulation
• All the states share the results from the same set of detectors
• Parameters are used effectively
Speaker Adaptation

Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers
  – Collect the audio data of each speaker
• A DNN model for each speaker
  – Challenge: limited data for training
    • Not enough data to directly train a DNN model
    • Not enough data to even fine-tune a speaker-independent DNN model
Categories of Methods

• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters

(From top to bottom, the methods need less and less training data.)
Conservative Training
Initialize from a speaker-independent DNN trained on the audio data of many speakers, then fine-tune on a little data from the target speaker, with the constraint that the adapted network stays close to the original: either the outputs stay close ("output close") or the parameters stay close ("parameter close").
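A minimal sketch of the "parameter close" constraint (rho is a hypothetical trade-off weight); the "output close" variant would instead penalize the divergence between the adapted and original output distributions:

```python
import numpy as np

def conservative_loss(task_loss, params, params_si, rho=0.01):
    """A sketch of conservative training: the usual task loss plus an
    L2 penalty keeping the adapted parameters close to the
    speaker-independent ones (params_si)."""
    penalty = sum(np.sum((w - w0) ** 2) for w, w0 in zip(params, params_si))
    return task_loss + rho * penalty
```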
Transformation methods
Add an extra layer Wa between layer i and layer i+1, and fix all the other parameters (W). Only the extra layer is trained on the little data from the target speaker.
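A minimal sketch of such an extra layer (the identity initialization is an assumption, chosen so the unadapted network is unchanged before any training):

```python
import numpy as np

class LinearAdaptLayer:
    """A square linear transform inserted between two fixed layers;
    only Wa (and b) are updated during speaker adaptation."""
    def __init__(self, dim):
        self.Wa = np.eye(dim)      # identity: no effect before adaptation
        self.b = np.zeros(dim)

    def forward(self, h):          # h: activations from the layer below
        return h @ self.Wa.T + self.b
```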
Transformation methods
• Add the extra layer between the input and the first layer
• With splicing: the same extra transform Wa is applied to every spliced frame (the copies share parameters)

A larger Wa needs more adaptation data; a smaller Wa needs less data.
Transformation methods
• SVD bottleneck adaptation

After the low-rank factorization W ≈ U V, an adaptation matrix Wa (K × K) is inserted at the linear bottleneck between V (K × N) and U (M × K). K is usually small, so Wa has few parameters to train.
Speaker-aware Training

From the data of each speaker (speaker 1, speaker 2, speaker 3, ……), extract a fixed-length, low-dimension vector that characterizes the speaker. Text transcription is not needed for extracting the vectors, so lots of mismatched data can be used.

The same idea also gives noise-aware or device-aware training, ……
Speaker-aware Training

Training data: the acoustic features of each training speaker (speaker 1, speaker 2, ……) are appended with that speaker's information features.

Testing data: the test frames are likewise appended with the test speaker's vector.

DNN input = [acoustic feature | speaker information]

All the speakers use the same DNN model; different speakers are simply augmented by different features.
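A minimal sketch of the input augmentation (the 100-dim speaker vector, e.g. an i-vector, is assumed to be precomputed):

```python
import numpy as np

def speaker_aware_input(acoustic, spk_vec):
    """Append a per-speaker vector to every acoustic frame of that speaker."""
    T = acoustic.shape[0]                      # number of frames
    tiled = np.tile(spk_vec, (T, 1))           # same vector for all frames
    return np.concatenate([acoustic, tiled], axis=1)

# e.g. 39-dim MFCC frames + 100-dim speaker vector -> 139-dim DNN input
frames = np.random.randn(200, 39)
ivec = np.random.randn(100)
x = speaker_aware_input(frames, ivec)          # shape (200, 139)
```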
Multi-task Learning

CS8691-AI-UNIT-V-Speech Recognition
45of 68
Multitask Learning
• The multi-layer structure makes the DNN suitable for multitask learning

[Two architectures: (a) Task A and Task B share the lower layers on a common input feature; (b) Task A and Task B have separate input features and share the middle layers.]
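A minimal sketch of architecture (a) (biases omitted; shapes are hypothetical):

```python
import numpy as np

def multitask_forward(x, shared_Ws, W_a, W_b):
    """Hidden layers shared by both tasks, followed by one output
    layer per task."""
    h = x
    for W in shared_Ws:              # shared hidden layers
        h = np.maximum(0.0, h @ W)   # ReLU
    return h @ W_a, h @ W_b          # task A and task B outputs
```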
Multitask Learning - Multilingual

One network is trained on the acoustic features with shared hidden layers; only the output layers (the states of French, German, Spanish, Italian, and Mandarin) are language-specific.

Human languages share some common characteristics.
Multitask Learning - Multilingual
[Plot: character error rate vs. hours of Mandarin training data (1 to 1000 h). The system trained with European languages achieves a lower error rate than the Mandarin-only system across data sizes.]

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP, 2013.
Multitask Learning - Different units

Two tasks A and B are trained on the same acoustic features with shared lower layers:
• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)
Deep Learning
for Acoustic Modeling

New acoustic features

MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN
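A sketch of this pipeline using librosa (assumed available; the file name is a placeholder), producing the 39-dim features mentioned earlier:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # DFT -> mel filter bank -> log -> DCT
delta = librosa.feature.delta(mfcc)                  # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)        # second-order differences
features = np.vstack([mfcc, delta, delta2])          # 39 dims per frame
```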
Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log → input of DNN

• Kind of standard now
Spectrogram

Waveform → DFT → spectrogram → input of DNN

• Common today
• 5% relative improvement over filter-bank output
• Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU, 2013.
Waveform?

Waveform → input of DNN directly

• If this succeeds, no more Signals & Systems!
• People have tried, but it is not better than the spectrogram yet
• Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH, 2014.
• Still need to take Signals & Systems ……
Convolutional
Neural Network (CNN)

CNN
• Speech can be treated as images

[A spectrogram is a 2-D image: frequency on the vertical axis, time on the horizontal axis.]
CNN

Replace the DNN with a CNN: the spectrogram is treated as an image, and the CNN outputs the probabilities of the states.
CNN

[Figure: max pooling takes the max over neighboring filter outputs (a1, a2); maxout takes the max over a group of linear units (b1, b2) as the activation.]
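A minimal numpy sketch of the maxout activation (group size 2, matching the figure):

```python
import numpy as np

def maxout(z, group_size=2):
    """Maxout activation: split the linear outputs into groups and
    take each group's max as the activation."""
    T, D = z.shape
    assert D % group_size == 0
    return z.reshape(T, D // group_size, group_size).max(axis=2)

# e.g. 4 linear units per frame -> 2 maxout activations per frame
z = np.array([[1.0, -2.0, 0.5, 3.0]])
print(maxout(z))   # [[1.  3. ]]
```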
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition." Interspeech 2014.
Applications in
Acoustic Signal Processing

DNN for Speech Enhancement

Noisy speech → DNN → clean speech
(for mobile communication or speech recognition)

• Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
DNN for Voice Conversion

Speech of one speaker (e.g., female) → DNN → speech of another speaker (e.g., male)

• Demo for voice conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
Concluding Remarks

Concluding Remarks

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Thank you for
your attention!

More Research Related to Speech

Speech recognition (audio → "Hello") is the core technique behind many applications:
• Spoken content retrieval: find the lectures related to "deep learning" in lecture recordings
• Speech summarization: produce a summary of a spoken document
• Dialogue: "I would like to leave Taipei on November 2nd" / "Hi" / "Hello"
• Computer-assisted language learning
• Information extraction