
Speech Recognition

Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Conventional
Speech Recognition

Machine Learning helps
audio X → Speech Recognition → "The weather is really nice" (天氣真好)

This is a structured learning problem.

Inference: W* = arg max_W P(W|X) = arg max_W P(X|W) P(W)
Input Representation
• Audio is represented by a vector sequence

X: x1 x2 x3 ……  (each xt is a 39-dim MFCC vector)
Input Representation - Splice
• To take temporal information into account, several neighboring MFCC frames are spliced (concatenated) into one longer input vector.
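A minimal numpy sketch of the splicing operation (the ±4-frame context width is an assumed value; real systems vary):

```python
import numpy as np

def splice(frames, context=4):
    """A sketch of splicing: concatenate each frame with its `context`
    left and right neighbors (edges padded by repeating the border frame)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    T = frames.shape[0]
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# e.g. 39-dim MFCC frames with +/-4 frames of context -> 351-dim vectors
mfcc = np.random.randn(100, 39)
spliced = splice(mfcc)   # shape (100, 351)
```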
Phoneme
• Phoneme: the basic unit. Each word corresponds to a sequence of phonemes, given by a lexicon.

Lexicon: what do you think → hh w aa t  d uw  y uw  th ih ng k

Different words can correspond to the same phonemes.
State
• Each phoneme corresponds to a sequence of states

what do you think
Phone:     hh w aa t  d uw  y uw  th ih ng k
Tri-phone: …… t-d+uw  d-uw+y  uw-y+uw  y-uw+th ……
State:     …… t-d+uw1 t-d+uw2 t-d+uw3  d-uw+y1 d-uw+y2 d-uw+y3 ……
State
• Each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM)

t-d+uw1 → P(x|"t-d+uw1")
d-uw+y3 → P(x|"d-uw+y3")
State
• Each state has a stationary distribution for acoustic features
• Tied-state: different states can share the same distribution; P(x|"d-uw+y3") and P(x|"t-d+uw1") can be pointers to the same address (the same GMM).
Acoustic Model

P(X|W) = P(X|S)

W: what do you think
S: a b c d e ……  (state sequence s1 s2 s3 s4 s5 …… given by the lexicon)
X: x1 x2 x3 x4 x5 ……

P(X|S) = Π_t P(st|st-1) P(xt|st)
         transition: P(st|st-1); emission: P(xt|st)
Acoustic Model

P(X|W) = P(X|S)

W: what do you think
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……

Actually, we don't know the alignment between the states and the acoustic features. In practice, the best alignment is found by dynamic programming: P(X|W) ≈ max_alignment P(X|S) (Viterbi algorithm).
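As a sketch of how the alignment is found, here is a bare-bones Viterbi forced alignment for a left-to-right HMM (transition scores are omitted for brevity; `log_emit` is assumed to hold per-frame log emission scores):

```python
import numpy as np

def viterbi_align(log_emit):
    """A minimal sketch of forced alignment for a left-to-right HMM.
    log_emit[t, s] = log P(x_t | state s); each step may stay in the
    same state or advance by one (transition scores omitted)."""
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]          # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log_emit[t, s]
    path = [S - 1]                        # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                     # state index for every frame
```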
How to use Deep Learning?

People imagine ……

…… acoustic features …… → DNN → "Hello"

This cannot be true! A DNN can only take fixed-length vectors as input.
What DNN can do is ……

P(a|xi) P(b|xi) P(c|xi) ……

• DNN input: one acoustic feature xi
• DNN output: probability of each state
  (size of the output layer = number of states)
Low rank approximation

Output layer weight matrix W: M × N
• N: size of the last hidden layer
• M: size of the output layer (number of states)

M can be large if the outputs are the states of tri-phones.
Low rank approximation
W (M × N) ≈ U (M × K) V (K × N), with K < M, N → fewer parameters

The output layer W is replaced by two smaller layers: a linear bottleneck V (K × N) followed by U (M × K).
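A minimal numpy sketch of the factorization (sizes are hypothetical): the trained W is approximated by a truncated SVD, cutting the parameter count from M·N to K·(M+N):

```python
import numpy as np

# Hypothetical sizes: N = 2048 hidden units, M = 9000 tri-phone states
N, M, K = 2048, 9000, 256
W = np.random.randn(M, N)          # stands in for trained output-layer weights

# Truncated SVD gives the best rank-K approximation of W
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :K] * s[:K]              # M x K
V = Vt[:K, :]                      # K x N

print(W.size, U.size + V.size)     # M*N vs K*(M+N) parameters
```

With these sizes the parameter count drops from about 18.4M to about 2.8M, roughly a 6.5× reduction.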
How we use deep learning
• There are three ways to use a DNN for acoustic modeling, in increasing order of effort for exploiting deep learning:
  – Way 1. Tandem
  – Way 2. DNN-HMM hybrid
  – Way 3. End-to-end
How to use Deep Learning?

Way 1: Tandem

Way 1: Tandem system
P(a|xi) P(b|xi) P(c|xi) ……

The DNN output for each frame xi (size of the output layer = no. of states) is used as the input of your original speech recognition system.

The last hidden layer or a bottleneck layer can also be used instead.
How to use Deep Learning?

Way 2: DNN-HMM hybrid

Way 2: DNN-HMM Hybrid

The HMM needs the emission probability P(x|state), but the DNN outputs the posterior P(state|x). Convert it with Bayes' rule:

P(x|state) = P(state|x) P(x) / P(state)

• P(state|x): from the DNN
• P(state): count from the training data
• P(x): can be dropped (it is the same for every state during decoding)
Way 2: DNN-HMM Hybrid

The assembled system combines:
• transition probabilities: from the original HMM
• state priors P(state): counted from the training data
• state posteriors P(state|x): from the DNN

This assembled vehicle works …….
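A minimal sketch of the conversion used during decoding (shapes are hypothetical): the DNN posteriors are divided by the state priors in the log domain:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Convert DNN state posteriors P(s|x) into scaled likelihoods
    P(x|s) proportional to P(s|x) / P(s), for HMM decoding (log domain)."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# Hypothetical example: 5 frames, 3 states
posteriors = np.random.dirichlet(np.ones(3), size=5)   # rows sum to 1
priors = np.array([0.5, 0.3, 0.2])                     # counted from training alignments
log_lik = scaled_likelihoods(posteriors, priors)       # fed to the HMM decoder
```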
Way 2: DNN-HMM Hybrid

• Sequential Training: train at the sequence level rather than frame by frame; increase the score of the correct word sequence and decrease the scores of competing word sequences.
How to use Deep Learning?

Way 3: End-to-end

Way 3: End-to-end - Character

• Input: acoustic features (spectrograms)
• Output: characters (and space) + null (~)
• No phonemes and no lexicon (no OOV problem)

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.
Way 3: End-to-end - Character

Frame-level output: the characters of "HIS FRIEND'S" interleaved with long runs of null (~) symbols; removing the nulls and merging repeated characters recovers the text.

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
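A minimal sketch of the post-processing implied by the figure (greedy decoding): merge repeated symbols, then drop the nulls:

```python
def ctc_collapse(frames, null="~"):
    """Greedy CTC-style decoding: merge repeated symbols, drop nulls.
    The null symbol separates genuine repeated characters."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != null:
            out.append(sym)
        prev = sym
    return "".join(out)

# e.g. ctc_collapse(list("~~HH~III~S~")) == "HIS"
```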

Way 3: End-to-end – Word?

apple bog cat ……

• Output layer: ~50k words in the lexicon
• Input layer: size = 200 times one acoustic feature, padded with zeros
• Use other systems to get the word boundaries

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech 2014.
Why Deep Learning?

CS8691-AI-UNIT-V-Speech Recognition
29of 68
Deeper is better?

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

• Word error rate (WER)

[Table from the paper comparing networks with multiple hidden layers against a single wide hidden layer: deeper is better.]
For a fixed number of parameters, a deep model is clearly better than a shallow one.
What does DNN do?

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP, 2012.

• Speaker normalization is automatically done in the DNN

[Visualization: input acoustic features (MFCC) vs. the output of the 1st hidden layer.]
[Visualization: input acoustic features (MFCC) vs. the output of the 8th hidden layer: by the 8th layer, speaker differences are largely normalized away.]
What does DNN do?

• In ordinary acoustic models, all the states are modeled independently
  – Not an effective way to model the human voice

The sound of a vowel is only controlled by a few factors, e.g., tongue position (front/back, high/low). http://www.ipachart.com/
What does DNN do?

Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech 2014.

[Figure: the output of a hidden layer reduced to two dimensions: the vowels /i/, /u/, /e/, /o/, /a/ arrange themselves along the front/back and high/low axes of the vowel chart.]

• The lower layers detect the manner of articulation
• All the states share the results from the same set of detectors
• Parameters are used effectively
Speaker Adaptation

Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers
  – Collect the audio data of each speaker
• A DNN model for each speaker
  – Challenge: limited data for training
    • Not enough data to directly train a DNN model
    • Not enough data to even fine-tune a speaker-independent DNN model
Categories of Methods

• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters

(From top to bottom, the methods need less and less training data.)
Conservative Training
Initialize from a speaker-independent DNN trained on the audio data of many speakers, then fine-tune on a little data from the target speaker, with the constraint that the adapted network stays close to the original: either the outputs stay close ("output close") or the parameters stay close ("parameter close").
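A minimal sketch of the "parameter close" constraint (rho is a hypothetical trade-off weight); the "output close" variant would instead penalize the divergence between the adapted and original output distributions:

```python
import numpy as np

def conservative_loss(task_loss, params, params_si, rho=0.01):
    """A sketch of conservative training: the usual task loss plus an
    L2 penalty keeping the adapted parameters close to the
    speaker-independent ones (params_si)."""
    penalty = sum(np.sum((w - w0) ** 2) for w, w0 in zip(params, params_si))
    return task_loss + rho * penalty
```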
Transformation methods
Add an extra layer Wa between layer i and layer i+1, and fix all the other parameters (W). Only the extra layer is trained on the little data from the target speaker.
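A minimal sketch of such an extra layer (the identity initialization is an assumption, chosen so the unadapted network is unchanged before any training):

```python
import numpy as np

class LinearAdaptLayer:
    """A square linear transform inserted between two fixed layers;
    only Wa (and b) are updated during speaker adaptation."""
    def __init__(self, dim):
        self.Wa = np.eye(dim)      # identity: no effect before adaptation
        self.b = np.zeros(dim)

    def forward(self, h):          # h: activations from the layer below
        return h @ self.Wa.T + self.b
```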
Transformation methods
• Add the extra layer between the input and the first layer
• With splicing: the same extra transform Wa is applied to every spliced frame (the copies share parameters)

A larger Wa needs more adaptation data; a smaller Wa needs less data.
Transformation methods
• SVD bottleneck adaptation

After the low-rank factorization W ≈ U V, an adaptation matrix Wa (K × K) is inserted at the linear bottleneck between V (K × N) and U (M × K). K is usually small, so Wa has few parameters to train.
Speaker-aware Training

From the data of each speaker (speaker 1, speaker 2, speaker 3, ……), extract a fixed-length, low-dimension vector that characterizes the speaker. Text transcription is not needed for extracting the vectors, so lots of mismatched data can be used.

The same idea also gives noise-aware or device-aware training, ……
Speaker-aware Training

Training data: the acoustic features of each training speaker (speaker 1, speaker 2, ……) are appended with that speaker's information features.

Testing data: the test frames are likewise appended with the test speaker's vector.

DNN input = [acoustic feature | speaker information]

All the speakers use the same DNN model; different speakers are simply augmented by different features.
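A minimal sketch of the input augmentation (the 100-dim speaker vector, e.g. an i-vector, is assumed to be precomputed):

```python
import numpy as np

def speaker_aware_input(acoustic, spk_vec):
    """Append a per-speaker vector to every acoustic frame of that speaker."""
    T = acoustic.shape[0]                      # number of frames
    tiled = np.tile(spk_vec, (T, 1))           # same vector for all frames
    return np.concatenate([acoustic, tiled], axis=1)

# e.g. 39-dim MFCC frames + 100-dim speaker vector -> 139-dim DNN input
frames = np.random.randn(200, 39)
ivec = np.random.randn(100)
x = speaker_aware_input(frames, ivec)          # shape (200, 139)
```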
Multi-task Learning

CS8691-AI-UNIT-V-Speech Recognition
45of 68
Multitask Learning
• The multi-layer structure makes the DNN suitable for multitask learning

[Two architectures: (a) Task A and Task B share the lower layers on a common input feature; (b) Task A and Task B have separate input features and share the middle layers.]
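A minimal sketch of architecture (a) (biases omitted; shapes are hypothetical):

```python
import numpy as np

def multitask_forward(x, shared_Ws, W_a, W_b):
    """Hidden layers shared by both tasks, followed by one output
    layer per task."""
    h = x
    for W in shared_Ws:              # shared hidden layers
        h = np.maximum(0.0, h @ W)   # ReLU
    return h @ W_a, h @ W_b          # task A and task B outputs
```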
Multitask Learning - Multilingual

One network is trained on the acoustic features with shared hidden layers; only the output layers (the states of French, German, Spanish, Italian, and Mandarin) are language-specific.

Human languages share some common characteristics.
Multitask Learning - Multilingual
[Plot: character error rate vs. hours of Mandarin training data (1 to 1000 h). The system trained with European languages achieves a lower error rate than the Mandarin-only system across data sizes.]

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP, 2013.
Multitask Learning - Different units

Two tasks A and B are trained on the same acoustic features with shared lower layers:
• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)
Deep Learning
for Acoustic Modeling

New acoustic features

MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN
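A sketch of this pipeline using librosa (assumed available; the file name is a placeholder), producing the 39-dim features mentioned earlier:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # DFT -> mel filter bank -> log -> DCT
delta = librosa.feature.delta(mfcc)                  # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)        # second-order differences
features = np.vstack([mfcc, delta, delta2])          # 39 dims per frame
```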
Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log → input of DNN

• Kind of standard now
Spectrogram

Waveform → DFT → spectrogram → input of DNN

• Common today
• 5% relative improvement over filter-bank output
• Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU, 2013.
Waveform?

Waveform → input of DNN directly

• If this succeeds, no more Signals & Systems!
• People have tried, but it is not better than the spectrogram yet
• Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH, 2014.
• Still need to take Signals & Systems ……
Convolutional
Neural Network (CNN)

CNN
• Speech can be treated as images

[A spectrogram is a 2-D image: frequency on the vertical axis, time on the horizontal axis.]
CNN

Replace the DNN with a CNN: the spectrogram is treated as an image, and the CNN outputs the probabilities of the states.
CNN

[Figure: max pooling takes the max over neighboring filter outputs (a1, a2); maxout takes the max over a group of linear units (b1, b2) as the activation.]
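A minimal numpy sketch of the maxout activation (group size 2, matching the figure):

```python
import numpy as np

def maxout(z, group_size=2):
    """Maxout activation: split the linear outputs into groups and
    take each group's max as the activation."""
    T, D = z.shape
    assert D % group_size == 0
    return z.reshape(T, D // group_size, group_size).max(axis=2)

# e.g. 4 linear units per frame -> 2 maxout activations per frame
z = np.array([[1.0, -2.0, 0.5, 3.0]])
print(maxout(z))   # [[1.  3. ]]
```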
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition." Interspeech 2014.
Applications in
Acoustic Signal Processing

DNN for Speech Enhancement

Noisy speech → DNN → clean speech
(for mobile communication or speech recognition)

• Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
DNN for Voice Conversion

Speech of one speaker (e.g., female) → DNN → speech of another speaker (e.g., male)

• Demo for voice conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
Concluding Remarks

Concluding Remarks

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Thank you for
your attention!

More Research Related to Speech

Speech recognition (audio → "Hello") is the core technique behind many applications:
• Spoken content retrieval: find the lectures related to "deep learning" in lecture recordings
• Speech summarization: produce a summary of a spoken document
• Dialogue: "I would like to leave Taipei on November 2nd" / "Hi" / "Hello"
• Computer-assisted language learning
• Information extraction