Speech Technologies
p. 2
ASR Early Applications
Telephony applications: automated call centers
replacing interactive voice response (IVR) systems
1992: AT&T automation of operator services
• 5 phrases: “collect”, “calling card”, “operator”, “person
to person”, “third number”
• >1 billion operator-assisted calls a year
• saves >US$300 million a year.
1996: Charles Schwab’s automated retail brokering
• understands 15,000 names of stocks or funds
• new accounts up by 41%; 100,000 calls/day;
• cost per call is cut from $4-5 to $1.
p. 3
ASR Consumer Applications
Dictation: IBM (ViaVoice), Dragon Dictate, 話咁易 (Cantonese ViaVoice; lit. “as easy as talking”)
Education: spoken language learning
p. 4
ASR New Applications
Voice search
Voice typing
Digital personal
assistants
Voice activation
Auto-generated
captions on YouTube
p. 5
Statistical ASR
Word sequence: W = w_1 w_2 … w_N
Speech feature sequence: X = x_1 x_2 … x_T
Recognized words: W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
P(X|W): acoustic model
P(W): language model
p. 6
ASR System
p. 7
Language Modeling (LM)
Statistics about co-occurrences of words
bigram LM: P(w_i | w_{i-1})
trigram LM: P(w_i | w_{i-2}, w_{i-1})
Example: Stanford’s AI 100 report: count occurrences of each word, e.g., for ”AI”:
• #(AI) = 19, #(AI in) = 3, #(AI technologies) = 2, …
☞ P(in|AI) = 3/19, P(technologies|AI) = 2/19, …
Google’s 1B-word LM
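The counting scheme above can be sketched in a few lines of Python. The toy corpus below is constructed so the counts match the slide’s numbers; it is not the actual AI 100 text:

```python
from collections import Counter

def train_bigram_lm(tokens):
    """Estimate bigram probabilities P(w2 | w1) from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lambda w2, w1: (bigrams[(w1, w2)] / unigrams[w1]) if unigrams[w1] else 0.0

# Toy corpus: "AI" occurs 19 times, followed by "in" 3 times and "technologies" 2 times
tokens = ["AI", "in"] * 3 + ["AI", "technologies"] * 2 + ["AI", "systems"] * 14
p = train_bigram_lm(tokens)
print(p("in", "AI"))            # 3/19
print(p("technologies", "AI"))  # 2/19
```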
p. 8
Example: Spoken Digit Recognition
[Figure: template-based digit recognizer; an input waveform goes through feature extraction, then recognition against reference templates for “zero”, “one”, …, “nine”; the example utterance is recognized as “four”.]
p. 9
Too Many Words?
How many words in the English language?
p. 10
Elements: Basic Units in Chemistry
p. 11
Phonemes: Basic Units in Speech
http://www.englishclub.com/pronunciation/phonemic-chart-ia.htm
p. 12
Phonemes
English: 40 – 60 phonemes
A spoken word is a sequence of phonemes:
cat = /k/ /ae/ /t/
Speech recognition: to recognize an utterance
which is a sequence of words
=> a sequence of sequences of phonemes
“a pretty cat” = /ah/ /p/ /r/ /ih/ /t/ /iy/ /k/ /ae/ /t/
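A pronunciation lexicon makes this expansion mechanical. The mini-lexicon below is a hypothetical illustration using the slide’s phoneme symbols:

```python
# Hypothetical mini pronunciation lexicon using the slide's phoneme symbols
LEXICON = {
    "a": ["ah"],
    "pretty": ["p", "r", "ih", "t", "iy"],
    "cat": ["k", "ae", "t"],
}

def utterance_to_phonemes(words):
    """Expand a word sequence into the flat phoneme sequence the recognizer models."""
    return [ph for w in words for ph in LEXICON[w]]

print(utterance_to_phonemes(["a", "pretty", "cat"]))
# ['ah', 'p', 'r', 'ih', 't', 'iy', 'k', 'ae', 't']
```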
p. 13
Visible Speech
spectrogram
p. 14
Spectrogram Reading
p. 15
Spectrogram Reading
p. 16
Speech Visualization
[ https://www.speechandhearing.net/laboratory/wasp/ ]
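A spectrogram is computed with a short-time Fourier transform: slice the signal into overlapping frames, window each frame, and take the magnitude of its FFT. A minimal NumPy sketch (frame length and hop are arbitrary choices):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT: overlapping frames, Hann window, FFT per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len//2 + 1)

# A 1 kHz tone sampled at 8 kHz: energy should concentrate in one frequency bin
fs, f0 = 8000, 1000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * f0 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * fs / 256)  # 1000.0 Hz
```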
p. 17
Difficulties in ASR: High Variability
Intra-speaker variability
age, emotion, physical condition, prosody
Inter-speaker variability
gender, age, pitch, accent, speaking style/rate
Co-articulatory effects
“heed”, “heal”, “real”
Triphone: a phoneme with specific left and right
contextual phonemes
“heed”: /#-h+iy/ /h-iy+d/ /iy-d+#/
p. 18
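The triphone expansion can be sketched directly, using “#” for the utterance boundary as in the “heed” example:

```python
def to_triphones(phones):
    """Expand a phoneme sequence into context-dependent triphones /left-centre+right/."""
    padded = ["#"] + phones + ["#"]  # "#" marks the utterance boundary
    return [f"/{padded[i-1]}-{padded[i]}+{padded[i+1]}/"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["h", "iy", "d"]))  # ['/#-h+iy/', '/h-iy+d/', '/iy-d+#/']
```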
Vilfredo Pareto’s 80/20 Principle
WSJ corpus: 80% of samples come from the most
frequent 20% of all seen triphones.
p. 19
Difficulties in ASR (cont.)
Language variability/ambiguity
Highly confusable sounds: /p/ vs. /b/; “man” vs. “men”
Multiple pronunciations: “a”, “the”
Multiple meanings: “go”, “bank”, “time flies”
Homophones: “two”, “too”; “ice-cream”, “I scream”
Lack of clear boundaries between acoustic
units (e.g., words, syllables, phonemes) in
continuous speech.
Variable lengths
p. 20
Difficulties in ASR: High Variability (cont.)
Channel variability
Different types of microphones or telephone
handsets
fixed line, wireless, mobile
room acoustics, radio/TV broadcast
Noise variability
Different kinds of noises
Different signal-to-noise ratios (SNR)
p. 21
Hidden Markov Model in a Nutshell
[Figure: an urn-and-ball example; each box (Box #1, …) holds red (R), green (G), and blue (B) balls, with its own emission probabilities P(R), P(G), P(B).]
p. 22
HMM in a Nutshell (cont.)
[Figure: the boxes become hidden states; arrows labeled with transition probabilities (0.3, 0.1, 0.5, 0.2, 0.4) connect them, while each box keeps its own emission probabilities P(R), P(G), P(B).]
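The box example can be turned into a tiny forward-algorithm computation. All probabilities below are made-up illustrative values, not the (partly unreadable) ones in the slide’s figure:

```python
import numpy as np
from itertools import product

pi = np.array([0.5, 0.5])           # initial box probabilities (illustrative)
A = np.array([[0.7, 0.3],           # A[i, j] = P(next box j | current box i)
              [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1],      # B[i, k] = P(ball colour k | box i), k in {R, G, B}
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Forward algorithm: P(observation sequence), summing over all hidden box paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

obs = [0, 2, 1]  # Red, Blue, Green
print(forward(obs))

# Sanity check: brute-force enumeration of every hidden path gives the same probability
brute = sum(pi[s[0]] * B[s[0], obs[0]] *
            np.prod([A[s[i - 1], s[i]] * B[s[i], obs[i]] for i in range(1, len(obs))])
            for s in product(range(2), repeat=len(obs)))
print(abs(forward(obs) - brute) < 1e-12)  # True
```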
p. 25
Machine Learning: Mimic the Brain
Neurons = nerve cells
~10^11 neurons
Each connects to up to 10,000
other neurons
~10^15 synaptic connections
[Figure: a human neuron]
p. 26
Deep Learning
Add many more hidden layers:
deep neural networks (DNN)
Geoffrey Hinton
p. 27
Typical DNN for ASR
#input units: 500 – 1000
#hidden layers: 5 – 7
#hidden units / layer: 1K – 2K
#output units: 5K – 10K
(= the number of triphone HMM states)
Total #weights: 30M – 100M
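As a sanity check on the quoted total, we can count dense-layer weights (biases ignored) for the largest configuration above:

```python
# Back-of-the-envelope weight count for a fully connected DNN (biases ignored)
def dnn_weights(n_in, n_hidden, n_layers, n_out):
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

# Largest quoted configuration: 1000 inputs, 7 hidden layers of 2000 units, 10K outputs
total = dnn_weights(1000, 2000, 7, 10000)
print(f"{total:,}")  # 46,000,000, inside the quoted 30M - 100M range
```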
p. 28
Recurrent Neural Network (RNN)
[Figure: side-by-side network diagrams with input value x_t, hidden value h_t, and output value y_t]
feed-forward NN: h_t = f(x_t); the output at time t considers only the transformed input at time t.
recurrent NN: h_t = f(x_t, h_{t-1}), where h_{t-1} is the previous hidden value; the output at time t considers all the past transformed inputs.
p. 29
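A single recurrent step can be sketched in NumPy; the weight values below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
Wx = rng.standard_normal((n_hidden, n_in)) * 0.1      # input-to-hidden weights
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One recurrent step: the new hidden state mixes the current input with
    the previous hidden state, so h_t summarizes the whole past."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(n_hidden)
for x_t in rng.standard_normal((5, n_in)):  # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h.shape)  # (8,)
```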
RNN Unfolded in Time
[Figure: RNN unfolded in time; outputs y_1, y_2, y_3, …, y_T, with each step’s hidden unit receiving the previous hidden state h_{t-1}.]
Zhiyong Cui, et al., “Deep Stacked Bidirectional and Unidirectional LSTM Recurrent Neural
Network for Network-wide Traffic Speed Prediction,” CoRR abs/1801.02143, 2018.
p. 32
Typical BLSTM-RNN for ASR
#input units: 40+
#hidden layers: 5
#hidden units / layer: 512 × 2 (bidirectional)
#output units: 10K – 30K
the number of triphone HMM states
Total #weights: 30M – 60M
p. 33
Switchboard Conversational Speech
Switchboard: 309 hours of conversations; vocabulary ≈ 30K words
p. 34
ASR Progress
Year     Lab-School / Technique                        Word Error Rate
1992                                                   > 70.0%
1997     CMU: GMM + extra 2000-hr data                 35.1%
< 2005   GMM + “careful engineering” + extra data      < 20%
2011     Microsoft: DNN                                16.1%
2014     IBM: DNN + CNN                                10.4%
p. 37
Multi-task Learning
Improve the generalization performance of a
learning task by jointly learning multiple
related tasks.
Requirements:
A shared common representation
Multiple error signals
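A minimal sketch of the shared-representation idea: one common hidden layer feeds two hypothetical task heads, each supplying its own error signal. Sizes and weights below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_shared = 10, 32

# One shared hidden layer (the common representation) ...
W_shared = rng.standard_normal((d_shared, d_in)) * 0.1
# ... feeding two task-specific output heads (two error signals in training)
W_taskA = rng.standard_normal((3, d_shared)) * 0.1  # e.g., a 3-class task
W_taskB = rng.standard_normal((5, d_shared)) * 0.1  # e.g., a 5-class task

def forward(x):
    h = np.tanh(W_shared @ x)        # shared representation
    return W_taskA @ h, W_taskB @ h  # both heads read the same h

x = rng.standard_normal(d_in)
outA, outB = forward(x)
print(outA.shape, outB.shape)  # (3,) (5,)
```

During training, gradients from both task losses flow back into W_shared, which is what regularizes the shared representation.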
p. 38
Multi-task Learning DNN (MTL-DNN)
p. 39
Multilingual MTL-DNN
p. 40
Results: Multilingual MTL-DNN
Lwazi corpus: South African languages
Amount of training data: ~1 hour
p. 42
Speech Synthesis from Text (TTS)
synthesized
waveform
synthesized
spectrogram
“Printing, in the only sense with which we are at present concerned, differs from
most if not from all the arts and crafts represented in the Exhibition.”
p. 43
Vocoder from Perfect Spectrogram
synthesized
waveform
WaveNet
original
spectrogram
p. 44
Google’s TTS
Jonathan Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” ICASSP 2018.
p. 45
WaveNet: Dilated Convolution
DeepMind, 2016
generates one speech sample at a time
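With kernel size 2, each dilated causal layer extends the model’s receptive field by its dilation, which is why stacked dilated convolutions see thousands of past samples cheaply. A quick computation for a WaveNet-style doubling schedule (the exact schedule below is illustrative):

```python
# Receptive field of stacked causal convolutions:
# each layer with dilation d and kernel size k sees (k - 1) * d extra past samples.
def receptive_field(dilations, kernel_size=2):
    return 1 + (kernel_size - 1) * sum(dilations)

# Doubling schedule 1, 2, 4, ..., 512, repeated 3 times
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(dilations))  # 3070 samples
```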
p. 46
WaveNet Performance
p. 47
Multi-lingual Multi-speaker TTS
[Figure: input text (“how are you” in English; “你好嗎”, “How are you?”, in Chinese) plus speaker and language embeddings feed Tacotron 2, which predicts a spectrogram; WaveNet then converts the spectrogram into the audio waveform.]
p. 48
Lipreading: LipNet
p. 49
Lipreading Sentences in the Wild
p. 50
Lipreading Model: Oxford + Google
p. 51
Vid2Speech
p. 52
Vid2Speech: Hebrew U. of Jerusalem
Ariel Ephrat et al., “Improved Speech
Reconstruction from Silent Video,”
ICCV 2017.
p. 53
Other Applications
Automatic assessment of spoken language
Word spotting
Spoken document retrieval
Language identification (LID)
Voice conversion / morphing
Speaker recognition and verification
p. 54
General Tools
Speech-specific
HTK (Steve Young, Mark Gales; Cambridge)
Kaldi (Dan Povey; CU/IBM/Microsoft/JHU)
CNTK (Microsoft)
General neural-net tools
TensorFlow (Google)
PyTorch (Facebook)
Microsoft Cognitive Toolkit (CNTK reborn)
Keras (on top of TensorFlow, CNTK, Theano)
p. 55
Sequence-2-sequence Modeling Tools
Fairseq (Facebook)
on top of PyTorch
ESPnet (JHU)
end-to-end speech processing
needs Chainer, PyTorch, Kaldi
p. 56