
MBA-AI

Speech Technologies

Prof. Brian Mak


Department of Computer Science and Engineering
Automatic Speech Recognition (ASR)

ASR Early Applications
- Telephony applications: automated call centers
  - replace interactive voice response (IVR) systems
- 1992: AT&T automation of operator services
  - 5 phrases: “collect”, “calling card”, “operator”, “person to person”, “third number”
  - >1 billion operator-assisted calls a year
  - saves >US$300 million a year
- 1996: Charles Schwab’s automated retail brokering
  - understands 15,000 names of stocks or funds
  - new accounts up by 41%; 100,000 calls/day
  - cost per call cut from $4-5 to $1
ASR Consumer Applications
- Dictation: IBM ViaVoice, Dragon Dictate, 話咁易
- Education: spoken language learning
ASR New Applications
- Voice search
- Voice typing
- Digital personal assistants
- Voice activation
- Auto-generated captions on YouTube
Statistical ASR
- Word sequence: W = w_1 w_2 ... w_N
- Speech feature sequence: X = x_1 x_2 ... x_T
- Recognized words:

  W* = argmax_W P(W | X) = argmax_W P(X | W) P(W)

  where P(X | W) is the acoustic model and P(W) is the language model.
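A minimal sketch of this decision rule in Python, with made-up log-scores standing in for a real acoustic model and language model:

```python
# Toy illustration of W* = argmax_W P(X|W) P(W), in the log domain.
# The scores below are invented; a real recognizer computes log P(X|W)
# with an acoustic model and log P(W) with a language model.
acoustic_logprob = {                 # log P(X | W) for the observed features X
    "recognize speech": -12.3,
    "wreck a nice beach": -11.9,
}
lm_logprob = {                       # log P(W)
    "recognize speech": -4.1,
    "wreck a nice beach": -9.7,
}

def recognize(candidates):
    # argmax over candidate word sequences of log P(X|W) + log P(W)
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

print(recognize(list(acoustic_logprob)))   # -> "recognize speech"
```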
ASR System

Language Modeling (LM)
- Statistics about co-occurrences of words
  - bigram LM: P(w_i | w_{i-1})
  - trigram LM: P(w_i | w_{i-2}, w_{i-1})
- Example: Stanford’s AI 100 report; count the occurrences of each word, e.g., “AI”:
  - #(AI) = 19, #(AI in) = 3, #(AI technologies) = 2, ...
  ☞ P(in|AI) = 3/19, P(technologies|AI) = 2/19, ...
- Google’s 1B-word LM
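A minimal sketch of bigram estimation by counting, on a toy corpus (the text is invented, not the actual AI 100 report):

```python
from collections import Counter

# Estimate bigram probabilities P(w | prev) = #(prev w) / #(prev)
# by simple counting over a toy word sequence.
corpus = "AI in medicine and AI in education both use AI technologies".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood bigram estimate
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("in", "AI"))            # #(AI in) / #(AI) = 2/3
print(p_bigram("technologies", "AI"))  # #(AI technologies) / #(AI) = 1/3
```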
Example: Spoken Digit Recognition
[Diagram: for each digit “zero” ... “nine”, training data goes through feature extraction and data modeling to give model 0 ... model 9; at recognition time, an unknown utterance goes through feature extraction and is scored against the models, e.g., recognized as “four”.]
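The diagram amounts to a train-one-model-per-digit, pick-the-best-scoring-model scheme. A minimal sketch, with hypothetical “models” (mean feature vectors) and a toy distance-based score instead of trained HMMs:

```python
# Isolated-digit recognition sketch: one model per digit, classify by
# the best-scoring model. The "models" here are hypothetical mean
# feature vectors; a real system trains an HMM or DNN per digit.
models = {
    "zero": [0.1, 0.9],
    "one":  [0.8, 0.2],
    "four": [0.5, 0.5],
}

def score(mean, features):
    # Higher is better: negative squared distance to the model mean.
    return -sum((m - f) ** 2 for m, f in zip(mean, features))

def recognize(features):
    return max(models, key=lambda d: score(models[d], features))

print(recognize([0.55, 0.45]))   # -> "four"
```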
Too Many Words?
- How many words in the English language?
- How many characters in the Chinese language?
Elements: Basic Units in Chemistry

Phonemes: Basic Units in Speech

http://www.englishclub.com/pronunciation/phonemic-chart-ia.htm
Phonemes
- English: 40 - 60 phonemes
- A spoken word is a sequence of phonemes:
  - cat = /k/ /ae/ /t/
- Speech recognition: recognizing an utterance, which is a sequence of words
  => a sequence of sequences of phonemes
  - “a pretty cat” = /ah/ /p/ /r/ /ih/ /t/ /iy/ /k/ /ae/ /t/
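A minimal sketch of the word-to-phoneme mapping, using a tiny hand-written lexicon (real systems use pronunciation dictionaries such as CMUdict):

```python
# Map an utterance (sequence of words) to its phoneme sequence via a
# toy pronunciation lexicon, flattened for display.
lexicon = {
    "a":      ["ah"],
    "pretty": ["p", "r", "ih", "t", "iy"],
    "cat":    ["k", "ae", "t"],
}

def to_phonemes(utterance):
    return [p for word in utterance.split() for p in lexicon[word]]

print(to_phonemes("a pretty cat"))
# ['ah', 'p', 'r', 'ih', 't', 'iy', 'k', 'ae', 't']
```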
Visible Speech

spectrogram
Spectrogram Reading

Speech Visualization

[ https://www.speechandhearing.net/laboratory/wasp/ ]
Difficulties in ASR: High Variability
- Intra-speaker variability
  - age, emotion, physical condition, prosody
- Inter-speaker variability
  - gender, age, pitch, accent, speaking style/rate
- Co-articulatory effects
  - “heed”, “heal”, “real”
  - Triphone: a phoneme with a specific left and right contextual phoneme
  - “heed”: /#-h+iy/ /h-iy+d/ /iy-d+#/
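A minimal sketch of triphone expansion in Python, using ‘#’ for the word boundary as on the slide:

```python
# Expand a phoneme sequence into triphones of the form /left-phone+right/,
# with '#' marking a word boundary.
def to_triphones(phones):
    padded = ["#"] + phones + ["#"]
    return [f"/{padded[i-1]}-{padded[i]}+{padded[i+1]}/"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["h", "iy", "d"]))
# ['/#-h+iy/', '/h-iy+d/', '/iy-d+#/']
```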
Vilfredo Pareto’s 80/20 Principle
- WSJ: 80% of samples come from the most frequent 20% of all seen triphones.
Difficulties in ASR ..
- Language variability/ambiguity
  - Highly confusable sounds: /p/ vs. /b/; “man” vs. “men”
  - Multiple pronunciations: “a”, “the”
  - Multiple meanings: “go”, “bank”, “time flies”
  - Homophones: “two”, “too”; “ice-cream”, “I scream”
- Lack of clear boundaries between acoustic units (e.g., words, syllables, phonemes) in continuous speech
- Variable lengths
Difficulties in ASR: High Variability …
- Channel variability
  - different types of microphones or telephone handsets
  - fixed line, wireless, mobile
  - room acoustics, radio/TV broadcast
- Noise variability
  - different kinds of noise
  - different signal-to-noise ratios (SNR)
Hidden Markov Model in a Nutshell
Box   R   G   B   P(R)   P(G)   P(B)
#1    3   3   5   3/11   3/11   5/11

[Diagram: a single box, Box #1, holding 3 red, 3 green, and 5 blue balls; a ball is drawn at random, so P(R) = 3/11, etc.]
HMM in a Nutshell ..
Box   R   G   B   P(R)   P(G)   P(B)
#1    3   3   5   3/11   3/11   5/11
#2    3   5   2   3/10   5/10   2/10
#3    6   4   3   6/13   4/13   3/13

[Diagram: Boxes #1-#3 connected by transition probabilities; the figure shows the values 0.6, 0.3, 0.1, 0.5, 0.2, 0.4, 0.3, 0.5, 0.1, with each box’s outgoing transition probabilities summing to 1.]
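A minimal sketch of this box-and-ball HMM in Python. The emission probabilities come from the table; the transition matrix is an assumed assignment of the figure’s arc values, since the exact arcs are not recoverable from the slide:

```python
import random

# Box-and-ball HMM: hidden state = which box, observation = ball colour.
emission = {                      # P(colour | box), from the table
    1: {"R": 3/11, "G": 3/11, "B": 5/11},
    2: {"R": 3/10, "G": 5/10, "B": 2/10},
    3: {"R": 6/13, "G": 4/13, "B": 3/13},
}
transition = {                    # P(next box | box) -- assumed arc assignment
    1: {1: 0.6, 2: 0.3, 3: 0.1},
    2: {1: 0.1, 2: 0.5, 3: 0.4},
    3: {1: 0.2, 2: 0.3, 3: 0.5},
}

def sample(n_steps, box=1):
    colours = []
    for _ in range(n_steps):
        colours.append(random.choices(*zip(*emission[box].items()))[0])
        box = random.choices(*zip(*transition[box].items()))[0]
    return colours   # only the colours are observed; the boxes stay hidden

print(sample(10))
```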
Acoustic Modeling: HMM

image source: HTK Book


Gaussian Mixture Model (GMM)

Machine Learning: Mimic The Brain
- Neurons = nerve cells
  - ~10^11 neurons
  - each connects to up to 10,000 other neurons
  - ~10^15 synaptic connections
- Artificial Neural Network (ANN)

[Image: a human neuron]
Deep Learning
- Add many more hidden layers: deep neural networks (DNN)

[Photo: Geoffrey Hinton]
Typical DNN for ASR
- #input units: 500 - 1000
- #hidden layers: 5 - 7
- #hidden units / layer: 1K - 2K
- #output units: 5K - 10K
  - the number of triphone HMM states
- Total #weights: 30M - 100M
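A sketch of a DNN in this size range, written here in PyTorch with one assumed choice of sizes within the quoted ranges:

```python
import torch.nn as nn

# Build a fully-connected acoustic-model DNN. The sizes are one assumed
# choice within the slide's ranges, not a specific published system.
def make_dnn(n_in=720, n_hidden=2048, n_layers=6, n_out=9000):
    layers = [nn.Linear(n_in, n_hidden), nn.Sigmoid()]
    for _ in range(n_layers - 1):
        layers += [nn.Linear(n_hidden, n_hidden), nn.Sigmoid()]
    layers += [nn.Linear(n_hidden, n_out)]   # one output per triphone HMM state
    return nn.Sequential(*layers)

model = make_dnn()
n_weights = sum(p.numel() for p in model.parameters())
print(f"{n_weights / 1e6:.0f}M parameters")   # ~41M, within the 30M - 100M range
```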
Recurrent Neural Network (RNN)
- output value: y(t)
- hidden value: h(t), with h(t-1) the previous hidden value
- input value: x(t)

- feed-forward NN: the output at time t considers only the transformed input at time t
- recurrent NN: the output at time t considers all the past transformed inputs
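A minimal NumPy sketch of the recurrence: the hidden value at time t mixes the current input with the previous hidden value, so every past input influences the output. Sizes are illustrative:

```python
import numpy as np

# Vanilla RNN step: h(t) = f(x(t), h(t-1)).
n_in, n_hidden = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

def run_rnn(xs):
    h = np.zeros(n_hidden)
    for x in xs:                           # xs: sequence of input vectors x(t)
        h = np.tanh(W_xh @ x + W_hh @ h)   # h(t) depends on x(t) and h(t-1)
    return h                               # final hidden value h(T)

print(run_rnn(rng.normal(size=(5, n_in))))
```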
RNN Unfolded in Time
[Diagram: the RNN unfolded over time t = 1 ... T. Inputs x(1) ... x(T) feed hidden units h(1) ... h(T); each hidden unit also receives the previous hidden value h(t-1), and each emits an output y(1) ... y(T).]

- time-unfolded RNN ≈ DNN with history
Long Short-Term Memory (LSTM)

- Complex structure to control what to forget and what “good” old memory to propagate to help the current prediction.
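A minimal usage sketch with PyTorch’s built-in LSTM (sizes are illustrative):

```python
import torch
import torch.nn as nn

# torch.nn.LSTM keeps a cell state alongside the hidden state; its gates
# decide what to forget and what to carry forward across time steps.
lstm = nn.LSTM(input_size=40, hidden_size=512, num_layers=1, batch_first=True)

x = torch.randn(1, 100, 40)        # (batch, time, features), e.g. 100 frames
outputs, (h_T, c_T) = lstm(x)      # outputs: hidden value at every time step
print(outputs.shape)               # torch.Size([1, 100, 512])
```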
Bidirectional LSTM

Zhiyong Cui et al., “Deep Stacked Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction,” CoRR abs/1801.02143, 2018.
Typical BLSTM-RNN for ASR
- #input units: 40+
- #hidden layers: 5
- #hidden units / layer: 512 × 2 (one set per direction)
- #output units: 10K - 30K
  - the number of triphone HMM states
- Total #weights: 30M - 60M
Switchboard Conversational Speech
Switchboard: 309-hr conversations; vocabulary = 30K

ASR Progress

Year     Lab-School / Technique                        Word Error Rate
1992     --                                            > 70.0%
1997     (CMU) GMM + extra 2000-hr data                35.1%
< 2005   GMM + “careful engineering” + extra data      < 20%
2011     Microsoft / DNN                               16.1%
2014     IBM / DNN+CNN                                 10.4%
2015     IBM / maxout+CNN+DNN+NNLM+extra data          8%
2018     IBM / Microsoft complex systems               ~5%
         HUMAN                                         ~5%
Multi-task Learning
- Improve the generalization performance of a learning task by jointly learning multiple related tasks together.
- Requirements:
  - a shared common representation
  - multiple error signals
Multi-task Learning DNN (MTL-DNN)

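Since the MTL-DNN figure itself is not reproduced here, a minimal PyTorch sketch of the idea, with illustrative sizes and an assumed auxiliary task: the hidden layers are the shared common representation, and each task’s output head supplies its own error signal:

```python
import torch
import torch.nn as nn

# Shared hidden layers + one output head per task.
shared = nn.Sequential(nn.Linear(40, 512), nn.ReLU(),
                       nn.Linear(512, 512), nn.ReLU())
head_main = nn.Linear(512, 1000)   # main task, e.g. triphone states
head_aux = nn.Linear(512, 40)      # assumed auxiliary related task

x = torch.randn(8, 40)
h = shared(x)                      # shared common representation
loss = (nn.functional.cross_entropy(head_main(h), torch.randint(1000, (8,)))
        + nn.functional.cross_entropy(head_aux(h), torch.randint(40, (8,))))
loss.backward()                    # multiple error signals update the shared layers
```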
Multilingual MTL-DNN

Results: Multilingual MTL-DNN
- Lwazi corpus: South African languages
- Amount of training data: ~1 hour

Model                  Afrikaans   Sesotho   siSwati
monolingual DNN        9.50%       23.1%     19.8%
multilingual DNN       14.5%       29.5%     20.7%
multilingual MTL-DNN   8.60%       21.5%     18.8%
Sequence-2-Sequence Modeling

Speech Synthesis from Text (TTS)

[Diagram: text → synthesized spectrogram → synthesized waveform]

“Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.”
Vocoder from Perfect Spectrogram

[Diagram: original spectrogram → WaveNet → synthesized waveform]
Google’s TTS

Jonathan Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” ICASSP 2018.
WaveNet: Dilated Convolution
- DeepMind, 2016
- generates one speech sample at a time
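A minimal PyTorch sketch of a dilated causal convolution stack in the WaveNet style (channel counts and depth are illustrative assumptions, not the published architecture):

```python
import torch
import torch.nn as nn

# Stack of dilated causal 1-D convolutions: doubling dilations
# (1, 2, 4, 8) grow the receptive field exponentially with depth.
class DilatedStack(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations)

    def forward(self, x):
        for conv in self.convs:
            # left-pad so the convolution is causal (sees no future samples)
            x = torch.tanh(conv(nn.functional.pad(x, (conv.dilation[0], 0))))
        return x

x = torch.randn(1, 32, 16000)        # one second of 16 kHz "audio" features
print(DilatedStack()(x).shape)       # torch.Size([1, 32, 16000])
```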
WaveNet Performance

Multi-lingual Multi-speaker TTS

[Diagram: input text (“how are you” / “你好嗎”, i.e. “how are you” in Chinese), a speaker embedding, and a language embedding go into Tacotron2, which outputs a spectrogram; WaveNet then converts the spectrogram into the audio waveform.]
Lipreading: LipNet

Lipreading Sentences in the Wild

Lipreading Model: Oxford + Google

Vid2Speech

Vid2Speech: Hebrew U. of Jerusalem
Ariel Ephrat et al., “Improved Speech Reconstruction from Silent Video,” ICCV 2017.
Other Applications
- Automatic assessment of spoken language
- Word spotting
- Spoken document retrieval
- Language identification (LID)
- Voice conversion / morphing
- Speaker recognition and verification
General Tools
- Speech-specific
  - HTK (Steve Young, Mark Gales; Cambridge)
  - Kaldi (Dan Povey; CU/IBM/Microsoft/JHU)
  - CNTK (Microsoft)
- General neural-net tools
  - TensorFlow (Google)
  - PyTorch (Facebook)
  - Microsoft Cognitive Toolkit (CNTK reborn)
  - Keras (on top of TensorFlow, CNTK, Theano)
Sequence-2-sequence Modeling Tools
- Fairseq (Facebook)
  - on top of PyTorch
- ESPnet (JHU)
  - end-2-end speech processing
  - needs Chainer, PyTorch, Kaldi
