Speech Technologies
p. 2
ASR Early Applications
Telephony applications: automated call centers
replacing interactive voice response (IVR) systems
1992: AT&T automation of operator services
• 5 phrases: “collect”, “calling card”, “operator”, “person
to person”, “third number”
• >1 billion operator-assisted calls a year
• saves >US$300 million a year.
1996: Charles Schwab’s automated retail brokering
• understands 15,000 names of stocks or funds
• new accounts up by 41%; 100,000 calls/day;
• cost per call is cut from $4-5 to $1.
p. 3
ASR Consumer Applications
Dictation: IBM (ViaVoice), Dragon Dictate, 話咁易 (Cantonese ViaVoice; lit. “as easy as talking”)
Education: spoken language learning
p. 4
ASR New Applications
Voice search
Voice typing
Digital personal
assistants
Voice activation
Auto-generated
captions on YouTube
p. 5
Statistical ASR
Word sequence: W = w_1 w_2 … w_N
Speech feature sequence: X = x_1 x_2 … x_T
Recognized words: W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
P(X|W): acoustic model
P(W): language model
p. 6
ASR System
p. 7
Language Modeling (LM)
Statistics about co-occurrences of words
bigram LM: P(w_i | w_{i-1})
trigram LM: P(w_i | w_{i-2}, w_{i-1})
Example: Stanford’s AI 100 report: count occurrences of each word, e.g., for ”AI”:
• #(AI) = 19, #(AI in) = 3, #(AI technologies) = 2, …
☞ P(in|AI) = 3/19, P(technologies|AI) = 2/19, …
Google’s 1B-word LM
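The counting scheme above can be sketched in a few lines of Python. The toy corpus below is constructed so the counts match the slide’s numbers; it is not the actual AI 100 text:

```python
from collections import Counter

def train_bigram_lm(tokens):
    """Estimate bigram probabilities P(w2 | w1) from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lambda w2, w1: (bigrams[(w1, w2)] / unigrams[w1]) if unigrams[w1] else 0.0

# Toy corpus: "AI" occurs 19 times, followed by "in" 3 times and "technologies" 2 times
tokens = ["AI", "in"] * 3 + ["AI", "technologies"] * 2 + ["AI", "systems"] * 14
p = train_bigram_lm(tokens)
print(p("in", "AI"))            # 3/19
print(p("technologies", "AI"))  # 2/19
```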
p. 8
Example: Spoken Digit Recognition
[Figure: template-based digit recognizer; an input waveform goes through feature extraction, then recognition against reference templates for “zero”, “one”, …, “nine”; the example utterance is recognized as “four”.]
p. 9
Too Many Words?
How many words in the English language?
p. 10
Elements: Basic Units in Chemistry
p. 11
Phonemes: Basic Units in Speech
http://www.englishclub.com/pronunciation/phonemic-chart-ia.htm
p. 12
Phonemes
English: 40 – 60 phonemes
A spoken word is a sequence of phonemes:
cat = /k/ /ae/ /t/
Speech recognition: to recognize an utterance
which is a sequence of words
=> a sequence of sequences of phonemes
“a pretty cat” = /ah/ /p/ /r/ /ih/ /t/ /iy/ /k/ /ae/ /t/
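A pronunciation lexicon makes this expansion mechanical. The mini-lexicon below is a hypothetical illustration using the slide’s phoneme symbols:

```python
# Hypothetical mini pronunciation lexicon using the slide's phoneme symbols
LEXICON = {
    "a": ["ah"],
    "pretty": ["p", "r", "ih", "t", "iy"],
    "cat": ["k", "ae", "t"],
}

def utterance_to_phonemes(words):
    """Expand a word sequence into the flat phoneme sequence the recognizer models."""
    return [ph for w in words for ph in LEXICON[w]]

print(utterance_to_phonemes(["a", "pretty", "cat"]))
# ['ah', 'p', 'r', 'ih', 't', 'iy', 'k', 'ae', 't']
```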
p. 13
Visible Speech
spectrogram
p. 14
Spectrogram Reading
p. 15
Spectrogram Reading
p. 16
Speech Visualization
[ https://www.speechandhearing.net/laboratory/wasp/ ]
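A spectrogram is computed with a short-time Fourier transform: slice the signal into overlapping frames, window each frame, and take the magnitude of its FFT. A minimal NumPy sketch (frame length and hop are arbitrary choices):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT: overlapping frames, Hann window, FFT per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len//2 + 1)

# A 1 kHz tone sampled at 8 kHz: energy should concentrate in one frequency bin
fs, f0 = 8000, 1000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * f0 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * fs / 256)  # 1000.0 Hz
```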
p. 17
Difficulties in ASR: High Variability
Intra-speaker variability
age, emotion, physical condition, prosody
Inter-speaker variability
gender, age, pitch, accent, speaking style/rate
Co-articulatory effects
“heed”, “heal”, “real”
Triphone: a phoneme with specific left and right
contextual phonemes
“heed”: /#-h+iy/ /h-iy+d/ /iy-d+#/
p. 18
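The triphone expansion can be sketched directly, using “#” for the utterance boundary as in the “heed” example:

```python
def to_triphones(phones):
    """Expand a phoneme sequence into context-dependent triphones /left-centre+right/."""
    padded = ["#"] + phones + ["#"]  # "#" marks the utterance boundary
    return [f"/{padded[i-1]}-{padded[i]}+{padded[i+1]}/"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["h", "iy", "d"]))  # ['/#-h+iy/', '/h-iy+d/', '/iy-d+#/']
```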
Vilfredo Pareto’s 80/20 Principle
WSJ corpus: 80% of samples come from the most
frequent 20% of all seen triphones.
p. 19
Difficulties in ASR (cont.)
Language variability/ambiguity
Highly confusable sounds: /p/ vs. /b/; “man” vs. “men”
Multiple pronunciations: “a”, “the”
Multiple meanings: “go”, “bank”, “time flies”
Homophones: “two”, “too”; “ice-cream”, “I scream”
Lack of clear boundaries between acoustic
units (e.g., words, syllables, phonemes) in
continuous speech.
Variable lengths
p. 20
Difficulties in ASR: High Variability (cont.)
Channel variability
Different types of microphones or telephone
handsets
fixed line, wireless, mobile
room acoustics, radio/TV broadcast
Noise variability
Different kinds of noises
Different signal-to-noise ratios (SNR)
p. 21
Hidden Markov Model in a Nutshell
[Figure: an urn-and-ball example; each box (Box #1, …) holds red (R), green (G), and blue (B) balls, with its own emission probabilities P(R), P(G), P(B).]
p. 22
HMM in a Nutshell (cont.)
[Figure: the boxes become hidden states; arrows labeled with transition probabilities (0.3, 0.1, 0.5, 0.2, 0.4) connect them, while each box keeps its own emission probabilities P(R), P(G), P(B).]
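The box example can be turned into a tiny forward-algorithm computation. All probabilities below are made-up illustrative values, not the (partly unreadable) ones in the slide’s figure:

```python
import numpy as np
from itertools import product

pi = np.array([0.5, 0.5])           # initial box probabilities (illustrative)
A = np.array([[0.7, 0.3],           # A[i, j] = P(next box j | current box i)
              [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1],      # B[i, k] = P(ball colour k | box i), k in {R, G, B}
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Forward algorithm: P(observation sequence), summing over all hidden box paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

obs = [0, 2, 1]  # Red, Blue, Green
print(forward(obs))

# Sanity check: brute-force enumeration of every hidden path gives the same probability
brute = sum(pi[s[0]] * B[s[0], obs[0]] *
            np.prod([A[s[i - 1], s[i]] * B[s[i], obs[i]] for i in range(1, len(obs))])
            for s in product(range(2), repeat=len(obs)))
print(abs(forward(obs) - brute) < 1e-12)  # True
```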
p. 25
Machine Learning: Mimic the Brain
Neurons = nerve cells
~10^11 neurons
Each connects to up to 10,000
other neurons
~10^15 synaptic connections
[Figure: a human neuron]
p. 26
Deep Learning
Add many more hidden layers:
deep neural networks (DNN)
Geoffrey Hinton
p. 27
Typical DNN for ASR
#input units: 500 – 1000
#hidden layers: 5 – 7
#hidden units / layer: 1K – 2K
#output units: 5K – 10K
(= the number of triphone HMM states)
Total #weights: 30M – 100M
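As a sanity check on the quoted total, we can count dense-layer weights (biases ignored) for the largest configuration above:

```python
# Back-of-the-envelope weight count for a fully connected DNN (biases ignored)
def dnn_weights(n_in, n_hidden, n_layers, n_out):
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

# Largest quoted configuration: 1000 inputs, 7 hidden layers of 2000 units, 10K outputs
total = dnn_weights(1000, 2000, 7, 10000)
print(f"{total:,}")  # 46,000,000, inside the quoted 30M - 100M range
```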
p. 28
Recurrent Neural Network (RNN)
[Figure: side-by-side network diagrams with input value x_t, hidden value h_t, and output value y_t]
feed-forward NN: h_t = f(x_t); the output at time t considers only the transformed input at time t.
recurrent NN: h_t = f(x_t, h_{t-1}), where h_{t-1} is the previous hidden value; the output at time t considers all the past transformed inputs.
p. 29
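A single recurrent step can be sketched in NumPy; the weight values below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
Wx = rng.standard_normal((n_hidden, n_in)) * 0.1      # input-to-hidden weights
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One recurrent step: the new hidden state mixes the current input with
    the previous hidden state, so h_t summarizes the whole past."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(n_hidden)
for x_t in rng.standard_normal((5, n_in)):  # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h.shape)  # (8,)
```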
RNN Unfolded in Time
[Figure: RNN unfolded in time; outputs y_1, y_2, y_3, …, y_T, with each step’s hidden unit receiving the previous hidden state h_{t-1}.]
Zhiyong Cui, et al., “Deep Stacked Bidirectional and Unidirectional LSTM Recurrent Neural
Network for Network-wide Traffic Speed Prediction,” CoRR abs/1801.02143, 2018.
p. 32
Typical BLSTM-RNN for ASR
#input units: 40+
#hidden layers: 5
#hidden units / layer: 512 × 2 (bidirectional)
#output units: 10K – 30K
the number of triphone HMM states
Total #weights: 30M – 60M
p. 33
Switchboard Conversational Speech
Switchboard: 309 hours of conversations; vocabulary ≈ 30K words
p. 34
ASR Progress
Year     Lab-School / Technique                        Word Error Rate
1992                                                   > 70.0%
1997     CMU: GMM + extra 2000-hr data                 35.1%
< 2005   GMM + “careful engineering” + extra data      < 20%
2011     Microsoft: DNN                                16.1%
2014     IBM: DNN + CNN                                10.4%
p. 37
Multi-task Learning
Improve the generalization performance of a
learning task by jointly learning multiple
related tasks.
Requirements:
A shared common representation
Multiple error signals
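A minimal sketch of the shared-representation idea: one common hidden layer feeds two hypothetical task heads, each supplying its own error signal. Sizes and weights below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_shared = 10, 32

# One shared hidden layer (the common representation) ...
W_shared = rng.standard_normal((d_shared, d_in)) * 0.1
# ... feeding two task-specific output heads (two error signals in training)
W_taskA = rng.standard_normal((3, d_shared)) * 0.1  # e.g., a 3-class task
W_taskB = rng.standard_normal((5, d_shared)) * 0.1  # e.g., a 5-class task

def forward(x):
    h = np.tanh(W_shared @ x)        # shared representation
    return W_taskA @ h, W_taskB @ h  # both heads read the same h

x = rng.standard_normal(d_in)
outA, outB = forward(x)
print(outA.shape, outB.shape)  # (3,) (5,)
```

During training, gradients from both task losses flow back into W_shared, which is what regularizes the shared representation.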
p. 38
Multi-task Learning DNN (MTL-DNN)
p. 39
Multilingual MTL-DNN
p. 40
Results: Multilingual MTL-DNN
Lwazi corpus: South African languages
Amount of training data: ~1 hour
p. 42
Speech Synthesis from Text (TTS)
synthesized
waveform
synthesized
spectrogram
“Printing, in the only sense with which we are at present concerned, differs from
most if not from all the arts and crafts represented in the Exhibition.”
p. 43
Vocoder from Perfect Spectrogram
synthesized
waveform
WaveNet
original
spectrogram
p. 44
Google’s TTS
Jonathan Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” ICASSP 2018.
p. 45
WaveNet: Dilated Convolution
DeepMind, 2016
generates one speech sample at a time
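With kernel size 2, each dilated causal layer extends the model’s receptive field by its dilation, which is why stacked dilated convolutions see thousands of past samples cheaply. A quick computation for a WaveNet-style doubling schedule (the exact schedule below is illustrative):

```python
# Receptive field of stacked causal convolutions:
# each layer with dilation d and kernel size k sees (k - 1) * d extra past samples.
def receptive_field(dilations, kernel_size=2):
    return 1 + (kernel_size - 1) * sum(dilations)

# Doubling schedule 1, 2, 4, ..., 512, repeated 3 times
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(dilations))  # 3070 samples
```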
p. 46
WaveNet Performance
p. 47
Multi-lingual Multi-speaker TTS
[Figure: input text (“how are you” in English; “你好嗎”, “How are you?”, in Chinese) plus speaker and language embeddings feed Tacotron 2, which predicts a spectrogram; WaveNet then converts the spectrogram into the audio waveform.]
p. 48
Lipreading: LipNet
p. 49
Lipreading Sentences in the Wild
p. 50
Lipreading Model: Oxford + Google
p. 51
Vid2Speech
p. 52
Vid2Speech: Hebrew U. of Jerusalem
Ariel Ephrat et al., “Improved Speech
Reconstruction from Silent Video,”
ICCV 2017.
p. 53
Other Applications
Automatic assessment of spoken language
Word spotting
Spoken document retrieval
Language identification (LID)
Voice conversion / morphing
Speaker recognition and verification
p. 54
General Tools
Speech-specific
HTK (Steve Young, Mark Gales; Cambridge)
Kaldi (Dan Povey; CU/IBM/Microsoft/JHU)
CNTK (Microsoft)
General neural-net tools
TensorFlow (Google)
PyTorch (Facebook)
Microsoft Cognitive Toolkit (CNTK reborn)
Keras (on top of TensorFlow, CNTK, Theano)
p. 55
Sequence-2-sequence Modeling Tools
Fairseq (Facebook)
on top of PyTorch
ESPnet (JHU)
end-to-end speech processing
needs Chainer, PyTorch, Kaldi
p. 56