SPEECH/VISION COGNITIVE SYSTEMS
Dr TIAN Jing
tianjing@nus.edu.sg
S-CS/Speech/vision cognitive systems/V1.0 © 2019 National University of Singapore. All Rights Reserved Page 1 of 48
Module objective
Major reference
Topics
Automatic speech recognition
Automatic speech recognition
Behind the Mic: The Science of Talking with Computers (6 minutes)
https://www.youtube.com/watch?v=yxxRAHVtafI
Language is easy for humans to understand (most of the time), but not so easy for
computers. This video talks about speech recognition, language understanding,
neural nets, and using our voices to communicate with the technology around us.
Automatic speech recognition
• Human-Machine Interaction
  • Automatic Speech Recognition
  • Speech Synthesis / Text-to-Speech (TTS)
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
• Telecommunication
  • Speech Coding
• Language
  • Statistical Machine Translation (SMT)
• Language Acquisition
  • Pronunciation Training
• Security/Forensics
  • Speaker ID
  • Speaker Verification
• Medical Applications
  • Diagnosis of Diseases
• Information Retrieval
  • Video/Audio Transcribing
  • Audio/Text Summarizing
• Speech Manipulation
  • Speaking Rate Adjusting
Speech recognition as a pattern recognition problem
[Figure] An unknown speech signal passes through Feature Extraction to produce a feature vector, then Pattern Matching against stored Reference Patterns, then Decision Making to output the recognized word. During training, speech passes through the same Feature Extraction to build the Reference Patterns.
Feature extraction: MFCC
The Fourier transform decomposes a signal into sinusoids of different frequencies, transforming the view of the signal from the time domain to the frequency domain.
DFT: F(u) = Σ_{x=0}^{N−1} f(x) e^{−2πiux/N}, where u = 0, 1, ⋯, N − 1

Inverse DFT: f(x) = (1/N) Σ_{u=0}^{N−1} F(u) e^{2πiux/N}, where x = 0, 1, ⋯, N − 1

Example
• Signal: f(x) = {2, 3, 4, 4}, so N = 4
• Fourier coefficients of the signal:
  F(0) = Σ f(x) = 2 + 3 + 4 + 4 = 13
  F(1) = 2 + 3e^{−iπ/2} + 4e^{−iπ} + 4e^{−i3π/2} = 2 − 3i − 4 + 4i = −2 + i
  F(2) = 2 + 3e^{−iπ} + 4e^{−i2π} + 4e^{−i3π} = 2 − 3 + 4 − 4 = −1
  F(3) = 2 + 3e^{−i3π/2} + 4e^{−i3π} + 4e^{−i9π/2} = 2 + 3i − 4 − 4i = −2 − i
  so F(u) = {13, −2 + i, −1, −2 − i}
• i is the imaginary unit
Demo: http://www.falstad.com/fourier/
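The worked example can be checked with NumPy's FFT, which uses the same sign convention as the DFT above:

```python
import numpy as np

# Verify the worked example: f(x) = {2, 3, 4, 4}
f = np.array([2, 3, 4, 4], dtype=complex)
F = np.fft.fft(f)  # computes F(u) = sum_x f(x) e^{-2*pi*i*u*x/N}
assert np.allclose(F, [13, -2 + 1j, -1, -2 - 1j])
# The inverse DFT recovers the original signal
assert np.allclose(np.fft.ifft(F), f)
```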
Summary of MFCC
• For each speech frame, compute frequency spectrum and apply Mel filtering
• Compute cepstrum using inverse DFT on the log of the mel-warped spectrum
• 39-dimensional MFCC feature vector: First 12 cepstral coefficients + energy +
13 delta + 13 double-delta coefficients
Reference: Automatic Speech Recognition, https://github.com/ekapolc/ASR_course
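The per-frame recipe above can be sketched from scratch. The 25 ms frame, the 26-filter mel bank, and the DCT-II realization of the "inverse DFT on the log spectrum" are common but illustrative choices, not necessarily this module's exact configuration:

```python
import numpy as np

def hz_to_mel(hz):
    # O'Shaughnessy mel formula used by most MFCC implementations
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=12):
    """Cepstral coefficients for one windowed speech frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2            # frequency spectrum
    # Triangular mel filterbank between 0 Hz and Nyquist
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_pts = 700.0 * (10 ** (mel_pts / 2595.0) - 1.0)  # mel -> Hz
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_mel = np.log(fbank @ power + 1e-10)            # log of mel-warped spectrum
    # The inverse transform is conventionally realized as a DCT-II
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_mel                               # first 12 cepstral coefficients

rate = 16000
t = np.arange(400) / rate                              # one 25 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 440 * t) * np.hamming(400)
coeffs = mfcc_frame(frame, rate)
```

Appending energy, deltas, and double-deltas to these 12 coefficients (plus energy) gives the 39-dimensional vector described above.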
Matching: Dynamic time warping
Dynamic time warping (DTW) is an algorithm for measuring the similarity between two temporal sequences that may vary in speed.
[Figure] A test utterance is aligned against a word template (TEMPLATE (WORD)) along the DTW warping path; recognition combines the acoustic score P(O|W) with the word prior P(W), where each word is built from phoneme models (e.g. "six" = S IH K S in the CMU pronouncing dictionary).
Reference: http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=six
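A minimal sketch of the DTW recurrence, using absolute difference as the local cost (real recognizers use distances between feature vectors, not scalars):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A slowed-down copy of a sequence still aligns with zero cost...
assert dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]) == 0.0
# ...whereas sample-by-sample comparison is not even defined for unequal lengths.
```

The quadratic table D accumulates the cheapest warping path, which is exactly what makes DTW robust to speaking-rate differences.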
Statistical speech recognition
[Figure] The recognizer maps the speech audio A to feature frames O (Feature Extraction), acoustic states Q (Frame Classification), phonemes L, e.g. "t ah m aa t ow" (Sequence Model), words W (Lexicon Model), and finally the sentence (Language Model), combining the factors P(A|O) · P(O|Q) · P(Q|L) · P(L|W) · P(W).
Statistical speech recognition: Frame classification
• The output probability of each state q is modeled by a Gaussian mixture model (GMM):
  P(o | q) = Σ_k w_{q,k} N(o; μ_{q,k}, Σ_{q,k})
[Figure] Each state (1, 2, 3) has its own mixture weights w_{q,k} over component Gaussians N_{q,k}(x).
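As an illustration of the mixture likelihood, a minimal diagonal-covariance GMM evaluation; the weights, means, and variances are made-up toy values, not trained parameters:

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log P(o | q) for one state q with a diagonal-covariance Gaussian mixture."""
    o = np.asarray(o, dtype=float)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # log of the diagonal Gaussian density N(o; mu, var)
        log_n = -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)
        log_terms.append(np.log(w) + log_n)
    # log sum_k w_k N(o; mu_k, var_k), computed stably in the log domain
    return np.logaddexp.reduce(log_terms)

# Two-component mixture over 2-D feature vectors (illustrative parameters)
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
ll_near = gmm_log_likelihood([0.1, -0.2], weights, means, variances)
ll_far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
assert ll_near > ll_far  # observations near a component score higher
```

Working in the log domain avoids the underflow that plagues products of many small frame likelihoods.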
HMM: Idea
• State probability
• State transition probability matrix
• Observation distribution given state
HMM: Model

Notation: Description
Q = q_1, q_2, ⋯, q_N: A set of N states
A = a_11, ⋯, a_ij, ⋯, a_NN: A transition probability matrix A, each a_ij representing the probability of moving from state i to state j, s.t. Σ_{j=1}^{N} a_ij = 1 for all i
O = o_1, o_2, ⋯, o_T: A sequence of T observations, each one drawn from a vocabulary V = v_1, v_2, ⋯, v_V
B = b_i(o_t): A sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
q_0, q_F: A special start state and end (final) state that are not associated with observations, together with transition probabilities a_01, a_02, ⋯, a_0N out of the start state and a_1F, a_2F, ⋯, a_NF into the end state

Reference: Daniel Jurafsky and James H. Martin, "Automatic Speech Recognition" (Chapter 9), Speech and Language Processing, 2nd edition, Pearson, 2008.
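These quantities are enough to score an observation sequence. A minimal sketch of the forward algorithm on a toy discrete-observation HMM (all numbers are illustrative, not from the module):

```python
import numpy as np

# A toy 2-state HMM with discrete observations {0, 1}
pi = np.array([0.6, 0.4])            # start probabilities (a_0j)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # transition matrix (a_ij)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # emission likelihoods b_i(o)

def forward(obs):
    """Likelihood P(O | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]        # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # sum over previous states, then emit
    return alpha.sum()

# Likelihoods over all length-2 observation sequences sum to 1
total = sum(forward([o1, o2]) for o1 in (0, 1) for o2 in (0, 1))
assert abs(total - 1.0) < 1e-12
```

The same alpha recursion, with max in place of sum, gives the Viterbi decoder used at recognition time.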
HMM: Tasks
HMM: Example
Example: an HMM with 2 states. State labels are initialized/estimated during training.
Data: 3 observation sequences of different lengths. All observations are vectors.

Sequence A
Time   1    2    3    4    5    6    7    8    9    10
State  S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9  o_10

Sequence B
Time   1    2    3    4    5    6    7    8    9
State  S2   S2   S1   S1   S2   S2   S2   S2   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9

Sequence C
Time   1    2    3    4    5    6    7    8
State  S1   S2   S1   S1   S1   S2   S2   S2
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8
HMM: Example
• Among the 3 sequences, 2 start with S1 and 1 starts with S2. So π(S1) = 2/3, π(S2) = 1/3
• S1 occurs 11 times in non-terminal positions; it is followed immediately by S1 6 times and by S2 5 times. So P(S1|S1) = 6/11, P(S2|S1) = 5/11
• S2 occurs 13 times in non-terminal positions; it is followed immediately by S1 5 times and by S2 8 times. So P(S1|S2) = 5/13, P(S2|S2) = 8/13
• The emission distributions of S1 and S2 are formulated as Gaussians fitted to their observation vectors.
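The counts quoted above can be checked mechanically from the three labelled state sequences:

```python
from collections import Counter

# The three labelled state sequences A, B, C from the example
seqs = [
    ["S1","S1","S2","S2","S2","S1","S1","S2","S1","S1"],  # A
    ["S2","S2","S1","S1","S2","S2","S2","S2","S1"],        # B
    ["S1","S2","S1","S1","S1","S2","S2","S2"],             # C
]

starts = Counter(s[0] for s in seqs)
trans = Counter((s[t], s[t + 1]) for s in seqs for t in range(len(s) - 1))
out = Counter(s[t] for s in seqs for t in range(len(s) - 1))  # non-terminal counts

pi = {q: starts[q] / len(seqs) for q in ("S1", "S2")}
A = {(i, j): trans[(i, j)] / out[i] for i in ("S1", "S2") for j in ("S1", "S2")}

assert pi["S1"] == 2/3 and pi["S2"] == 1/3
assert A[("S1", "S1")] == 6/11 and A[("S1", "S2")] == 5/11
assert A[("S2", "S1")] == 5/13 and A[("S2", "S2")] == 8/13
```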
Statistical speech recognition:
Language model
N-gram models: build the language model by computing probabilities from a text training corpus: how likely one word is to follow another.
Example: the Berkeley Restaurant Project corpus (9332 sentences, V = 1446 words). [Figure: bigram counts for eight of the words.] Sample sentences:
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
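Using just these six sample sentences, maximum-likelihood bigram estimation looks like this; the `<s>`/`</s>` sentence padding is the standard convention, not something shown on the slide:

```python
from collections import Counter

corpus = [
    "can you tell me about any good cantonese restaurants close by",
    "mid priced thai food is what i'm looking for",
    "tell me about chez panisse",
    "can you give me a listing of the kinds of food that are available",
    "i'm looking for a good place to eat breakfast",
    "when is caffe venezia open during the day",
]

# Count unigrams and bigrams over sentences padded with <s> and </s>
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

# "tell" is followed by "me" in both of its occurrences
assert p_bigram("tell", "me") == 1.0
# two of the six sentences start with "can"
assert p_bigram("<s>", "can") == 2/6
```

On a real corpus these raw counts would be smoothed so that unseen bigrams do not get zero probability.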
Statistical speech recognition:
Summary
Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …
Hidden Markov Model (HMM): [figure: the sub-phone states, e.g. 942 → 942 → 6, chained as an HMM with self-loops]
Deep speech recognition
• A typical speech recognizer uses the GMM-HMM framework; a DNN can replace the GMM for frame classification
• Alternatively, use a DNN to generate good features to feed into the general GMM-HMM framework
• DNNs can also be applied to language modelling
Source: https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a
Vision cognition: Marr theory
Reference
• https://en.wikipedia.org/wiki/David_Marr_(neuroscientist)
• T. Poggio, Afterword, in D. Marr, Vision, The MIT Press, 2010, p. 367
Vision cognition tasks
[Figure] The vision pipeline: image acquisition (application domain) → image enhancement → feature representation & description → object detection and recognition → scene understanding (analytics task).
Reference:
• Mo Zi, 400 B. C., https://en.wikipedia.org/wiki/Camera_obscura
• Picture source, http://www.sohu.com/a/140776287_736731
Vision cognitive system pipeline
[Figure] Image acquisition → image enhancement → image restoration → feature representation & description → object detection and recognition → scene understanding; the early stages form image processing and analysis on the application-domain side, the later stages the analytics task.
Online demo: http://demo.ipol.im/demo/54/
Vision cognitive system pipeline
[Figure] The pipeline again, illustrated with an edge-detection demo (below).
Online demo: http://bigwww.epfl.ch/demo/ip/demos/edgeDetector/
[Figure] The pipeline, illustrating feature representation & description with an affine-SIFT keypoint matching demo (below).
Online demo: http://demo.ipol.im/demo/my_affine_sift/
Vision cognitive system pipeline
[Figure] The pipeline, illustrating object detection and recognition with a face-based age-estimation demo (below).
Reference: https://www.how-old.net
[Figure] The pipeline, ending in scene understanding: describing the content of an image in words ("a picture is worth a thousand words").
Reference
• https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words
• https://www.phrases.org.uk/meanings/a-picture-is-worth-a-thousand-words.html
Speech-vision synchronization
Figure 6: The three modules in the Speech2Vid model. From top to bottom: (i) audio
encoder, (ii) identity encoder with a single still image input, and (iii) image decoder.
/2 refers to the stride of each kernel in a specific layer which is normally of equal
stride in both spatial dimensions except for the Pool2 layer in which we use stride 2
in the timestep dimension (denoted by /2t). The network includes two skip
connections between the identity encoder and the image decoder.
Reference: Joon Son Chung, Amir Jamaludin, Andrew Zisserman, You said that? BMVC 2017, https://arxiv.org/abs/1705.02966
Speech-visual source separation
Fig. 4. Our model’s multi-stream neural network-based architecture: The visual streams take as input thumbnails of detected
faces in each frame in the video, and the audio stream takes as input the video’s soundtrack, containing a mixture of speech and
background noise. The visual streams extract face embeddings for each thumbnail using a pretrained face recognition model,
then learn a visual feature using a dilated convolutional NN. The audio stream first computes the STFT of the input signal to
obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN. A joint, audio-visual
representation is then created by concatenating the learned visual and audio features, and is subsequently further processed
using a bidirectional LSTM and three fully connected layers. The network outputs a complex spectrogram mask for each speaker,
which is multiplied by the noisy input, and converted back to waveforms to obtain an isolated speech signal for each speaker.
Summary
Thank you!
Dr TIAN Jing
Email: tianjing@nus.edu.sg