
SPEECH/VISION COGNITIVE SYSTEMS

Dr TIAN Jing
tianjing@nus.edu.sg


Module objective

Knowledge and understanding
• Understand the fundamentals of speech recognition systems, including statistical acoustic modelling and end-to-end systems using machine learning
• Understand basic concepts of vision cognitive systems

Key skills
• Design, build, implement and evaluate various speech recognition approaches in Python

Major reference

• [Tutorial] Daniel Jurafsky, James H. Martin, Chapter 9 Automatic Speech Recognition, Speech and Language Processing, 2nd Edition, Pearson, 2008.
• [Introduction] Automatic Speech Recognition, https://github.com/ekapolc/ASR_course
• [Comprehensive] CS224S, Spoken Language Processing, http://web.stanford.edu/class/cs224s/
• [Introduction] CS131: Computer Vision: Foundations and Applications, http://vision.stanford.edu/teaching/cs131_fall1718/syllabus.html
• [Comprehensive] Computer Vision Crash Course, Jia-Bin Huang, https://filebox.ece.vt.edu/~jbhuang/


Topics

• Speech recognition systems
• Vision cognition systems
• Workshop: Design and build a speech recognition system in Python

Automatic speech recognition

Milestones in voice assistants and conversational agents:
• Apple Siri (2011)
• Microsoft Cortana (2014)
• Amazon Echo, Facebook M, Slack Bot API (2015)
• Google Home, Google Assistant, Anki Cozmo (2016)
Source: CS224S Spoken Language Processing, http://web.stanford.edu/class/cs224s/


Automatic speech recognition

More on the history of automatic speech recognition can be found at
• https://ileriseviye.wordpress.com/2011/02/17/speech-recognition-in-1920s-radio-rex-the-first-speech-recognition-machine/
• https://machinelearning-blog.com/2018/09/07/a-brief-history-of-asr-automatic-speech-recognition/
Source: Automatic Speech Recognition (CS753), Lecture 1: Introduction to Statistical Speech Recognition, https://www.cse.iitb.ac.in/~pjyothi/cs753/

Automatic speech recognition
Behind the Mic: The Science of Talking with Computers (6 minutes)
https://www.youtube.com/watch?v=yxxRAHVtafI
Language is easy for humans to understand (most of the time), but not so easy for
computers. This video talks about speech recognition, language understanding,
neural nets, and using our voices to communicate with the technology around us.


Automatic speech recognition


Behind the Mic: The Science of Talking with Computers

Video (MM:SS)  Topics
0:00–0:46      • Speech Communication
0:48–3:00      • Automatic Speech Recognition
               • The need for "Natural Language Understanding"
3:01–4:16      • Natural Language Understanding is hard (and we are not sure how it works)
4:16–6:10      • Neural Networks can be (are) the solution (?)

Automatic speech recognition
• Human-machine Interaction
  • Automatic Speech Recognition
  • Speech Synthesis / Text-to-Speech (TTS)
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
• Telecommunication
  • Speech Coding
• Language
  • Statistical Machine Translation (SMT)
• Language Acquisition
  • Pronunciation Training
• Security/Forensics
  • Speaker ID
  • Speaker Verification
• Medical Applications
  • Diagnosis of Diseases
• Information Retrieval
  • Video/Audio Transcribing
  • Audio/Text Summarizing
• Speech Manipulation
  • Speaking Rate Adjusting


Automatic speech recognition


Challenges of speech recognition
• Style: Read speech or spontaneous (conversational) speech?
• Continuous natural speech or command & control?
• Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word
• Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers
• Task specifics: Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

Speech recognition as a pattern recognition problem

Recognition: unknown speech signal → Feature Extraction → feature vector → Pattern Matching → Decision Making → output word
Training: training speech → Feature Extraction → Reference Patterns (used during Pattern Matching)

The speech signal is represented by time-domain and frequency-domain features, e.g., Mel frequency cepstral coefficients (MFCC).

Feature extraction: MFCC

• Preemphasis: Boost the amount of energy in the high frequencies by filtering the signal.
• Windowing: Extract features from a small window of the signal, called a frame, rather than from the entire signal; the key parameters are the frame size and the frame offset (shift). A minimal sketch of both steps follows below.

Reference: Daniel Jurafsky, James H. Martin, Chapter 9 Automatic Speech Recognition, Speech and Language Processing, 2nd Edition, Pearson, 2008.
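
A minimal NumPy sketch of these two steps (the 25 ms frame size, 10 ms shift and alpha = 0.97 are typical defaults assumed here, not values prescribed by the text):

import numpy as np

def preemphasis(signal, alpha=0.97):
    """Boost high-frequency energy: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Slice a signal (assumed longer than one frame) into overlapping
    frames and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift  # full frames only
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)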

Feature extraction: MFCC
Fourier transform decomposes a signal into sinusoids of different frequencies, transforming the view of the signal from the time domain to the frequency domain.

Forward DFT: $F(u) = \sum_{x=0}^{N-1} f(x)\, e^{-i 2\pi u x / N}$, where $u = 0, 1, \cdots, N-1$

Inverse DFT: $f(x) = \frac{1}{N} \sum_{u=0}^{N-1} F(u)\, e^{i 2\pi u x / N}$, where $x = 0, 1, \cdots, N-1$

Example
• Signal ($N = 4$): $f(x) = \{2, 3, 4, 4\}$
• Fourier coefficients of the signal:
  $F(0) = 2 + 3 + 4 + 4 = 13$
  $F(1) = 2 + 3e^{-i\pi/2} + 4e^{-i\pi} + 4e^{-i3\pi/2} = -2 + i$
  $F(2) = 2 + 3e^{-i\pi} + 4e^{-i2\pi} + 4e^{-i3\pi} = -1$
  $F(3) = 2 + 3e^{-i3\pi/2} + 4e^{-i3\pi} + 4e^{-i9\pi/2} = -2 - i$
  so $F(u) = \{13,\ -2+i,\ -1,\ -2-i\}$, where $i$ is the imaginary unit
Demo: http://www.falstad.com/fourier/
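
As a quick check of the example above, NumPy's FFT implements the same DFT convention (a minimal sketch):

import numpy as np

# DFT of the example signal f(x) = {2, 3, 4, 4}
F = np.fft.fft([2, 3, 4, 4])
print(F)                     # [13.+0.j  -2.+1.j  -1.+0.j  -2.-1.j]

# The inverse DFT recovers the original signal
print(np.fft.ifft(F).real)   # [2. 3. 4. 4.]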

Feature extraction: MFCC


Understand the Fourier transform as a vector projection.

Summary of MFCC (a minimal computation sketch follows below)
• For each speech frame, compute the frequency spectrum and apply Mel filtering
• Compute the cepstrum using an inverse DFT on the log of the mel-warped spectrum
• 39-dimensional MFCC feature vector: first 12 cepstral coefficients + energy + 13 delta + 13 double-delta coefficients
Reference: Automatic Speech Recognition, https://github.com/ekapolc/ASR_course
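
A minimal sketch of the 39-dimensional feature computation, assuming the librosa library is available and a local file speech.wav (a hypothetical filename); librosa's defaults differ in detail from the textbook recipe, but the structure is the same:

import librosa
import numpy as np

# Load audio (librosa resamples to 22050 Hz by default)
y, sr = librosa.load("speech.wav")

# 13 base coefficients per frame (the 0th coefficient plays the role of energy)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Delta and double-delta (acceleration) coefficients
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# 39-dimensional feature vector per frame
features = np.vstack([mfcc, delta, delta2])   # shape: (39, n_frames)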

Matching: Dynamic time warping
Dynamic time warping (DTW) is one of the algorithms for measuring
similarity between two temporal sequences, which may vary in speed.

[Figure: DTW alignment path between a stored template (word) and an unknown word.]

Reference: H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 26, No. 1, 1978, pp. 43–49.
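
A minimal dynamic-programming sketch of DTW, assuming per-frame feature vectors and a Euclidean local distance (real systems add path constraints such as the Sakoe-Chiba band):

import numpy as np

def dtw_distance(x, y):
    """DTW distance between sequences x (n, d) and y (m, d) of feature vectors."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # local distance
            # Extend the cheapest of the three allowed predecessor paths
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]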

Statistical speech recognition

• Let $\mathbf{O} = o_1, o_2, \cdots, o_T$ represent a sequence of acoustic feature observations, and let $\mathbf{W}$ denote a word sequence. The speech recognizer then decodes the most likely word sequence $\mathbf{W}^*$ as

$$\mathbf{W}^* = \arg\max_{\mathbf{W}} P(\mathbf{W}|\mathbf{O}) = \arg\max_{\mathbf{W}} \frac{P(\mathbf{O}|\mathbf{W})\, P(\mathbf{W})}{P(\mathbf{O})} = \arg\max_{\mathbf{W}} P(\mathbf{O}|\mathbf{W})\, P(\mathbf{W})$$

where $P(\mathbf{O}|\mathbf{W})$ is the acoustic model and $P(\mathbf{W})$ is the language model.

Reference: CS753, Automatic Speech Recognition, https://www.cse.iitb.ac.in/~pjyothi/cs753/
Statistical speech recognition

• Further introduce three definitions: the acoustic signal $\mathbf{A}$, the phoneme sequence $\mathbf{L}$, and the state sequence $\mathbf{Q}$. The optimization problem is then decomposed as

$$\mathbf{W}^* = \arg\max_{\mathbf{W}} P(\mathbf{O}|\mathbf{W})\, P(\mathbf{W}) = \arg\max_{\mathbf{W}} P(\mathbf{A}|\mathbf{O})\, P(\mathbf{O}|\mathbf{Q})\, P(\mathbf{Q}|\mathbf{L})\, P(\mathbf{L}|\mathbf{W})\, P(\mathbf{W})$$

[Figure: phoneme model for the word "six"; each phoneme has 3 HMM states. The acoustic model terms are studied in detail in the following slides.]

Reference: http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=six

Statistical speech recognition


$$\mathbf{W}^* = \arg\max_{\mathbf{W}} P(\mathbf{O}|\mathbf{W})\, P(\mathbf{W}) = \arg\max_{\mathbf{W}} P(\mathbf{A}|\mathbf{O})\, P(\mathbf{O}|\mathbf{Q})\, P(\mathbf{Q}|\mathbf{L})\, P(\mathbf{L}|\mathbf{W})\, P(\mathbf{W})$$

Pipeline: Speech audio ($\mathbf{A}$) → Feature Extraction $P(\mathbf{A}|\mathbf{O})$ → Feature frames ($\mathbf{O}$) → Frame Classification $P(\mathbf{O}|\mathbf{Q})$ → States ($\mathbf{Q}$) → Sequence Model $P(\mathbf{Q}|\mathbf{L})$ → Phonemes ($\mathbf{L}$, e.g., "t ah m aa t ow") → Lexicon Model $P(\mathbf{L}|\mathbf{W})$ → Words ($\mathbf{W}$) → Language Model $P(\mathbf{W})$ → Sentence
Statistical speech recognition: Frame classification
• The output probability of each state $\mathbf{q}$ is modeled by a Gaussian mixture model (GMM), as in the sketch below:

$$P(\mathbf{o}|\mathbf{q}) = \sum_{m=1}^{M} w_{\mathbf{q},m}\, N(\mathbf{o};\, \boldsymbol{\mu}_{\mathbf{q},m},\, \boldsymbol{\Sigma}_{\mathbf{q},m})$$

[Figure: three HMM states, each with its own $M$-component mixture of weights $w_{j,m}$ and Gaussians $N_{jm}(x)$, for state $j$ and mixture component $m$.]

Reference: Deep learning for audio, CS 598 LAZ: Cutting-Edge Trends in Deep Learning and Recognition, http://slazebni.cs.illinois.edu/spring17/
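
A minimal sketch of per-state emission modelling with scikit-learn (assumed available): fit one diagonal-covariance GMM per state from the frames aligned to that state, then score new frames.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_state_gmms(frames_per_state, n_components=8):
    """frames_per_state: dict mapping state id -> (n_frames, 39) feature array.
    Returns one fitted diagonal-covariance GMM per HMM state."""
    return {state: GaussianMixture(n_components=n_components,
                                   covariance_type="diag").fit(frames)
            for state, frames in frames_per_state.items()}

def log_emission(gmms, state, frame):
    """log P(o | q): log-likelihood of one frame under the state's GMM."""
    return gmms[state].score_samples(frame.reshape(1, -1))[0]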


Statistical speech recognition: Acoustic model
• Goal: Model the likelihood of sounds given spectral features, pronunciation models, and prior context
• Usually represented as a Hidden Markov Model (HMM)
• States represent phones or other subword units
• Transition probabilities on states: how likely is it to see one sound after seeing another?
• Observation/output likelihoods: how likely is a spectral feature vector to be observed from a phone state, given the state in the previous timestep?

HMM: Idea

[Figures: an HMM for weather and an HMM for a word]

An HMM is specified by:
• Initial state probabilities
• A state transition probability matrix
• An observation distribution given each state

Reference: CS224S, Spoken Language Processing, http://web.stanford.edu/class/cs224s/


HMM: Model
Notation    Description
$\mathbf{Q} = q_1, q_2, \cdots, q_N$    A set of $N$ states
$\mathbf{A} = a_{11}, \cdots, a_{ij}, \cdots, a_{NN}$    A transition probability matrix $\mathbf{A}$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$
$\mathbf{O} = o_1, o_2, \cdots, o_T$    A sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \cdots, v_V$
$B = b_i(o_t)$    A sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
$q_0, q_F$    A special start state and end (final) state that are not associated with observations, together with transition probabilities $a_{01}, a_{02}, \cdots, a_{0N}$ out of the start state and $a_{1F}, a_{2F}, \cdots, a_{NF}$ into the end state

Reference: Daniel Jurafsky, James H. Martin, Chapter 9 Automatic Speech Recognition, Speech and Language Processing, 2nd Edition, Pearson, 2008.
HMM: Tasks

• Evaluation: Given the observation sequence $\mathbf{O} = o_1, o_2, \cdots, o_T$ and an HMM model, how do we efficiently compute the probability of the observation sequence, given the model?
• Decoding: Given the observation sequence $\mathbf{O} = o_1, o_2, \cdots, o_T$ and an HMM model, how do we choose a corresponding state sequence $\mathbf{Q} = q_1, q_2, \cdots, q_T$ that is optimal in some sense (i.e., best explains the observations)?
• Learning: How do we train the model parameters?


HMM: Example

Example: HMM model (2 states). State labels are initialized/estimated (during training).
Data: 3 observation sequences with different lengths. All observations are vectors.

Sequence A
Time   1    2    3    4    5    6    7    8    9    10
State  S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9  o_10

Sequence B
Time   1    2    3    4    5    6    7    8    9
State  S2   S2   S1   S1   S2   S2   S2   S2   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9

Sequence C
Time   1    2    3    4    5    6    7    8
State  S1   S2   S1   S1   S1   S2   S2   S2
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8
HMM: Example
• Among the 3 sequences, 2 start with S1 and 1 starts with S2, so $\pi(S1) = 2/3$ and $\pi(S2) = 1/3$
• S1 occurs 11 times in non-terminal locations; it is followed immediately by S1 6 times and by S2 5 times, so $P(S1|S1) = 6/11$ and $P(S2|S1) = 5/11$
• S2 occurs 13 times in non-terminal locations; it is followed immediately by S1 5 times and by S2 8 times, so $P(S1|S2) = 5/13$ and $P(S2|S2) = 8/13$
• Formulate the observation distributions of S1 and S2 as Gaussians fitted to their observation vectors
(Sequences A, B and C as on the previous slide; a counting sketch follows below.)
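
A minimal sketch of this supervised parameter estimation, assuming the state sequences above are given as Python lists:

from collections import Counter

seqs = [
    ["S1","S1","S2","S2","S2","S1","S1","S2","S1","S1"],   # sequence A
    ["S2","S2","S1","S1","S2","S2","S2","S2","S1"],        # sequence B
    ["S1","S2","S1","S1","S1","S2","S2","S2"],             # sequence C
]

# Initial-state probabilities from the sequence starts
starts = Counter(s[0] for s in seqs)
pi = {state: c / len(seqs) for state, c in starts.items()}

# Transition probabilities from bigram counts over non-terminal positions
trans = Counter((s[t], s[t + 1]) for s in seqs for t in range(len(s) - 1))
occ = Counter(s[t] for s in seqs for t in range(len(s) - 1))
P = {(i, j): c / occ[i] for (i, j), c in trans.items()}

print(pi)                                 # {'S1': 0.667, 'S2': 0.333}
print(P[("S1", "S1")], P[("S2", "S1")])   # 6/11 and 5/13, as above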

Statistical speech recognition: Lexicon model
• Lexical modelling forms the bridge between the acoustic and language models
• Each word is listed with one or more pronunciations in terms of phones
• CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Deterministic model
Word       Pronunciation
TOMATO     t ah m aa t ow
TOMATO     t ah m ey t ow
COVERAGE   k ah v er ah jh
COVERAGE   k ah v r ah jh

Probabilistic model
Word       Pronunciation      Probability
TOMATO     t ah m aa t ow     0.45
TOMATO     t ah m ey t ow     0.55
COVERAGE   k ah v er ah jh    0.65
COVERAGE   k ah v r ah jh     0.35
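
A probabilistic lexicon is essentially a mapping from words to weighted pronunciation variants; a minimal sketch (a hypothetical in-memory structure, not a standard toolkit format):

lexicon = {
    # Each word maps to a list of (phone sequence, probability) pairs
    "TOMATO": [("t ah m aa t ow".split(), 0.45),
               ("t ah m ey t ow".split(), 0.55)],
    "COVERAGE": [("k ah v er ah jh".split(), 0.65),
                 ("k ah v r ah jh".split(), 0.35)],
}

def pronunciations(word):
    """Return the pronunciation variants; probabilities sum to 1 per word."""
    return lexicon[word.upper()]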

Statistical speech recognition: Language model

N-gram models: Build the language model by calculating probabilities from a text training corpus: how likely is one word to follow another?

Example: bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences [counts table not reproduced]. Sample sentences from the corpus:
  can you tell me about any good cantonese restaurants close by
  mid priced thai food is what i'm looking for
  tell me about chez panisse
  can you give me a listing of the kinds of food that are available
  i'm looking for a good place to eat breakfast
  when is caffe venezia open during the day
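
A minimal bigram model sketch over a toy corpus (maximum-likelihood estimates; a real model would also smooth unseen bigrams):

from collections import Counter

corpus = [
    "can you tell me about any good cantonese restaurants close by",
    "tell me about chez panisse",
    "i'm looking for a good place to eat breakfast",
]

# Count unigrams and bigrams, with a start-of-sentence marker
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("tell", "me"))   # 1.0: in this toy corpus "tell" is always followed by "me"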

Statistical speech recognition: Optimization and search
• Find the best hypothesis $P(\mathbf{O}|\mathbf{W})\, P(\mathbf{W})$ given
  • A sequence of acoustic feature vectors ($\mathbf{O}$)
  • A trained HMM (acoustic model)
  • Lexicon (pronunciation model)
  • Probabilities of word sequences (language model)
• Optimization and search
  • Dynamic programming (see the Viterbi sketch below)
  • Calculate the most likely state sequence in the HMM given the transitions and observations
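
A minimal Viterbi sketch in log space, assuming the model is given as plain dictionaries (pi for initial and trans for transition probabilities, all nonzero) and a log_emission(state, o) function such as the GMM scorer sketched earlier:

import math

def viterbi(obs, states, pi, trans, log_emission):
    """Most likely state path for the observation sequence obs."""
    # Initialization with the first observation
    V = [{s: math.log(pi[s]) + log_emission(s, obs[0]) for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # Best predecessor for state s at this time step
            best = max(states, key=lambda r: prev[r] + math.log(trans[(r, s)]))
            col[s] = prev[best] + math.log(trans[(best, s)]) + log_emission(s, o)
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))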

Statistical speech recognition: Summary

Transcription:   Samson
Pronunciation:   S – AE – M – S – AH – N
Sub-phones:      942 – 6 – 37 – 8006 – 4422 …
Hidden Markov Model (HMM): sub-phone state sequence (e.g., 942, 942, 6, …)
Acoustic model:  GMM model $P(\mathbf{O}|\mathbf{Q})$, where $\mathbf{O}$ is the input features and $\mathbf{Q}$ is the HMM state
Audio input:     features extracted frame by frame

Statistical speech recognition: Data

• Collect corpora appropriate for the recognition task
  • Small speech corpus + phonetic transcription, to associate sounds with symbols (acoustic model)
  • Large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (acoustic model)
  • Very large text corpus to estimate $N$-gram probabilities or build a grammar (language model)

Deep speech recognition
• A typical speech recognizer uses the GMM-HMM framework. Deep learning can be applied at several points:
  • Replace frame classification: use a DNN instead of the GMM to classify frames
  • Use the DNN to generate good features to feed into the general GMM-HMM framework
  • Language modelling
Source: https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a


Evaluation of speech recognition

• Word Error Rate (WER), based on minimum edit distance: the distance in words between the system output and the reference transcription (truth)

$$WER = \frac{S + D + I}{N}$$

where
• $S$ is the number of substitutions,
• $D$ is the number of deletions,
• $I$ is the number of insertions, and
• $N$ is the number of words in the reference

Truth: What a bright day System: What a day


Deletion: "Bright" was deleted by the system
Truth: What a day System: What a bright day
Insertion: "Bright" was inserted by the system
Truth: What a bright day System: What a light day
Substitution: "Bright" was substituted by "light" by the system
Reference: https://martin-thoma.com/word-error-rate-calculation/
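
A minimal WER sketch via dynamic-programming edit distance over words (assuming pre-tokenized, case-normalized strings):

import numpy as np

def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deletions
    d[0, :] = np.arange(len(hyp) + 1)   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + sub)   # substitution / match
    return d[len(ref), len(hyp)] / len(ref)

print(wer("what a bright day", "what a day"))         # 0.25 (one deletion)
print(wer("what a bright day", "what a light day"))   # 0.25 (one substitution)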

Vision cognition: Marr theory

Reference
• https://en.wikipedia.org/wiki/David_Marr_(neuroscientist)
• T. Poggio, Vision (The MIT Press, 2010), Afterword, p. 367

Vision cognition: Low level

• Photo manipulation: size, color, exposure
• Feature extraction: edges, oriented gradients, segments
• Image <-> Image: panoramas
• Image <-> World: multi-view stereo, structure from motion
• Image <-> Time: optical flow, time lapse
• Image <-> Semantics: image classification, object detection, segmentation
• Applications: retrieval, robots

Reference: Ancient secret of computer vision, https://courses.cs.washington.edu/courses/cse455/18sp/

Vision cognition tasks

Reference: Connelly Barnes, CS 4501: Introduction to Computer Vision, http://www.connellybarnes.com/work/class/2017/intro_vision/index.html


Vision cognitive system pipeline

Application domain: Image acquisition → Image enhancement → Image restoration
Analytics task: Image processing and analysis → Feature representation & description → Object detection and recognition → Scene understanding

Reference
• Mo Zi, 400 B.C., https://en.wikipedia.org/wiki/Camera_obscura
• Picture source: http://www.sohu.com/a/140776287_736731
Vision cognitive system pipeline (as above)
Online demo: http://ipolcore.ipol.im/demo/clientApp/demo.html?id=230

Vision cognitive system pipeline (as above)
Online demo: http://demo.ipol.im/demo/54/
Vision cognitive system pipeline (as above)
Online demo: http://bigwww.epfl.ch/demo/ip/demos/edgeDetector/
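
Edge detection is a classic feature-extraction step in this pipeline; a minimal OpenCV sketch (assuming opencv-python is installed and a local image input.jpg, a hypothetical filename; the EPFL demo above uses its own detector, not necessarily Canny):

import cv2

# Read the image in grayscale: edges are computed on intensity
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Smooth first to suppress noise-induced edges
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Canny edge detector with lower/upper hysteresis thresholds
edges = cv2.Canny(blurred, threshold1=100, threshold2=200)

cv2.imwrite("edges.png", edges)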


Vision cognitive system pipeline (as above)
Online demo (affine SIFT feature matching): http://demo.ipol.im/demo/my_affine_sift/
Vision cognitive system pipeline (as above)
Reference: https://www.how-old.net

Vision cognitive system pipeline (as above)
Reference
• https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words
• https://www.phrases.org.uk/meanings/a-picture-is-worth-a-thousand-words.html
Speech-vision synchronization

Reference: J. S. Chung and A. Zisserman, Out of time: automated lip sync in the wild, Asian Conference on Computer Vision, Workshop on Multi-view Lip-reading, 2016, http://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/


Speech-based video generation

Figure 6: The three modules in the Speech2Vid model. From top to bottom: (i) audio
encoder, (ii) identity encoder with a single still image input, and (iii) image decoder.
/2 refers to the stride of each kernel in a specific layer which is normally of equal
stride in both spatial dimensions except for the Pool2 layer in which we use stride 2
in the timestep dimension (denoted by /2t). The network includes two skip
connections between the identity encoder and the image decoder.

Reference: Joon Son Chung, Amir Jamaludin, Andrew Zisserman, You said that? BMVC 2017, https://arxiv.org/abs/1705.02966

Speech-visual source separation

Fig. 4. Our model’s multi-stream neural network-based architecture: The visual streams take as input thumbnails of detected
faces in each frame in the video, and the audio stream takes as input the video’s soundtrack, containing a mixture of speech and
background noise. The visual streams extract face embeddings for each thumbnail using a pretrained face recognition model,
then learn a visual feature using a dilated convolutional NN. The audio stream first computes the STFT of the input signal to
obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN. A joint, audio-visual
representation is then created by concatenating the learned visual and audio features, and is subsequently further processed
using a bidirectional LSTM and three fully connected layers. The network outputs a complex spectrogram mask for each speaker,
which is multiplied by the noisy input, and converted back to waveforms to obtain an isolated speech signal for each speaker.

Reference: A. Ephrat et al., Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, SIGGRAPH 2018, https://arxiv.org/abs/1804.03619


Workshop on speech cognitive systems

Exercise
• Build an HMM-based speech recognizer (a minimal sketch follows below)
• Build a speech recognizer using the SpeechRecognition library

Reference
• Prateek Joshi, Python Machine Learning Cookbook, Packt Publishing, 2016. Code available at https://github.com/PacktPublishing/Python-Machine-Learning-Cookbook
• https://github.com/realpython/python-speech-recognition
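
A minimal sketch of an isolated-word HMM recognizer in the spirit of the cookbook reference, assuming the hmmlearn library is installed: train one Gaussian HMM per word on MFCC features (e.g., from the librosa sketch earlier, transposed to frames-by-coefficients), then pick the word whose model scores an utterance highest.

import numpy as np
from hmmlearn import hmm

def train_word_models(train_data, n_states=4):
    """train_data: dict word -> list of (n_frames, n_coeffs) MFCC arrays."""
    models = {}
    for word, utterances in train_data.items():
        X = np.vstack(utterances)                # stack all training utterances
        lengths = [len(u) for u in utterances]   # per-utterance frame counts
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=100)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(models, features):
    """Return the word whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(features))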

Summary

• Speech cognitive systems


• Vision cognitive systems
• Workshop on building a speech recognition system


Thank you!
Dr TIAN Jing
Email: tianjing@nus.edu.sg

