SPEECH/VISION COGNITIVE SYSTEMS
Dr TIAN Jing
tianjing@nus.edu.sg
S-CS/Speech/vision cognitive systems/V1.0 © 2019 National University of Singapore. All Rights Reserved Page 1 of 48
Module objective
Major reference
Topics
Automatic speech recognition
Automatic speech recognition
Behind the Mic: The Science of Talking with Computers (6 minutes)
https://www.youtube.com/watch?v=yxxRAHVtafI
Language is easy for humans to understand (most of the time), but not so easy for
computers. This video talks about speech recognition, language understanding,
neural nets, and using our voices to communicate with the technology around us.
Automatic speech recognition
• Human-Machine Interaction
  • Automatic Speech Recognition
  • Speech Synthesis / Text-to-Speech (TTS)
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
• Telecommunication
  • Speech Coding
• Language
  • Statistical Machine Translation (SMT)
• Language Acquisition
  • Pronunciation Training
• Security/Forensics
  • Speaker ID
  • Speaker Verification
• Medical Applications
  • Diagnosis of Diseases
• Information Retrieval
  • Video/Audio Transcribing
  • Audio/Text Summarizing
• Speech Manipulation
  • Speaking Rate Adjusting
Speech recognition as a pattern recognition problem
[Figure] An unknown speech signal passes through Feature Extraction to produce a feature vector, then Pattern Matching against stored Reference Patterns, then Decision Making to output the recognized word. During training, speech passes through the same Feature Extraction to build the Reference Patterns.
Feature extraction: MFCC
The Fourier transform decomposes a signal into sinusoids of different frequencies, transforming the view of the signal from the time domain to the frequency domain.
DFT: F(u) = Σ_{x=0}^{N−1} f(x) e^{−2πiux/N}, where u = 0, 1, ⋯, N − 1

Inverse DFT: f(x) = (1/N) Σ_{u=0}^{N−1} F(u) e^{2πiux/N}, where x = 0, 1, ⋯, N − 1

Example
• Signal: f(x) = {2, 3, 4, 4}, so N = 4
• Fourier coefficients of the signal:
  F(0) = Σ f(x) = 2 + 3 + 4 + 4 = 13
  F(1) = 2 + 3e^{−iπ/2} + 4e^{−iπ} + 4e^{−i3π/2} = 2 − 3i − 4 + 4i = −2 + i
  F(2) = 2 + 3e^{−iπ} + 4e^{−i2π} + 4e^{−i3π} = 2 − 3 + 4 − 4 = −1
  F(3) = 2 + 3e^{−i3π/2} + 4e^{−i3π} + 4e^{−i9π/2} = 2 + 3i − 4 − 4i = −2 − i
  so F(u) = {13, −2 + i, −1, −2 − i}
• i is the imaginary unit
Demo: http://www.falstad.com/fourier/
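The worked example can be checked with NumPy's FFT, which uses the same sign convention as the DFT above:

```python
import numpy as np

# Verify the worked example: f(x) = {2, 3, 4, 4}
f = np.array([2, 3, 4, 4], dtype=complex)
F = np.fft.fft(f)  # computes F(u) = sum_x f(x) e^{-2*pi*i*u*x/N}
assert np.allclose(F, [13, -2 + 1j, -1, -2 - 1j])
# The inverse DFT recovers the original signal
assert np.allclose(np.fft.ifft(F), f)
```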
Summary of MFCC
• For each speech frame, compute frequency spectrum and apply Mel filtering
• Compute cepstrum using inverse DFT on the log of the mel-warped spectrum
• 39-dimensional MFCC feature vector: First 12 cepstral coefficients + energy +
13 delta + 13 double-delta coefficients
Reference: Automatic Speech Recognition, https://github.com/ekapolc/ASR_course
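The per-frame recipe above can be sketched from scratch. The 25 ms frame, the 26-filter mel bank, and the DCT-II realization of the "inverse DFT on the log spectrum" are common but illustrative choices, not necessarily this module's exact configuration:

```python
import numpy as np

def hz_to_mel(hz):
    # O'Shaughnessy mel formula used by most MFCC implementations
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=12):
    """Cepstral coefficients for one windowed speech frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2            # frequency spectrum
    # Triangular mel filterbank between 0 Hz and Nyquist
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_pts = 700.0 * (10 ** (mel_pts / 2595.0) - 1.0)  # mel -> Hz
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_mel = np.log(fbank @ power + 1e-10)            # log of mel-warped spectrum
    # The inverse transform is conventionally realized as a DCT-II
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_mel                               # first 12 cepstral coefficients

rate = 16000
t = np.arange(400) / rate                              # one 25 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 440 * t) * np.hamming(400)
coeffs = mfcc_frame(frame, rate)
```

Appending energy, deltas, and double-deltas to these 12 coefficients (plus energy) gives the 39-dimensional vector described above.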
Matching: Dynamic time warping
Dynamic time warping (DTW) is an algorithm for measuring the similarity between two temporal sequences that may vary in speed.
[Figure] A test utterance is aligned against a word template (TEMPLATE (WORD)) along the DTW warping path; recognition combines the acoustic score P(O|W) with the word prior P(W), where each word is built from phoneme models (e.g. "six" = S IH K S in the CMU pronouncing dictionary).
Reference: http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=six
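A minimal sketch of the DTW recurrence, using absolute difference as the local cost (real recognizers use distances between feature vectors, not scalars):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A slowed-down copy of a sequence still aligns with zero cost...
assert dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]) == 0.0
# ...whereas sample-by-sample comparison is not even defined for unequal lengths.
```

The quadratic table D accumulates the cheapest warping path, which is exactly what makes DTW robust to speaking-rate differences.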
Statistical speech recognition
[Figure] The recognizer maps the speech audio A to feature frames O (Feature Extraction), acoustic states Q (Frame Classification), phonemes L, e.g. "t ah m aa t ow" (Sequence Model), words W (Lexicon Model), and finally the sentence (Language Model), combining the factors P(A|O) · P(O|Q) · P(Q|L) · P(L|W) · P(W).
Statistical speech recognition: Frame classification
• The output probability of each state q is modeled by a Gaussian mixture model (GMM):
  P(o | q) = Σ_k w_{q,k} N(o; μ_{q,k}, Σ_{q,k})
[Figure] Each state (1, 2, 3) has its own mixture weights w_{q,k} over component Gaussians N_{q,k}(x).
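As an illustration of the mixture likelihood, a minimal diagonal-covariance GMM evaluation; the weights, means, and variances are made-up toy values, not trained parameters:

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log P(o | q) for one state q with a diagonal-covariance Gaussian mixture."""
    o = np.asarray(o, dtype=float)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # log of the diagonal Gaussian density N(o; mu, var)
        log_n = -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)
        log_terms.append(np.log(w) + log_n)
    # log sum_k w_k N(o; mu_k, var_k), computed stably in the log domain
    return np.logaddexp.reduce(log_terms)

# Two-component mixture over 2-D feature vectors (illustrative parameters)
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
ll_near = gmm_log_likelihood([0.1, -0.2], weights, means, variances)
ll_far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
assert ll_near > ll_far  # observations near a component score higher
```

Working in the log domain avoids the underflow that plagues products of many small frame likelihoods.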
HMM: Idea
• State probability
• State transition probability matrix
• Observation distribution given state
HMM: Model

Notation: Description
Q = q_1, q_2, ⋯, q_N: A set of N states
A = a_11, ⋯, a_ij, ⋯, a_NN: A transition probability matrix A, each a_ij representing the probability of moving from state i to state j, s.t. Σ_{j=1}^{N} a_ij = 1 for all i
O = o_1, o_2, ⋯, o_T: A sequence of T observations, each one drawn from a vocabulary V = v_1, v_2, ⋯, v_V
B = b_i(o_t): A sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
q_0, q_F: A special start state and end (final) state that are not associated with observations, together with transition probabilities a_01, a_02, ⋯, a_0N out of the start state and a_1F, a_2F, ⋯, a_NF into the end state

Reference: Daniel Jurafsky and James H. Martin, "Automatic Speech Recognition" (Chapter 9), Speech and Language Processing, 2nd edition, Pearson, 2008.
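These quantities are enough to score an observation sequence. A minimal sketch of the forward algorithm on a toy discrete-observation HMM (all numbers are illustrative, not from the module):

```python
import numpy as np

# A toy 2-state HMM with discrete observations {0, 1}
pi = np.array([0.6, 0.4])            # start probabilities (a_0j)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # transition matrix (a_ij)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # emission likelihoods b_i(o)

def forward(obs):
    """Likelihood P(O | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]        # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # sum over previous states, then emit
    return alpha.sum()

# Likelihoods over all length-2 observation sequences sum to 1
total = sum(forward([o1, o2]) for o1 in (0, 1) for o2 in (0, 1))
assert abs(total - 1.0) < 1e-12
```

The same alpha recursion, with max in place of sum, gives the Viterbi decoder used at recognition time.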
HMM: Tasks
HMM: Example
Example: an HMM with 2 states. State labels are initialized/estimated during training.
Data: 3 observation sequences of different lengths. All observations are vectors.

Sequence A
Time   1    2    3    4    5    6    7    8    9    10
State  S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9  o_10

Sequence B
Time   1    2    3    4    5    6    7    8    9
State  S2   S2   S1   S1   S2   S2   S2   S2   S1
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8  o_9

Sequence C
Time   1    2    3    4    5    6    7    8
State  S1   S2   S1   S1   S1   S2   S2   S2
Obs    o_1  o_2  o_3  o_4  o_5  o_6  o_7  o_8
HMM: Example
• Among the 3 sequences, 2 start with S1 and 1 starts with S2. So π(S1) = 2/3, π(S2) = 1/3
• S1 occurs 11 times in non-terminal positions; it is followed immediately by S1 6 times and by S2 5 times. So P(S1|S1) = 6/11, P(S2|S1) = 5/11
• S2 occurs 13 times in non-terminal positions; it is followed immediately by S1 5 times and by S2 8 times. So P(S1|S2) = 5/13, P(S2|S2) = 8/13
• The emission distributions of S1 and S2 are formulated as Gaussians fitted to their observation vectors.
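The counts quoted above can be checked mechanically from the three labelled state sequences:

```python
from collections import Counter

# The three labelled state sequences A, B, C from the example
seqs = [
    ["S1","S1","S2","S2","S2","S1","S1","S2","S1","S1"],  # A
    ["S2","S2","S1","S1","S2","S2","S2","S2","S1"],        # B
    ["S1","S2","S1","S1","S1","S2","S2","S2"],             # C
]

starts = Counter(s[0] for s in seqs)
trans = Counter((s[t], s[t + 1]) for s in seqs for t in range(len(s) - 1))
out = Counter(s[t] for s in seqs for t in range(len(s) - 1))  # non-terminal counts

pi = {q: starts[q] / len(seqs) for q in ("S1", "S2")}
A = {(i, j): trans[(i, j)] / out[i] for i in ("S1", "S2") for j in ("S1", "S2")}

assert pi["S1"] == 2/3 and pi["S2"] == 1/3
assert A[("S1", "S1")] == 6/11 and A[("S1", "S2")] == 5/11
assert A[("S2", "S1")] == 5/13 and A[("S2", "S2")] == 8/13
```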
Statistical speech recognition:
Language model
N-gram models: build the language model by computing probabilities from a text training corpus: how likely one word is to follow another.
Example: the Berkeley Restaurant Project corpus (9332 sentences, V = 1446 words). [Figure: bigram counts for eight of the words.] Sample sentences:
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
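Using just these six sample sentences, maximum-likelihood bigram estimation looks like this; the `<s>`/`</s>` sentence padding is the standard convention, not something shown on the slide:

```python
from collections import Counter

corpus = [
    "can you tell me about any good cantonese restaurants close by",
    "mid priced thai food is what i'm looking for",
    "tell me about chez panisse",
    "can you give me a listing of the kinds of food that are available",
    "i'm looking for a good place to eat breakfast",
    "when is caffe venezia open during the day",
]

# Count unigrams and bigrams over sentences padded with <s> and </s>
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

# "tell" is followed by "me" in both of its occurrences
assert p_bigram("tell", "me") == 1.0
# two of the six sentences start with "can"
assert p_bigram("<s>", "can") == 2/6
```

On a real corpus these raw counts would be smoothed so that unseen bigrams do not get zero probability.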
Statistical speech recognition:
Summary
Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …
Hidden Markov Model (HMM): [figure: the sub-phone states, e.g. 942 → 942 → 6, chained as an HMM with self-loops]
Deep speech recognition
• A typical speech recognizer uses the GMM-HMM framework; a DNN can replace the GMM for frame classification
• Alternatively, use a DNN to generate good features to feed into the general GMM-HMM framework
• DNNs can also be applied to language modelling
Source: https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a
Vision cognition: Marr theory
Reference
• https://en.wikipedia.org/wiki/David_Marr_(neuroscientist)
• T. Poggio, Afterword, in D. Marr, Vision, The MIT Press, 2010, p. 367
Vision cognition tasks
[Figure] The vision pipeline: image acquisition (application domain) → image enhancement → feature representation & description → object detection and recognition → scene understanding (analytics task).
Reference:
• Mo Zi, 400 B. C., https://en.wikipedia.org/wiki/Camera_obscura
• Picture source, http://www.sohu.com/a/140776287_736731
Vision cognitive system pipeline
[Figure] Image acquisition → image enhancement → image restoration → feature representation & description → object detection and recognition → scene understanding; the early stages form image processing and analysis on the application-domain side, the later stages the analytics task.
Online demo: http://demo.ipol.im/demo/54/
Vision cognitive system pipeline
[Figure] The pipeline again, illustrated with an edge-detection demo (below).
Online demo: http://bigwww.epfl.ch/demo/ip/demos/edgeDetector/
[Figure] The pipeline, illustrating feature representation & description with an affine-SIFT keypoint matching demo (below).
Online demo: http://demo.ipol.im/demo/my_affine_sift/
Vision cognitive system pipeline
[Figure] The pipeline, illustrating object detection and recognition with a face-based age-estimation demo (below).
Reference: https://www.how-old.net
[Figure] The pipeline, ending in scene understanding: describing the content of an image in words ("a picture is worth a thousand words").
Reference
• https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words
• https://www.phrases.org.uk/meanings/a-picture-is-worth-a-thousand-words.html
Speech-vision synchronization
Figure 6: The three modules in the Speech2Vid model. From top to bottom: (i) audio
encoder, (ii) identity encoder with a single still image input, and (iii) image decoder.
/2 refers to the stride of each kernel in a specific layer which is normally of equal
stride in both spatial dimensions except for the Pool2 layer in which we use stride 2
in the timestep dimension (denoted by /2t). The network includes two skip
connections between the identity encoder and the image decoder.
Reference: Joon Son Chung, Amir Jamaludin, Andrew Zisserman, You said that? BMVC 2017, https://arxiv.org/abs/1705.02966
Speech-visual source separation
Fig. 4. Our model’s multi-stream neural network-based architecture: The visual streams take as input thumbnails of detected
faces in each frame in the video, and the audio stream takes as input the video’s soundtrack, containing a mixture of speech and
background noise. The visual streams extract face embeddings for each thumbnail using a pretrained face recognition model,
then learn a visual feature using a dilated convolutional NN. The audio stream first computes the STFT of the input signal to
obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN. A joint, audio-visual
representation is then created by concatenating the learned visual and audio features, and is subsequently further processed
using a bidirectional LSTM and three fully connected layers. The network outputs a complex spectrogram mask for each speaker,
which is multiplied by the noisy input, and converted back to waveforms to obtain an isolated speech signal for each speaker.
Summary
Thank you!
Dr TIAN Jing
Email: tianjing@nus.edu.sg