Automatic Speech Recognition
What is the task?
Getting a computer to understand spoken
language
By "understand" we might mean:
React appropriately
Convert the input speech into another medium,
e.g. text
How do humans do it?
Articulation produces
sound waves which
the ear conveys to the brain
for processing
How do computers do it?
Acoustic waveform
Acoustic signal
Digitization
Acoustic analysis of the speech
signal
Phoneme dictionary
Language model
Speech recognition
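The digitization and acoustic analysis steps listed above can be sketched roughly as follows. This is only an illustration, not the front end of any particular recognizer: the 16 kHz sample rate, the 25 ms frame with a 10 ms hop, the synthetic sine-wave input, and the log-energy feature are all assumptions chosen to keep the example small. Real systems typically compute richer features per frame.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-size frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """Log of the mean squared amplitude: a very crude one-number feature."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)  # small constant avoids log(0) on silence

# Assumed input: one second of a 440 Hz tone, already digitized at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Assumed analysis window: 25 ms frames taken every 10 ms.
frame_len = 25 * sample_rate // 1000   # 400 samples
hop = 10 * sample_rate // 1000         # 160 samples

frames = frame_signal(samples, frame_len, hop)
features = [log_energy(f) for f in frames]  # one feature value per frame
print(len(frames), "frames of", frame_len, "samples each")
```

Each entry of `features` corresponds to one short slice of the waveform; the later stages (phoneme dictionary, language model) operate on such per-frame values rather than on raw samples.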
Multilingual Architecture
Multilingual speakers already outnumber
monolingual speakers.
The capacity to transparently recognize multiple
spoken languages is a desirable feature of ASR
systems.
e.g. OK Google, Siri
Multilingual Techniques
Universal Speech Model
Language Identification (LID) classifiers
Monolingual speech recognizers decode along
with LID (Confidence Score)
Dynamic confidence score and LID decision
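One way to picture the combined confidence score and LID decision is the sketch below. It assumes that each monolingual recognizer returns a transcript with a confidence in [0, 1], that the LID classifier returns one score per language, and that the two are combined by a simple weighted sum. The function name `pick_result`, the `lid_weight` parameter, and the example scores are hypothetical; the slides do not specify the actual combination rule.

```python
def pick_result(hypotheses, lid_scores, lid_weight=0.5):
    """Choose one recognizer output using LID score and recognizer confidence.

    hypotheses: dict mapping language -> (transcript, recognizer confidence)
    lid_scores: dict mapping language -> LID classifier score
    lid_weight: assumed interpolation weight between the two scores
    """
    best_lang = None
    best_score = float("-inf")
    for lang, (_text, conf) in hypotheses.items():
        # Weighted sum of LID score and recognizer confidence (an assumption).
        score = lid_weight * lid_scores.get(lang, 0.0) + (1 - lid_weight) * conf
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, hypotheses[best_lang][0], best_score

# Made-up outputs from two monolingual recognizers run on the same utterance.
hypotheses = {
    "en-US": ("what is the weather", 0.5),
    "hi-IN": ("mausam kaisa hai", 0.75),
}
lid_scores = {"en-US": 0.25, "hi-IN": 0.75}

lang, text, score = pick_result(hypotheses, lid_scores)
print(lang, text, score)
```

With these illustrative numbers the Hindi hypothesis wins because both the LID score and the recognizer confidence favour it.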
ASR Multilingual Design
The end-to-end multilingual speech recognition system consists of the
following components:
1. Client
2. Frontend
-Recognize
-Recognize+Search+Synthesis
-Multi-recognize+Search+Synthesis
3. Backend
-LID Backend
-Speech Recognizer Backend
-Web Search Backend
-Voice Synthesizer Backend
Multirecognizer Module
Representation of Speech & Speech
Signal
Grammar & Syntax
-How the occurrence of words in sequence is governed
Lexicon or Dictionary
- How a word is supposed to be pronounced as a
sequence of unitary sounds
Acoustic-phonetics
-How a unitary sound and/or a sequence of unitary sounds
are supposed to be produced with the articulatory
apparatus
THE HIDDEN MARKOV MODEL
The input audio waveform from a microphone is converted into a sequence of
fixed-size acoustic vectors $Y_{1:T} = y_1, \ldots, y_T$ in a process called feature
extraction [3]. The decoder then attempts to find the sequence of words
$w_{1:L} = w_1, \ldots, w_L$ that is most likely to have generated $Y$, i.e. the decoder tries to
find
$$\hat{w} = \arg\max_{w} \{ P(w \mid Y) \}.$$
However, since $P(w \mid Y)$ is difficult to model directly, Bayes' rule is used
to transform the above equation into the equivalent problem of finding
$$\hat{w} = \arg\max_{w} \{ p(Y \mid w)\, P(w) \}.$$
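A toy illustration of this decision rule: for each candidate word sequence w, add the acoustic log-likelihood log p(Y|w) to the language model log-prior log P(w) and keep the highest-scoring candidate. The candidate sentences and all numbers below are invented for illustration, and a real decoder searches a far larger hypothesis space instead of enumerating a short list.

```python
import math

def decode(candidates, acoustic_loglik, language_model):
    """Return the candidate w maximizing log p(Y|w) + log P(w)."""
    return max(
        candidates,
        key=lambda w: acoustic_loglik[w] + math.log(language_model[w]),
    )

candidates = ["recognize speech", "wreck a nice beach"]

# Assumed acoustic scores log p(Y|w): the two phrases sound almost alike,
# so the acoustic model alone slightly prefers the wrong one.
acoustic_loglik = {"recognize speech": -10.0, "wreck a nice beach": -9.5}

# Assumed language model priors P(w): the first phrase is far more plausible.
language_model = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}

best = decode(candidates, acoustic_loglik, language_model)
print(best)
```

Working in the log domain turns the product p(Y|w) P(w) into a sum, which avoids numerical underflow when many small probabilities are multiplied.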
Architecture of HMM-Based
Recognizer
The overall HMM-based speech recognition system includes:
Feature Analysis
Unit Matching System
Lexical Decoding
Syntactic Analysis
Semantic Analysis
Phoneme and Topologies
Composite HMM for Viterbi Recognition (Pronunciation Dictionary)
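Viterbi decoding over a small left-to-right HMM can be sketched as follows. The example assumes a single word whose pronunciation dictionary entry is the phoneme sequence k, ae, t, with one HMM state per phoneme, and it assumes the acoustic vectors have been quantized to three discrete symbols "A", "B", "C". All probabilities are made up; a composite HMM in a real recognizer chains many such word models and uses continuous emission densities.

```python
import math

def ln(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Assumed left-to-right phoneme model for one word: k -> ae -> t.
states = ["k", "ae", "t"]
start = {"k": 1.0}
trans = {"k": {"k": 0.5, "ae": 0.5}, "ae": {"ae": 0.5, "t": 0.5}, "t": {"t": 1.0}}
emit = {
    "k": {"A": 0.8, "B": 0.1, "C": 0.1},
    "ae": {"A": 0.1, "B": 0.8, "C": 0.1},
    "t": {"A": 0.1, "B": 0.1, "C": 0.8},
}

log_start = {s: ln(start.get(s, 0.0)) for s in states}
log_trans = {p: {s: ln(trans[p].get(s, 0.0)) for s in states} for p in states}
log_emit = {s: {o: ln(pr) for o, pr in emit[s].items()} for s in states}

observations = ["A", "A", "B", "C", "C"]
path, score = viterbi(observations, states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

The returned path assigns each observation frame to a phoneme state, which is the alignment the recognizer uses to score the word against the audio.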