
Research Issues in Speech Processing

Dr. M. Sabarimalai Manikandan Amrita University

Speech Production: the source-filter model


- The speech signal conveys the information contained in the spoken word
- Speech is a highly non-stationary signal, so it is analyzed in short segments (20 to 30 ms)
- Most of the acoustic energy lies in the frequency range of 100-6000 Hz

Vocal tract transfer function can be modeled by an all-pole filter
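The all-pole filter of the source-filter model can be estimated directly from a speech frame by linear prediction. A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion (the function names and the AR(1) test below are illustrative, not from the slides):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the normal equations for all-pole coefficients a[0..order]
    (with a[0] = 1) from autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)            # prediction error shrinks each order
    return a, err

def lpc(frame, order):
    """All-pole (LPC) coefficients of one windowed speech frame."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
    return levinson_durbin(r, order)
```

The vocal tract transfer function is then modeled as 1/A(z), with A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p.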

Speech Processing Tasks


- Speech recognition (recognizing lexical content)
- Speech synthesis (text-to-speech)
- Speaker recognition (recognizing who is speaking)
- Speech understanding and vocal dialog
- Speech coding (data rate reduction)
- Speech enhancement (noise reduction)
- Speech transmission (noise-free communication)
- Voice conversion

Speech Processing
Speech measurements:
- Short-time energy (STE)
- Zero crossing rate (ZCR)
- Autocorrelation (AC)
- Pitch period or frequency
- Formants

Speech signal components:
- Speech vs. silence (non-speech)
- Voiced vs. unvoiced speech
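The basic short-time measurements above each reduce to a few lines over a 20-30 ms frame. A minimal numpy sketch (the function names, lag range, and test tone are illustrative):

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples in one analysis frame."""
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def pitch_autocorrelation(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate: the autocorrelation peak inside a plausible lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag
```

Voiced frames typically show high STE and low ZCR; unvoiced frames show the opposite, which is the usual basis for the voiced/unvoiced decision.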

Speech Processing
Speech representations or models
Temporal features:
- Low energy rate
- Zero crossing rate (ZCR)
- 4 Hz modulation energy
- Pitch contour

Spectral features:
- Spectral centroid (sharpness)
- Spectral flux (rate of change)
- Spectral roll-off (spectral shape)
- Spectral flatness (deviation of the spectral form)
- Linear predictive coefficients (LPC)
- Cepstral coefficients
- Mel frequency cepstral coefficients (MFCC): motivated by the human auditory system
- Harmonic features: sinusoidal harmonic modelling
- Perceptual features: model of the human hearing process
- First-order derivative (DELTA) features
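Several of the spectral descriptors above reduce to a few lines over the magnitude spectrum of a windowed frame. A minimal numpy sketch (the 85% roll-off fraction and the test signals are illustrative choices):

```python
import numpy as np

def spectral_features(frame, fs):
    """Spectral centroid, 85% roll-off, and flatness of one frame (a sketch)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    centroid = float(np.sum(freqs * mag) / np.sum(mag))        # "sharpness"
    cum = np.cumsum(mag)
    rolloff = float(freqs[np.searchsorted(cum, 0.85 * cum[-1])])
    flatness = float(np.exp(np.mean(np.log(mag + 1e-12))) / np.mean(mag))
    return centroid, rolloff, flatness

def spectral_flux(mag_prev, mag):
    """Frame-to-frame rate of change of the magnitude spectrum."""
    return float(np.sqrt(np.sum((mag - mag_prev) ** 2)))
```

A pure tone yields a flatness near 0, white noise a flatness near 1, which is why flatness is a common tonality cue.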

Elements of the speech signal


Phonemes: the smallest units of speech sounds
- Vowels and consonants
  - ~12 to 21 different vowel sounds are used in the English language
  - Consonants involve rapid and sometimes subtle changes in sound according to the manner of articulation: plosive (p, b, t, etc.), fricative (f, s, sh, etc.), nasal (m, n, ng), liquid (r, l), and semivowel (w, y)

Consonants are more independent of language than vowels are.

- Syllable: one or more phonemes
- Word: one or more syllables

Automatic Speech Recognition


There are two main uses for speech recognition systems:
- Dictation: translation of the spoken word into written text
- Computer control: controlling the computer and software applications by speaking commands

System types:
- Speaker-dependent system: operates for a single speaker
- Speaker-independent system: operates for any speaker of a particular type
- Speaker-adaptive system: adapts its operation to the characteristics of new speakers

The size of the vocabulary affects the complexity, processing requirements, and accuracy of the system.

Speech Recognition: Applications


- Automatic translation
- Vehicle navigation systems
- Human-computer interaction
- Content-based spoken audio search
- Home automation
- Pronunciation evaluation
- Robotics
- Video games
- Transcription of speech into mobile text messages
- Assistive technology for people with disabilities

Speech Recognition System


- Sampling of speech
- Acoustic signal processing (feature extraction):
  - Linear prediction cepstral coefficients (LPCC)
  - Mel frequency cepstral coefficients (MFCC)
  - Perceptual linear prediction cepstral coefficients (PLPCC)
- Recognition of phonemes, groups of phonemes, and words:
  - Dynamic time warping (DTW)
  - Hidden Markov models (HMMs)
  - Gaussian mixture models (GMMs)
  - Neural networks (NNs)
  - Expert systems and combinations of techniques
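Of the recognition techniques listed, DTW is the simplest to sketch: it aligns two feature sequences of different lengths along a minimum-cost warping path. A toy implementation (real recognizers apply this to MFCC frames, not raw values):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum cumulative frame-to-frame distance between two
    feature sequences (rows = frames) under time warping."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

In isolated-word recognition a test utterance is assigned to the reference template with the smallest DTW distance, which makes the matching insensitive to speaking rate.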

Automatic Speaker Recognition


Speaker recognition: the process of automatically recognizing who is speaking by using the speaker-specific information in speech sounds.

Speaker identity is carried by physiological and behavioral characteristics of an individual speaker's speech production:
- the spectral envelope (vocal tract characteristics)
- the supra-segmental features (voice source characteristics) of speech

Applications:
- banking over a telephone network
- telephone shopping and database access services
- voice dialing, voice mail, information and reservation services
- security control for confidential information
- forensics and surveillance applications

Speaker Recognition
Speaker identification: the process of determining which registered speaker provides input speech sounds
[Block diagram: Input speech → Feature extraction → Similarity computed against each reference template or model (speakers #1 ... #N) → Maximum selection → Identification result (speaker ID)]
Speaker Recognition
Speaker verification: the process of accepting or rejecting the identity claim of a speaker.
[Block diagram: Input speech → Feature extraction → Similarity against the claimed speaker's reference template or model (speaker #M) → Decision against a threshold → Verification result (accept/reject)]
- Open-set and closed-set recognition
- Text-dependent and text-independent recognition
- Modeling techniques:
  - Vector quantization (VQ)
  - Gaussian mixture models (GMM)
  - Dynamic time warping (DTW)
  - Hidden Markov models (HMM)
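Of the modeling techniques listed, vector quantization is the easiest to sketch: each registered speaker gets a codebook trained on their feature vectors, and identification picks the codebook with the lowest average distortion. A toy k-means version on synthetic 2-D "features" (real systems train on MFCC vectors; all names and parameters here are illustrative):

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """Toy VQ codebook: k-means over one speaker's training feature vectors."""
    rng = np.random.default_rng(seed)
    cb = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest codeword, then recompute centroids
        d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
        lab = d.argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                cb[j] = features[lab == j].mean(axis=0)
    return cb

def vq_distortion(features, cb):
    """Average distance from test vectors to their nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def identify(test_features, codebooks):
    """Pick the registered speaker whose codebook gives the lowest distortion."""
    scores = [vq_distortion(test_features, cb) for cb in codebooks]
    return int(np.argmin(scores))
```

For verification rather than identification, the same distortion score is simply compared against a threshold for the single claimed speaker.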

Text-to-Speech (TTS) System


Synthesis of speech for effective human-machine communication:
- reading email messages
- call center help desks and customer care
- announcement machines

[Pipeline: Raw or tagged text
→ Text analysis (document structure detection, text normalization, linguistic analysis)
→ Phonetic analysis (homograph disambiguation, grapheme-to-phoneme conversion)
→ Prosodic analysis (pitch, duration)
→ Speech synthesis (voice rendering)
→ Synthetic speech]
Synthetic speech should be intelligible and natural

Speech Synthesis
Text-to-speech (TTS) synthesis systems

TTS system performance measures:
- synthetic speech intelligibility
- synthetic speech naturalness

Speech intelligibility tests:
- Segmental-level analysis:
  - the Rhyme Test
  - the Modified Rhyme Test
  - the Diagnostic Rhyme Test
- Supra-segmental analysis:
  - the Harvard Psychoacoustic Sentences (HPS)
  - the Haskins syntactic sentences

Speech Coding (Compression)


Speech coding for efficient transmission and storage of speech:
- narrowband and broadband wired telephony
- cellular communications
- Voice over IP (VoIP) to utilize the Internet
- telephone answering machines
- IVR systems
- prerecorded messages
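The simplest coder behind narrowband telephony is waveform companding: a logarithmic (μ-law, as in ITU-T G.711) characteristic gives low-level speech samples finer effective quantization steps than uniform coding. A continuous-valued sketch (the real standard uses a segmented 8-bit approximation; the quantizer here is illustrative):

```python
import numpy as np

MU = 255.0  # G.711-style mu-law constant

def mulaw_encode(x):
    """Compress samples in [-1, 1] with the mu-law characteristic."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    """Invert the mu-law characteristic."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize(y, bits=8):
    """Uniformly quantize samples in [-1, 1] to 2**bits - 1 levels."""
    levels = 2 ** bits - 1
    return np.round((y + 1) / 2 * levels) / levels * 2 - 1
```

Quantizing in the companded domain and decoding back concentrates the quantization error where the signal is loud, which is why 8-bit μ-law speech sounds far better than 8-bit uniform PCM.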

Speech-Assisted Translation Corrector System

Objective: Develop a speech-assisted translation corrector (SATC) system that provides a grammatically correct sentence for a translated sentence produced by machine translation.

[Flow: input sentence → machine translation → translated sentence with grammatical errors → SATC system → grammatically correct sentence]

Multilingual Machine Translation

[Diagram: source text (e.g., "He came here") → Translator → translated text / speech → storage]

The speech signal is produced from the words in the translated sentence.

An MT system is correct and complete if it can analyze all of the grammatical structures encountered in the source language, and it can generate all of the grammatical structures necessary in the target language.
8/25/2011

SATC System: Requirements and Challenging Tasks


- Creation of large-scale, rich multilingual speech databases is a crucial task for research and development in language and speech technology:
  - Indian-language speakers (10 males and 10 females)
  - age groups (<20, 15-40, >40)
  - audio format: 16-bit stereo, sampling rate of 44.1 kHz
  - annotation and assessment of speech databases
- Development of a multilingual text-to-speech interface
- Development of a spoken-word matching module
- Development of speech signal processing (SSP) tools


Major Problems in Speech Processing


- Acoustic variability: the same phoneme pronounced in different contexts has different acoustic realizations (the coarticulation effect). The signal also differs when speech is uttered in various environments:
  - noise
  - reverberation
  - different types of microphones
- Speaking variability: the same speaker may speak normally, shout, whisper, use a creaky voice, or have a cold
- Speaker variability: different speakers have different timbres and different speaking habits

Major Problems in Speech Processing


- Linguistic variability: the same sentence can be pronounced in many different ways, using different words, synonyms, syntactic structures, and prosodic schemes
- Phonetic variability: due to the different possible pronunciations of the same words by speakers with different regional accents
- Lombard effect: noise modifies the utterance of words, as people tend to speak louder in noise

Major Problems in Speech Processing


- Continuous speech: words are connected together, not separated by pauses or silences, so it is difficult to find the start and end points of words
- The production of each phoneme is affected by the production of surrounding phonemes
- The start and end of words are affected by the preceding and following words
- The rate of speech (fast speech tends to be harder to recognize)
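The word start/end-point problem is often attacked first with a crude short-time energy threshold. A minimal sketch (the frame length and the 10% threshold are illustrative choices, not from the slides):

```python
import numpy as np

def find_endpoints(signal, fs, frame_ms=20, thresh_ratio=0.1):
    """Return (start, end) sample indices spanning the first and last frames
    whose short-time energy exceeds a fraction of the peak frame energy."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > thresh_ratio * energy.max())[0]
    return int(active[0]) * n, int(active[-1] + 1) * n
```

An energy threshold alone misses weak fricatives at word boundaries, which is why practical endpoint detectors also consult the zero crossing rate.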

References
- M. Honda, "Speech synthesis technology based on speech production mechanism: how to observe and mimic speech production by humans," Journal of the Acoustical Society of Japan, vol. 55, no. 11, pp. 777-782, 1999.
- S. Saito and K. Nakata, Fundamentals of Speech Signal Processing, 1981.
- M. Honda, H. Gomi, T. Ito, and A. Fujino, "Mechanism of articulatory cooperated movements in speech production," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, vol. 1, pp. 283-286, 2001.
- T. Kaburagi and M. Honda, "A model of articulator trajectory formation based on the motor tasks of vocal-tract shapes," J. Acoust. Soc. Am., vol. 99, pp. 3154-3170, 1996.
- S. Suzuki, T. Okadome, and M. Honda, "Determination of articulatory positions from speech acoustics by applying dynamic articulatory constraints," Proc. ICSLP '98, pp. 2251-2254, 1998.
- C. Benoit and M. Grice, "The SUS test: a method for the assessment of text-to-speech intelligibility using Semantically Unpredictable Sentences," Speech Communication, vol. 18, pp. 381-392.
