
Research Issues in Speech Processing

Dr. M. Sabarimalai Manikandan Amrita University

Speech Production: the source-filter model


- The speech signal conveys the information contained in the spoken word
- Speech is a highly non-stationary signal, so it is analyzed in short segments (20 to 30 ms)
- Most of the acoustic energy lies in the frequency range of 100-6000 Hz

Vocal tract transfer function can be modeled by an all-pole filter
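The all-pole filter of the source-filter model can be estimated directly from a speech frame by linear prediction. A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion (the function names and the AR(1) test below are illustrative, not from the slides):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the normal equations for all-pole coefficients a[0..order]
    (with a[0] = 1) from autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)            # prediction error shrinks each order
    return a, err

def lpc(frame, order):
    """All-pole (LPC) coefficients of one windowed speech frame."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
    return levinson_durbin(r, order)
```

The vocal tract transfer function is then modeled as 1/A(z), with A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p.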

Speech Processing Tasks


- Speech recognition (recognizing lexical content)
- Speech synthesis (text-to-speech)
- Speaker recognition (recognizing who is speaking)
- Speech understanding and vocal dialog
- Speech coding (data rate reduction)
- Speech enhancement (noise reduction)
- Speech transmission (noise-free communication)
- Voice conversion

Speech Processing
Speech measurements:
- Short-time energy (STE)
- Zero crossing rate (ZCR)
- Autocorrelation (AC)
- Pitch period or frequency
- Formants

Speech signal components:
- Speech vs. silence (non-speech)
- Voiced vs. unvoiced speech
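The basic short-time measurements above each reduce to a few lines over a 20-30 ms frame. A minimal numpy sketch (the function names, lag range, and test tone are illustrative):

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples in one analysis frame."""
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def pitch_autocorrelation(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate: the autocorrelation peak inside a plausible lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag
```

Voiced frames typically show high STE and low ZCR; unvoiced frames show the opposite, which is the usual basis for the voiced/unvoiced decision.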

Speech Processing
Speech representations or models
Temporal features:
- Low energy rate
- Zero crossing rate (ZCR)
- 4 Hz modulation energy
- Pitch contour

Spectral features:
- Spectral centroid (sharpness)
- Spectral flux (rate of change)
- Spectral roll-off (spectral shape)
- Spectral flatness (deviation of the spectral form)
- Linear predictive coefficients (LPC)
- Cepstral coefficients
- Mel frequency cepstral coefficients (MFCC): motivated by the human auditory system
- Harmonic features: sinusoidal harmonic modelling
- Perceptual features: model of the human hearing process
- First-order derivative (DELTA) features
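Several of the spectral descriptors above reduce to a few lines over the magnitude spectrum of a windowed frame. A minimal numpy sketch (the 85% roll-off fraction and the test signals are illustrative choices):

```python
import numpy as np

def spectral_features(frame, fs):
    """Spectral centroid, 85% roll-off, and flatness of one frame (a sketch)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    centroid = float(np.sum(freqs * mag) / np.sum(mag))        # "sharpness"
    cum = np.cumsum(mag)
    rolloff = float(freqs[np.searchsorted(cum, 0.85 * cum[-1])])
    flatness = float(np.exp(np.mean(np.log(mag + 1e-12))) / np.mean(mag))
    return centroid, rolloff, flatness

def spectral_flux(mag_prev, mag):
    """Frame-to-frame rate of change of the magnitude spectrum."""
    return float(np.sqrt(np.sum((mag - mag_prev) ** 2)))
```

A pure tone yields a flatness near 0, white noise a flatness near 1, which is why flatness is a common tonality cue.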

Elements of the speech signal


Phonemes: the smallest units of speech sounds
- Vowels and consonants
  - ~12 to 21 different vowel sounds are used in the English language
  - Consonants involve rapid and sometimes subtle changes in sound according to the manner of articulation: plosive (p, b, t, etc.), fricative (f, s, sh, etc.), nasal (m, n, ng), liquid (r, l), and semivowel (w, y)

Consonants are more independent of language than vowels are.

- Syllable: one or more phonemes
- Word: one or more syllables

Automatic Speech Recognition


There are two main uses for speech recognition systems:
- Dictation: translation of the spoken word into written text
- Computer control: controlling the computer and software applications by speaking commands

System types:
- Speaker-dependent system: operates for a single speaker
- Speaker-independent system: operates for any speaker of a particular type
- Speaker-adaptive system: adapts its operation to the characteristics of new speakers

The size of the vocabulary affects the complexity, processing requirements, and accuracy of the system.

Speech Recognition: Applications


- Automatic translation
- Vehicle navigation systems
- Human-computer interaction
- Content-based spoken audio search
- Home automation
- Pronunciation evaluation
- Robotics
- Video games
- Transcription of speech into mobile text messages
- Assistive technology for people with disabilities

Speech Recognition System


- Sampling of speech
- Acoustic signal processing (feature extraction):
  - Linear prediction cepstral coefficients (LPCC)
  - Mel frequency cepstral coefficients (MFCC)
  - Perceptual linear prediction cepstral coefficients (PLPCC)
- Recognition of phonemes, groups of phonemes, and words:
  - Dynamic time warping (DTW)
  - Hidden Markov models (HMMs)
  - Gaussian mixture models (GMMs)
  - Neural networks (NNs)
  - Expert systems and combinations of techniques
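Of the recognition techniques listed, DTW is the simplest to sketch: it aligns two feature sequences of different lengths along a minimum-cost warping path. A toy implementation (real recognizers apply this to MFCC frames, not raw values):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum cumulative frame-to-frame distance between two
    feature sequences (rows = frames) under time warping."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

In isolated-word recognition a test utterance is assigned to the reference template with the smallest DTW distance, which makes the matching insensitive to speaking rate.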

Automatic Speaker Recognition


Speaker recognition: the process of automatically recognizing who is speaking by using the speaker-specific information in speech sounds.

Speaker identity is carried by physiological and behavioral characteristics of an individual speaker's speech production:
- the spectral envelope (vocal tract characteristics)
- the supra-segmental features (voice source characteristics) of speech

Applications:
- banking over a telephone network
- telephone shopping and database access services
- voice dialing, voice mail, information and reservation services
- security control for confidential information
- forensics and surveillance applications

Speaker Recognition
Speaker identification: the process of determining which registered speaker provides input speech sounds
[Block diagram: Input speech → Feature extraction → Similarity computed against each reference template or model (speakers #1 ... #N) → Maximum selection → Identification result (speaker ID)]
Speaker Recognition
Speaker verification: the process of accepting or rejecting the identity claim of a speaker.
[Block diagram: Input speech → Feature extraction → Similarity against the claimed speaker's reference template or model (speaker #M) → Decision against a threshold → Verification result (accept/reject)]
- Open-set and closed-set recognition
- Text-dependent and text-independent recognition
- Modeling techniques:
  - Vector quantization (VQ)
  - Gaussian mixture models (GMM)
  - Dynamic time warping (DTW)
  - Hidden Markov models (HMM)
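Of the modeling techniques listed, vector quantization is the easiest to sketch: each registered speaker gets a codebook trained on their feature vectors, and identification picks the codebook with the lowest average distortion. A toy k-means version on synthetic 2-D "features" (real systems train on MFCC vectors; all names and parameters here are illustrative):

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """Toy VQ codebook: k-means over one speaker's training feature vectors."""
    rng = np.random.default_rng(seed)
    cb = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest codeword, then recompute centroids
        d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
        lab = d.argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                cb[j] = features[lab == j].mean(axis=0)
    return cb

def vq_distortion(features, cb):
    """Average distance from test vectors to their nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def identify(test_features, codebooks):
    """Pick the registered speaker whose codebook gives the lowest distortion."""
    scores = [vq_distortion(test_features, cb) for cb in codebooks]
    return int(np.argmin(scores))
```

For verification rather than identification, the same distortion score is simply compared against a threshold for the single claimed speaker.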

Text-to-Speech (TTS) System


Synthesis of speech for effective human-machine communication:
- reading email messages
- call center help desks and customer care
- announcement machines

[Pipeline: Raw or tagged text
→ Text analysis (document structure detection, text normalization, linguistic analysis)
→ Phonetic analysis (homograph disambiguation, grapheme-to-phoneme conversion)
→ Prosodic analysis (pitch, duration)
→ Speech synthesis (voice rendering)
→ Synthetic speech]
Synthetic speech should be intelligible and natural

Speech Synthesis
Text-to-speech (TTS) synthesis systems

TTS system performance measures:
- synthetic speech intelligibility
- synthetic speech naturalness

Speech intelligibility tests:
- Segmental-level analysis:
  - the Rhyme Test
  - the Modified Rhyme Test
  - the Diagnostic Rhyme Test
- Supra-segmental analysis:
  - the Harvard Psychoacoustic Sentences (HPS)
  - the Haskins syntactic sentences

Speech Coding (Compression)


Speech coding for efficient transmission and storage of speech:
- narrowband and broadband wired telephony
- cellular communications
- Voice over IP (VoIP) to utilize the Internet
- telephone answering machines
- IVR systems
- prerecorded messages
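The simplest coder behind narrowband telephony is waveform companding: a logarithmic (μ-law, as in ITU-T G.711) characteristic gives low-level speech samples finer effective quantization steps than uniform coding. A continuous-valued sketch (the real standard uses a segmented 8-bit approximation; the quantizer here is illustrative):

```python
import numpy as np

MU = 255.0  # G.711-style mu-law constant

def mulaw_encode(x):
    """Compress samples in [-1, 1] with the mu-law characteristic."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    """Invert the mu-law characteristic."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize(y, bits=8):
    """Uniformly quantize samples in [-1, 1] to 2**bits - 1 levels."""
    levels = 2 ** bits - 1
    return np.round((y + 1) / 2 * levels) / levels * 2 - 1
```

Quantizing in the companded domain and decoding back concentrates the quantization error where the signal is loud, which is why 8-bit μ-law speech sounds far better than 8-bit uniform PCM.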

Speech-Assisted Translation Corrector System

Objective: Develop a speech-assisted translation corrector (SATC) system that provides a grammatically correct sentence for a translated sentence produced by machine translation.

[Flow: input sentence → machine translation → translated sentence with grammatical errors → SATC system → grammatically correct sentence]

Multilingual Machine Translation

[Diagram: source text (e.g., "He came here") → Translator → translated text / speech → storage]

The speech signal is produced from the words in the translated sentence.

An MT system is correct and complete if it can analyze all of the grammatical structures encountered in the source language, and it can generate all of the grammatical structures necessary in the target language.
8/25/2011

SATC System: Requirements and Challenging Tasks


- Creation of large-scale, rich multilingual speech databases is a crucial task for research and development in language and speech technology:
  - Indian-language speakers (10 males and 10 females)
  - age groups (<20, 15-40, >40)
  - audio format: 16-bit stereo, sampling rate of 44.1 kHz
  - annotation and assessment of speech databases
- Development of a multilingual text-to-speech interface
- Development of a spoken-word matching module
- Development of speech signal processing (SSP) tools


Major Problems in Speech Processing


- Acoustic variability: the same phoneme pronounced in different contexts has different acoustic realizations (the coarticulation effect). The signal also differs when speech is uttered in various environments:
  - noise
  - reverberation
  - different types of microphones
- Speaking variability: the same speaker may speak normally, shout, whisper, use a creaky voice, or have a cold
- Speaker variability: different speakers have different timbres and different speaking habits

Major Problems in Speech Processing


- Linguistic variability: the same sentence can be pronounced in many different ways, using different words, synonyms, syntactic structures, and prosodic schemes
- Phonetic variability: due to the different possible pronunciations of the same words by speakers with different regional accents
- Lombard effect: noise modifies the utterance of words, as people tend to speak louder in noise

Major Problems in Speech Processing


- Continuous speech: words are connected together, not separated by pauses or silences, so it is difficult to find the start and end points of words
- The production of each phoneme is affected by the production of surrounding phonemes
- The start and end of words are affected by the preceding and following words
- The rate of speech (fast speech tends to be harder to recognize)
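The word start/end-point problem is often attacked first with a crude short-time energy threshold. A minimal sketch (the frame length and the 10% threshold are illustrative choices, not from the slides):

```python
import numpy as np

def find_endpoints(signal, fs, frame_ms=20, thresh_ratio=0.1):
    """Return (start, end) sample indices spanning the first and last frames
    whose short-time energy exceeds a fraction of the peak frame energy."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > thresh_ratio * energy.max())[0]
    return int(active[0]) * n, int(active[-1] + 1) * n
```

An energy threshold alone misses weak fricatives at word boundaries, which is why practical endpoint detectors also consult the zero crossing rate.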

References
- M. Honda, "Speech synthesis technology based on speech production mechanism: how to observe and mimic speech production by humans," Journal of the Acoustical Society of Japan, vol. 55, no. 11, pp. 777-782, 1999.
- S. Saito and K. Nakata, Fundamentals of Speech Signal Processing, 1981.
- M. Honda, H. Gomi, T. Ito, and A. Fujino, "Mechanism of articulatory cooperated movements in speech production," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, vol. 1, pp. 283-286, 2001.
- T. Kaburagi and M. Honda, "A model of articulator trajectory formation based on the motor tasks of vocal-tract shapes," J. Acoust. Soc. Am., vol. 99, pp. 3154-3170, 1996.
- S. Suzuki, T. Okadome, and M. Honda, "Determination of articulatory positions from speech acoustics by applying dynamic articulatory constraints," Proc. ICSLP '98, pp. 2251-2254, 1998.
- C. Benoit and M. Grice, "The SUS test: a method for the assessment of text-to-speech intelligibility using Semantically Unpredictable Sentences," Speech Communication, vol. 18, pp. 381-392.
