
Speech starts as message information, which is converted into a set of neural signals. These signals control the articulatory mechanism, which generates an acoustic waveform containing the information in the original message.

Message Information

Neural signal

Articulatory mechanism

Acoustic waveform


Speech is a concatenation of elements from a finite set of phonemes

Each language has a distinct set of phonemes, typically in the range of 30 to 50

English has around 42 phonemes

A six-bit numerical code is sufficient for numbering the phonemes

Average of about 10 phonemes per second

Total of about 60 bits per second as the average information rate

Concerns:

Representation of the message content

Representation in a form convenient for transmission or storage
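The 60 bits per second figure follows from simple arithmetic: a fixed-length code for 42 phonemes needs 6 bits, and about 10 phonemes are produced per second. The sketch below reproduces that estimate. The function names are illustrative only, and the calculation assumes a fixed-length code with every phoneme treated alike, so it is a rough bound rather than a measured rate.

```python
import math


def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest fixed code length (in bits) that gives every symbol a distinct code."""
    if alphabet_size < 1:
        raise ValueError("alphabet_size must be at least 1")
    return math.ceil(math.log2(alphabet_size))


def information_rate(alphabet_size: int, symbols_per_second: float) -> float:
    """Bits per second for a fixed-length code at the given symbol rate."""
    return bits_per_symbol(alphabet_size) * symbols_per_second


if __name__ == "__main__":
    # Figures taken from the text: about 42 English phonemes, about 10 per second.
    print(bits_per_symbol(42))
    print(information_rate(42, 10))
```

With the figures from the text, the first call gives 6 bits and the second gives 60 bits per second.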

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing.

Information source

Measurement or observation

Signal representation

Signal transformation

Extraction & utilization of information

Signal processing




In a system using text-dependent speech, the individual presents either a fixed phrase (a password) or a prompted phrase ("Please say the numbers 33-5463") that is programmed into the system. This can improve performance, especially with cooperative users.

A text-independent system has no advance knowledge of the presenter's phrasing. It is much more flexible in situations where the individual submitting the sample may be unaware of the collection or unwilling to cooperate, but recognition under these conditions is a more difficult challenge.

Speech coding is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation, based on audio signal processing techniques, to model the speech signal, combined with generic data compression algorithms to represent the resulting model parameters in a compact bit stream. The two most important applications of speech coding are mobile telephony and Voice over IP.

Voice analysis is the study of speech sounds for purposes other than recovering linguistic content, which is the goal of speech recognition. Such studies consist mostly of medical analysis of the voice (phoniatrics), but also include speaker identification. More controversially, some believe that the truthfulness or emotional state of speakers can be determined using Voice Stress Analysis or Layered Voice Analysis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations, such as phonetic transcriptions, into speech. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer.

Speech enhancement aims to improve speech quality by using various algorithms. The objective of enhancement is to improve the intelligibility and/or overall perceptual quality of a degraded speech signal using audio signal processing techniques. Enhancement of speech degraded by noise, or noise reduction, is the most important field of speech enhancement, and is used in many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids.

As basic parameters in speech processing we regard:

Pitch

Duration

Intensity

Voice quality

Signal-to-noise ratio

Voice activity detection

Strength of the Lombard effect

In speech recognition, speech synthesis and speaker characterization, basic parameters are needed that are crucial for good system performance. There are two sets of parameters. The first is related to prosody:

Pitch

Duration

Intensity

The second characterizes the acoustic properties of the environment, including its impact on the speaker's voice:

Voice quality

Signal-to-noise ratio

Voice activity detection

Strength of the Lombard effect
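Two of the prosodic parameters, pitch and intensity, can be estimated from a short frame of samples. The sketch below is one simple illustration and not the method of any particular system: intensity is taken as the RMS level in decibels relative to full scale, and pitch is taken from the strongest autocorrelation peak. The function names, the 60 to 400 Hz search range and the synthetic test tone are all assumptions made for the example.

```python
import math


def rms_db(frame):
    """Intensity of a frame as RMS level in dB (relative to amplitude 1.0)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch in Hz from the lag with the largest autocorrelation.

    Returns 0.0 when no positive autocorrelation peak is found.
    The search range fmin..fmax is an assumed range for adult speech.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def make_tone(freq, sample_rate, n_samples, amplitude=0.5):
    """Synthetic sine tone used here in place of a real voiced frame."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n_samples)]


if __name__ == "__main__":
    tone = make_tone(200.0, 8000, 400)  # 50 ms frame at 8 kHz
    print(estimate_pitch(tone, 8000))
    print(rms_db(tone))
```

Real speech is noisier than a pure tone, so practical pitch trackers usually add normalization, voicing decisions and smoothing across frames on top of a basic estimate like this one.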

When adverse conditions are also taken into account, the performance of many published algorithms for automatically extracting those parameters from the speech signal is not known. A framework based on competitive evaluation is proposed to push algorithmic research and to make progress comparable.

Personal voice qualities differ in the speaker's use of temporal structures, articulation precision, vocal effort and type of phonation. Temporal structures can be measured directly in the acoustic signal, and conclusions about articulation precision can be drawn from the formant structure. These voice quality percepts are a combination of several acoustic voice quality parameters. In an investigation of emotionally loaded speech material, it could be shown that the named acoustic parameters are useful for differentiating between the emotions happiness, sadness, anger, fear and boredom.

The signal-to-noise ratio (SNR) is an important feature in determining the quality of audio data. This is particularly important in speech recognition technology since it is well known that recognition performance is strongly influenced by the SNR. In most applications the SNR cannot be easily derived since the noise energy is not known. Further, the question arises as to what is "signal" and what is "noise". For example, would a cough or breath noise be considered part of the "signal" in spontaneous speech? Does it convey information?
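When the clean signal and the noise are available separately, as in a simulation, the SNR is the ratio of their average powers expressed in decibels. The sketch below computes it under that assumption. The function names are illustrative, and the code does not address the harder case described above, where the noise energy is unknown and has to be estimated.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in dB, assuming the clean signal and the noise are known separately."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise power is zero; SNR is undefined")
    return 10 * math.log10(mean_power(signal) / noise_power)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demonstration is repeatable
    n = 8000
    # Synthetic stand-ins: a sine tone as "signal", Gaussian samples as "noise".
    signal = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(n)]
    noise = [rng.gauss(0.0, 0.1) for _ in range(n)]
    print(round(snr_db(signal, noise), 1))
```

A signal with ten times the amplitude of the noise has one hundred times the power, which corresponds to 20 dB.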

Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech sections of an audio session. It can avoid unnecessary coding and transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth.
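A very simple form of VAD compares the energy of each frame with a fixed threshold. The sketch below illustrates the idea only: the frame length of 160 samples (20 ms at 8 kHz) and the threshold of -40 dB are assumed values, the function names are hypothetical, and detectors used in practice rely on more robust features than frame energy alone.

```python
import math


def frame_energies(samples, frame_len):
    """Average power of each complete, non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def detect_voice(samples, frame_len=160, threshold_db=-40.0):
    """Return one boolean per frame: True where the frame level exceeds the threshold.

    Energy thresholding is an illustrative choice; both default values are assumptions.
    """
    decisions = []
    for energy in frame_energies(samples, frame_len):
        level_db = 10 * math.log10(energy) if energy > 0 else float("-inf")
        decisions.append(level_db > threshold_db)
    return decisions


if __name__ == "__main__":
    silence = [0.0] * 320
    # A 200 Hz tone at 8 kHz stands in for a stretch of speech.
    tone = [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(320)]
    print(detect_voice(silence + tone + [0.0] * 160))
```

Frames marked False are the ones a coder or recognizer could skip, which is where the savings in computation and bandwidth come from.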

The Lombard effect, or Lombard reflex, is the involuntary tendency of speakers to increase the intensity of their voice when speaking in loud noise, in order to enhance its audibility. The change affects not only loudness but also other acoustic features such as pitch, rate, and the duration of syllables. This compensation results in an increase in the auditory signal-to-noise ratio of the speaker's spoken words. The effect is linked to the needs of effective communication: it is reduced when words are repeated or lists are read, where intelligibility is not important. Since the effect is also involuntary, it is used as a means of detecting malingering in those simulating hearing loss. The effect was discovered in 1909 by Étienne Lombard, a French otolaryngologist.

Health care

Military

High-performance fighter aircraft

Helicopters

Battle management

Training air traffic controllers

Telephony and other domains