
SPEECH PROCESSING BACKGROUND

2.1 Introduction
In order for communication to take place, a speaker must produce a speech signal in the form of a sound pressure wave that travels from the speaker's mouth to a listener's ears. Although the majority of the pressure wave originates from the mouth, sound also emanates from the nostrils, throat, and cheeks. Speech signals are composed of a sequence of sounds that serve as a symbolic representation for a thought that the speaker wishes to relay to the listener. The arrangement of these sounds is governed by rules associated with a language. The scientific study of language and the manner in which these rules are used in human communication is referred to as linguistics. The science that studies the characteristics of human sound production, especially for the description, classification, and transcription of speech is called phonetics.

2.2 Speech Communication
Speech is used to communicate information from a speaker to a listener. Although we focus on the production of speech, hearing is an integral part of the so-called speech chain. Human speech production begins with an idea or thought that the speaker wants to convey to a listener. The speaker conveys this thought through a series of neurological processes and muscular movements that produce an acoustic sound pressure wave, which is received by a listener's auditory system, processed, and converted back to neurological signals. To achieve this, a speaker forms an idea to convey, converts that idea into a linguistic structure by choosing appropriate words or phrases to represent it, orders the words or phrases according to the learned grammatical rules of the particular language, and finally adds any local or global characteristics, such as pitch intonation or stress, to emphasize aspects important for overall meaning. Once this has taken place, the human brain produces a sequence of motor commands that move the various muscles of the vocal system to produce the desired sound pressure wave. This acoustic wave is also received by the talker's own auditory system and converted back to a sequence of neurological pulses that provide the feedback necessary for proper speech production; the talker continuously monitors and controls the vocal organs by receiving his or her own speech as feedback. Any delay in this feedback to our own ears can cause difficulty in proper speech production.

The acoustic wave is also transmitted through a medium, normally air, to a listener's auditory system. The speech perception process begins when the listener collects the sound pressure wave at the outer ear, converts it into neurological pulses at the middle and inner ear, and interprets these pulses in the auditory cortex of the brain to determine what idea was received. In both production and perception, the human auditory system plays an important role in the ability to communicate effectively. One advantage of the auditory system is its selectivity: it permits the listener to hear one individual voice in the presence of several simultaneous talkers, a phenomenon known as the cocktail party effect. We are able to reject competing speech by capitalizing on the phase mismatch between the sound pressure waves arriving at each ear. A disadvantage of the auditory system is its inability to distinguish signals that are closely spaced in time or frequency; when two tones are closely spaced in frequency, one masks the other, resulting in the perception of a single tone. As the speech chain illustrates, there are many interrelationships between production and perception that allow individuals to communicate with one another.

2.2.1 Anatomy of the Speech Production System


The speech waveform is an acoustic sound pressure wave that originates from voluntary movements of the anatomical structures making up the human speech production system. Figure 2.1 portrays a midsagittal section of the speech system, in which we view the anatomy midway through the upper torso as seen from the right side. The gross components of the system are the lungs, trachea (windpipe), larynx (the organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). In technical discussions, the pharyngeal and oral cavities are usually grouped into one unit referred to as the vocal tract, and the nasal cavity is often called the nasal tract. Accordingly, the vocal tract begins at the output of the larynx and terminates at the input to the lips, while the nasal tract begins at the velum and ends at the nostrils of the nose. Finer anatomical features critical to speech production include the vocal folds (vocal cords), soft palate (velum), tongue, teeth, and lips. The soft tip of the velum, which may be seen hanging down at the back of the oral cavity when the mouth is wide open, is called the uvula. These finer anatomical components move to different positions to produce various speech sounds and are known to speech scientists as articulators. The mandible, or jaw, is also considered an articulator, since it is responsible for both gross and fine movements that affect the size and shape of the vocal tract as well as the positions of the other articulators.

Figure 2.1: Schematic diagram of the human speech production mechanism.

The three main cavities of the speech production system (the vocal and nasal tracts) comprise the main acoustic filter. The filter is excited by the organs below it and is loaded at its main output by a radiation impedance due to the lips. The articulators, together with the filter itself, are used to change the properties of the system, its form of excitation, and its output loading over time. These main cavities form the acoustic filter that contributes the resonant structure of human speech. In the average adult male, the total length of the vocal tract is about 17 cm (about 14 cm in the average adult female), and about 10 cm in an average child. The nasal tract constitutes an auxiliary path for the transmission of sound. Acoustic coupling between the nasal and vocal tracts is controlled by the size of the opening at the velum, and nasal coupling can substantially influence the frequency characteristics of the sound radiated from the mouth. If the velum is lowered, the nasal tract is acoustically coupled to produce the nasal sounds of speech. The function of the larynx is to provide periodic excitation to the system for the speech sounds known as voiced.

Figure 2.2: Block diagram of human speech production.

The spectral characteristics of the speech wave are time-varying, or nonstationary, since the physical system changes rapidly over time. As a result, speech can be divided into sound segments that possess similar acoustic properties over short periods of time. Speech sounds are typically partitioned into two broad categories: (1) vowels, which involve no major restriction of airflow through the vocal tract, and (2) consonants, which involve a significant restriction and are therefore weaker in amplitude and often "noisier" than vowels.
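Because speech is only short-time stationary, digital analysis typically divides the waveform into short overlapping frames before any spectral processing. A minimal Python sketch of this framing step; the frame length, hop size, and synthetic input below are illustrative choices, not values from the text:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split signal x into overlapping short-time frames.

    frame_len and hop are in samples; trailing samples that do not
    fill a complete frame are simply dropped (an assumption).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# Example: 1 s of synthetic "audio" at 8 kHz, 25 ms frames, 10 ms hop
fs = 8000
x = np.random.randn(fs)
frames = frame_signal(x, frame_len=int(0.025 * fs), hop=int(0.010 * fs))
print(frames.shape)  # -> (98, 200)
```

Each 200-sample row can then be treated as approximately stationary for spectral analysis.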

2.2.2 Voice Production


A fundamental attribute of any speech sound is its manner of excitation. There are two elemental excitation types: (1) voiced and (2) unvoiced. Voiced sounds are produced by forcing air through the glottis, the opening between the vocal folds. The tension of the vocal cords is adjusted so that they vibrate in an oscillatory fashion. The periodic interruption of the subglottal airflow results in quasi-periodic puffs of air that excite the vocal tract. The sound produced by the larynx is called voice, or phonation. Unvoiced sounds are generated by forming a constriction at some point along the vocal tract and forcing air through the constriction to produce turbulence.

Voicing is accomplished when the abdominal muscles force the diaphragm up, pushing air out of the lungs into the trachea and up to the glottis, where the flow is periodically interrupted by movement of the vocal folds. The repeated opening and closing of the glottis occurs in response to subglottal air pressure from the trachea. The forces responsible for the glottal pulse cycle affect the shape of the glottal waveform and are ultimately related to its spectral characteristics. The subglottal air pressure forces the glottis to open wider and outward, resulting in increased airflow; as the vocal folds spread apart, air velocity increases significantly through the narrow glottis, causing a local drop in air pressure. Therefore, when the vocal folds are closed, air pressure and potential energy are high; as the glottis opens, air velocity and kinetic energy increase while pressure and potential energy decrease. The glottis continues to open until the natural elastic tension of the vocal folds equals the separating force of the air pressure. At this point the glottal opening and the rate of airflow have reached their maxima. The kinetic energy received by the vocal folds during opening is stored as elastic recoil energy, which in turn causes the vocal folds to begin to close. The subglottal pressure and elastic restoring forces during closure cause the cycle to repeat.
The variation in airflow through the glottis results in periodic open and closed phases of the glottal, or source, excitation. The time between successive vocal fold openings is called the fundamental period T0, while the rate of vibration is called the fundamental frequency of phonation,

    F0 = 1 / T0

The fundamental period depends on the size and tension of the speaker's vocal folds. The term pitch is used to refer to the perceived fundamental frequency of a sound, whether or not that frequency component is actually present in the waveform. Speech transmitted over commercial telephone lines, for example, is usually band-limited to about 300-3000 Hz, so the fundamental frequency itself may be absent, yet a pitch is still perceived.
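The relation F0 = 1/T0 can be exercised digitally: a simple way to estimate F0 is to locate the strongest peak of the signal's autocorrelation within a plausible pitch range. A minimal sketch, assuming a 50-400 Hz search range for adult voices and a synthetic test tone (neither comes from the text):

```python
import numpy as np

def estimate_f0(x, fs, fmin=50.0, fmax=400.0):
    """Estimate F0 = 1/T0 from the autocorrelation peak.

    fmin/fmax bound the lag search to plausible adult pitch values
    (an illustrative assumption).
    """
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])   # lag of strongest periodicity
    return fs / lag                   # F0 in Hz

fs = 8000
t = np.arange(fs) / fs                # one second of samples
x = np.sin(2 * np.pi * 125.0 * t)     # synthetic 125 Hz "voicing"
print(estimate_f0(x, fs))             # -> 125.0
```

The chosen test frequency (125 Hz) has a period of exactly 64 samples at 8 kHz, so the lag search recovers it exactly; real speech would give an approximate value.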

Figure 2.3: Time waveform of volume velocity of the glottal source excitation.
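A glottal volume-velocity waveform like the one in Figure 2.3 is often approximated by a Rosenberg-type pulse: a raised-cosine opening phase, a quarter-cosine closing phase, and a closed phase at zero. A sketch under assumed timing fractions (the 40%/16% opening/closing split is illustrative, not a measurement):

```python
import numpy as np

def rosenberg_pulse(n_period, open_frac=0.4, close_frac=0.16):
    """One period of a Rosenberg-type glottal pulse (volume velocity).

    open_frac and close_frac are the fractions of the period spent
    opening and closing; both values are illustrative assumptions.
    """
    n1 = int(open_frac * n_period)                    # opening phase
    n2 = int(close_frac * n_period)                   # closing phase
    g = np.zeros(n_period)
    g[:n1] = 0.5 * (1.0 - np.cos(np.pi * np.arange(n1) / n1))
    g[n1:n1 + n2] = np.cos(np.pi * np.arange(n2) / (2.0 * n2))
    return g                      # remaining samples: closed phase (0)

g = rosenberg_pulse(80)           # one 10 ms period at fs = 8 kHz
```

Concatenating such pulses every T0 samples gives a quasi-periodic source excitation of the kind described above.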

2.2.3 Phonemes
The basic theoretical unit for describing how speech conveys linguistic meaning is called a phoneme. A phoneme can be thought of as an ideal sound unit together with its complete set of corresponding articulatory gestures. The study of these abstract units and their relationships in a language is called phonemics, while the study of the actual sounds of the language is called phonetics.

2.3 Source Filter Model

Figure 2.4: A general discrete-time model for speech production

This model attempts to represent the speech production process in terms of its output signal characteristics. Although the model makes no provision for coupling or nonlinear effects between subsystems, it produces reasonable-quality speech for coding purposes. In this general terminal-analog system, a vocal-tract model H(z) and radiation model R(z) are excited by a discrete-time glottal excitation signal u(n) = u_glottis(n). During unvoiced speech activity, the excitation source is a flat-spectrum noise source modelled by a random noise generator. During periods of voiced speech activity, the excitation uses an estimate of the local pitch period to set an impulse-train generator that drives a glottal pulse-shaping filter G(z). This excitation produces a glottal pulse waveform similar in shape to those discussed above. The excitation model is certainly simplified, since it cannot address phonemes with more than one source of excitation. The vocal-tract transfer function H(z) is modelled as

    H(z) = H0 / ∏_{k=1}^{N} (1 - p_k z^{-1})        (2.1)

where H0 represents an overall gain term and the p_k are the complex pole locations of the N-tube model. In the unvoiced case,
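The resonances (formants) implied by the all-pole form of Eq. (2.1) can be seen by evaluating H(z) on the unit circle, z = e^{jω}. A sketch with illustrative pole locations (the resonance frequencies and 100 Hz bandwidth below are assumptions, not formants measured from real speech):

```python
import numpy as np

fs = 8000
# Assumed resonances near 500, 1500, 2500 Hz with a 100 Hz bandwidth
res_hz = np.array([500.0, 1500.0, 2500.0])
r = np.exp(-np.pi * 100.0 / fs)                 # pole radius from bandwidth
poles = r * np.exp(2j * np.pi * res_hz / fs)
poles = np.concatenate([poles, poles.conj()])   # conjugate pole pairs

def H(omega, H0=1.0):
    """H(z) = H0 / prod_k (1 - p_k z^-1), evaluated at z = exp(j*omega)."""
    z_inv = np.exp(-1j * np.asarray(omega))
    return H0 / np.prod(1.0 - poles[:, None] * z_inv, axis=0)

hz = np.arange(fs // 2)                         # 1 Hz grid up to Nyquist
mag = np.abs(H(2 * np.pi * hz / fs))
print(hz[np.argmax(mag)])                       # peak lands near 500 Hz
```

The magnitude response shows a peak near each assumed resonance, which is exactly the resonant structure the vocal-tract cavities contribute to speech.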

    S(z) = E(z) H(z) R(z)        (2.2)

where E(z) represents a partial realization of a white noise process. In the voiced case,

    S(z) = E(z) G(z) H(z) R(z)        (2.3)

where E(z) represents a discrete-time impulse train of period P, the pitch period of the utterance. In the above, G, H, and R represent the "true" models. Accordingly, the true overall system function is
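Equations (2.2) and (2.3) can be exercised directly as a toy terminal-analog synthesizer: an impulse train or white noise for E(z), an optional glottal shaping filter G(z), an all-pole vocal-tract filter H(z), and a first-difference radiation model R(z). All filter coefficients below are illustrative assumptions, not fitted values, and the helper `filt` is a plain direct-form difference equation:

```python
import numpy as np

def filt(b, a, x):
    """Direct-form IIR/FIR filter; a[0] = 1 is assumed."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        s = sum(b[k] * x[n - k] for k in range(len(b)) if n >= k)
        s -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n >= k)
        y[n] = s
    return y

fs, P, N = 8000, 80, 8000            # P: assumed pitch period (100 Hz)

e_v = np.zeros(N); e_v[::P] = 1.0    # E(z): impulse train (voiced)
e_u = np.random.randn(N)             # E(z): white noise (unvoiced)

g_a = [1.0, -1.8, 0.81]              # G(z): crude glottal low-pass (poles at 0.9)
r, th = 0.96, 2 * np.pi * 500 / fs   # H(z): single assumed resonance at 500 Hz
h_a = [1.0, -2 * r * np.cos(th), r * r]

def synthesize(e, voiced):
    x = filt([1.0], g_a, e) if voiced else e   # glottal shaping G(z)
    x = filt([1.0], h_a, x)                    # vocal tract H(z)
    return filt([1.0, -1.0], [1.0], x)         # radiation R(z) = 1 - z^-1

s_voiced = synthesize(e_v, True)     # Eq. (2.3): E*G*H*R
s_unvoiced = synthesize(e_u, False)  # Eq. (2.2): E*H*R
```

The voiced output is quasi-periodic with period P, while the unvoiced output is noiselike, mirroring the two excitation types described in Section 2.2.2.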

    Θ(z) = S(z) / E(z)        (2.4)

The popularity of an all-pole model of the speech production system, however, arises from the fact that a very powerful and simple computational technique, linear prediction analysis, exists for deriving an all-pole model of a given speech utterance. The goal here has been to provide sufficient background to pursue applications of digital signal processing to speech. Speech is produced through the careful movement and positioning of the vocal-tract articulators in response to an excitation signal that may be periodic at the glottis, noiselike due to a major constriction along the vocal tract, or a combination of the two.
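The linear prediction analysis mentioned above can be sketched with the autocorrelation method solved by the Levinson-Durbin recursion. A minimal Python version; the sanity check fits a known second-order all-pole filter, and the test signal and tolerances are my own assumptions:

```python
import numpy as np

def lpc(x, order):
    """All-pole fit by the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[0..N-1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                          # prediction error
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]         # order update
        err *= 1.0 - k * k
    return a, err

# Sanity check: generate data from a known all-pole filter and verify
# that its coefficients are approximately recovered.
rng = np.random.default_rng(0)
true_a = [1.0, -1.6, 0.8]              # stable pole pair inside unit circle
e = rng.standard_normal(20000)
x = np.zeros(20000)
for n in range(len(x)):
    x[n] = e[n]
    if n >= 1: x[n] -= true_a[1] * x[n - 1]
    if n >= 2: x[n] -= true_a[2] * x[n - 2]
a_est, _ = lpc(x, order=2)
print(np.round(a_est, 2))              # close to [1.0, -1.6, 0.8]
```

Applied frame by frame to real speech, the same recursion yields the time-varying all-pole vocal-tract model of Eq. (2.1).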

Phonemes, the basic units of speech, can be classified in terms of place of articulation, manner of articulation, or spectral characteristics. Each phoneme has a distinct set of features that may or may not require major articulator movement for proper sound production. Language and grammatical structure, prosodic features, and coarticulation are all employed during the production of speech.
