
Speech compression

The aim of speech compression is to produce a compact representation of speech sounds such that, when reconstructed, they are perceived to be close to the original. The two main measures of closeness are intelligibility and naturalness. The standard reference point is toll quality speech: the quality expected over a telephone line, for example speech sampled at 8 kHz using 8-bit mu-law coding, with a maximum frequency of about 3.3 kHz. This is a bit rate of 64 kbps, and as such already represents a compressed form relative to (say) the 16-bit, 16 kHz speech that is the standard in speech recognition work. Mu-law coding does not exploit the (normally large) sample-to-sample correlations found in speech. ADPCM, the next family of speech coding techniques, does exploit this redundancy by using a simple linear filter to predict the next sample of speech. The resulting prediction error is typically quantised to 4 bits, giving a bit rate of 32 kbps (see, for example, the software in Q3.3: 32 kbps ADPCM, G.711/721/723 Compression, shorten). The advantages of ADPCM are that it is simple to implement and has very low delay.

To obtain more compression, specific properties of the speech signal must be modelled. The main assumption is known as the source-filter model of speech production: a source (voicing or fricative excitation) is passed through a filter (the vocal tract response) to produce the speech. The simplest implementation of this is known as an LPC synthesiser (e.g. LPC10e). At every frame the speech is analysed to compute the filter coefficients, the energy of the excitation, a voicing decision, and a pitch value if voiced. At the decoder, a regular train of pulses (for voiced speech) or white noise (for unvoiced speech) is passed through the linear filter and multiplied by the gain to produce the speech. This is a very efficient system and typically produces speech coded at 1200-2400 bps. With clever acoustic vector prediction this can be reduced to 300-600 bps. The disadvantages are a loss of naturalness over most of the speech and occasionally a loss of intelligibility.

The CELP family of coders compensates for the lack of quality of the simple LPC model by using more information in the excitation. Each vector in a codebook of excitation vectors is tried, and the index of the one that best matches the original speech is transmitted. This raises the bit rate to typically 4800-9600 bps. Most speech coding research is currently directed towards CELP coders. (See, for example, CELP 3.2a, a TMS implementation, a G.728 LD-CELP vocoder, and the L&H implementation.)
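To make the predict-and-quantise idea behind ADPCM concrete, here is a minimal sketch. The first-order predictor and the fixed step size are simplifications of my own; real coders such as G.726 adapt both. At 8000 samples/sec, the 4-bit codes correspond to the 32 kbps figure above.

```python
import numpy as np

def adpcm_sketch(samples, step=128):
    """Minimal ADPCM-style coder: predict each sample from the previous
    decoded sample, quantise the prediction error to 4 bits (-8..7),
    and reconstruct the way the decoder would."""
    decoded_prev = 0
    codes, decoded = [], []
    for s in samples:
        error = s - decoded_prev                         # prediction residual
        code = int(np.clip(round(error / step), -8, 7))  # 4-bit quantiser
        codes.append(code)
        decoded_prev = decoded_prev + code * step        # decoder's state
        decoded.append(decoded_prev)
    return codes, decoded

# A slowly varying signal has small residuals and compresses well:
t = np.arange(160)
codes, decoded = adpcm_sketch((3000 * np.sin(2 * np.pi * 300 * t / 8000)).astype(int))
```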
Speech compression involves the compression of audio data in the form of speech. Speech is a somewhat unique form of audio data, with a number of needs which must be addressed during compression to ensure that it will be intelligible and reasonably pleasant to listen to. A number of software programs have been designed specifically with speech compression in mind, including programs which can perform additional functions such as encrypting the compressed data for security.

Raw audio data can take up a great deal of memory. During compression, the data is encoded so that it occupies less space. This frees up room in storage, and it also becomes important when data is being transmitted over a network. On a mobile phone network, for example, if speech compression is used, more users can be accommodated at a given time because less bandwidth is needed. Likewise, speech compression becomes important with teleconferencing and other applications; sending data is expensive, and anything which reduces the volume of data which needs to be sent can help to cut costs. Speech is a relatively simple and widely studied type of audio data, which makes it easy to compress in some ways. However, it is important to ensure that compression retains the integrity of the speech. If the data becomes distorted in some way, it can be difficult to understand, and it can also be hard to listen to. Thus, speech compression needs to be performed in a way which retains the key qualities of the data. It is easy for speech to sound "wrong" to a listener, interfering with understanding of the transmitted data. Programs which handle the creation of audio files may have a compression option available. After recording or generating the raw audio file, people can choose between a number of parameters to get the file compressed to a more manageable size. Speech compression can also be done on the fly, as when people use cell phones and the network compresses the data while generating a data signal so that people can talk in real time.
If the data also needs to be encrypted, this may be done in real time or in a second pass which encrypts the compressed data. In this case, someone who wants to hear the speech will need to decrypt the data and run it through a program capable of reading compressed data, which may be embedded into a piece of equipment such as a secured phone.

The compression of speech signals has many practical applications. One example is in digital cellular technology where many users share the same frequency bandwidth. Compression allows more users to share the system than otherwise possible. Another example is in digital voice storage (e.g. answering machines). For a given memory size, compression allows longer messages to be stored than otherwise. Historically, digital speech signals are sampled at a rate of 8000 samples/sec. Typically, each sample is represented by 8 bits (using mu-law). This corresponds to an uncompressed rate of 64 kbps (kbits/sec). With current compression techniques (all of which are lossy), it is possible to reduce the rate to 8 kbps with almost no perceptible loss in quality. Further compression is possible at a cost of lower quality. All of the current low-rate speech coders are based on the principle of linear predictive coding (LPC) which is presented in the following sections.

LPC Modeling
A. Physical Model:

When you speak:
• Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open.
• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
• The amount of air coming from your lungs determines the loudness of your voice.

B. Mathematical Model:

• The above model is often called the LPC Model.
• The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white noise sequence.
• The relationship between the physical and the mathematical models:

    Vocal tract                  ->  LPC filter
    Air                          ->  Innovations
    Vocal cord vibration         ->  Voiced excitation
    Vocal cord vibration period  ->  Pitch period
    Fricatives and plosives      ->  Unvoiced excitation
    Air volume                   ->  Gain

The LPC filter is given by:

    H(z) = 1 / (1 - a_1 z^-1 - a_2 z^-2 - ... - a_10 z^-10)

which is equivalent to saying that the input-output relationship of the filter is given by the linear difference equation:

    s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_10 s(n-10) + u(n)

where s(n) is the speech signal and u(n) is the innovation (excitation) input.

The LPC model can be represented in vector form as:

    A = (a_1, ..., a_10, G, V/UV, T)

where G is the gain, V/UV is the voicing decision, and T is the pitch period.
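Since synthesis from A is just filtering, the decoder fits in a few lines. A minimal sketch (the coefficient, gain, and pitch values below are placeholders; the 160-sample frame corresponds to 20 msec at 8000 samples/sec, as quantified below):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a, gain, voiced, pitch_period, n=160):
    """Generate one frame from the LPC parameters A = (a_1..a_10, G, V/UV, T):
    pass the excitation through the all-pole LPC filter H(z)."""
    if voiced:
        u = np.zeros(n)
        u[::pitch_period] = 1.0          # impulse train at the pitch period
    else:
        u = np.random.randn(n)           # white noise for unvoiced frames
    # H(z) = 1 / (1 - a_1 z^-1 - ... - a_10 z^-10), scaled by the gain
    return gain * lfilter([1.0], np.concatenate(([1.0], -np.asarray(a))), u)

frame = lpc_synthesize(a=[1.3, -0.6] + [0.0] * 8, gain=0.5,
                       voiced=True, pitch_period=60)
```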

• The digital speech signal is divided into frames of size 20 msec; there are 50 frames/second.
• At a sampling rate of 8000 samples/sec, 20 msec is equivalent to 160 samples.
• The model says that s = (s(1), ..., s(160)) is equivalent to A; thus the 160 values of s are compactly represented by the 13 values of A.
• A changes every 20 msec or so.
• There is almost no perceptual difference in s if:
  o For Voiced Sounds (V): the impulse train is shifted (the ear is insensitive to this phase change).
  o For Unvoiced Sounds (UV): a different white noise sequence is used.
• LPC Synthesis: Given A, generate s (this is done using standard filtering techniques).
• LPC Analysis: Given s, find the best A (this is described in the next section).

LPC Analysis

• Consider one frame of speech signal: s = (s(1), ..., s(160)).
• The signal is related to the innovation through the linear difference equation:

    s(n) = a_1 s(n-1) + ... + a_10 s(n-10) + u(n)

• The ten LPC parameters a_1, ..., a_10 are chosen to minimize the energy of the innovation u.

• Using standard calculus, we take the derivative of the innovation energy with respect to each a_k and set it to zero:

    d/da_k [ Σ_n u(n)^2 ] = 0,   k = 1, ..., 10

• We now have 10 linear equations with 10 unknowns:

    Σ_{i=1}^{10} a_i R(|i - k|) = R(k),   k = 1, ..., 10

where R(k) = Σ_n s(n) s(n-k) is the autocorrelation of the frame.
• The above matrix equation could be solved using:
  o The Gaussian elimination method.
  o Any matrix inversion method (e.g. MATLAB).
  o The Levinson-Durbin recursion (described below).
• Levinson-Durbin Recursion: an order-recursive solver that exploits the Toeplitz structure of the autocorrelation matrix, solving the equations in O(p^2) operations.
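The recursion is compact enough to show in full. A sketch in Python (variable names are mine; a production version would guard against a zero prediction-error energy):

```python
import numpy as np

def levinson_durbin(r, order=10):
    """Solve the Toeplitz normal equations sum_i a_i R(|i-k|) = R(k)
    for the LPC coefficients a_1..a_order, given autocorrelations
    r = [R(0), R(1), ..., R(order)]."""
    a = np.zeros(order)
    err = r[0]                                    # prediction error energy
    for k in range(order):
        # reflection coefficient for order k+1
        acc = r[k + 1] - np.dot(a[:k], r[k::-1][:k])
        refl = acc / err
        # order-update of the coefficient vector
        a_new = a.copy()
        a_new[k] = refl
        a_new[:k] = a[:k] - refl * a[:k][::-1]
        a = a_new
        err *= (1.0 - refl ** 2)                  # shrinking error energy
    return a, err

# Example: autocorrelations of one 160-sample frame s
# r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(11)])
# a, err = levinson_durbin(r)
```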

2.4 kbps LPC Vocoder

• The following is a block diagram of a 2.4 kbps LPC Vocoder: (block diagram not reproduced)
• Solve the normal equations above for the LPC coefficients a_1, ..., a_10.
• To get the other three parameters (G, V/UV, T), we solve for the innovation:

    u(n) = s(n) - a_1 s(n-1) - ... - a_10 s(n-10)

• Then calculate the autocorrelation of u(n).
• Then make a voiced/unvoiced decision based on the autocorrelation, and set the pitch period and gain accordingly.
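A sketch of how the residual's autocorrelation can drive these decisions; the 0.3 voicing threshold is an illustrative choice of mine, not a value from the source:

```python
import numpy as np

def voicing_and_pitch(u, fs=8000, fmin=55, fmax=400, threshold=0.3):
    """Decide V/UV and estimate the pitch period T from the residual's
    autocorrelation: a strong peak at lag T suggests voiced speech."""
    r = np.correlate(u, u, mode="full")[len(u) - 1:]  # r[k] = sum u(n)u(n-k)
    lo, hi = fs // fmax, fs // fmin                   # plausible pitch lags
    peak = lo + int(np.argmax(r[lo:hi]))
    voiced = r[peak] / r[0] > threshold               # normalized peak height
    gain = np.sqrt(r[0] / len(u))                     # RMS energy -> gain G
    return voiced, peak, gain
```

Note that with fs = 8000, the searched lags run from 20 to 145 samples, matching the 20-146 pitch range quoted below.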

• The LPC coefficients are represented as line spectrum pair (LSP) parameters. LSP are mathematically equivalent (one-to-one) to LPC.
• LSP are calculated as follows: factoring A(z) = 1 - a_1 z^-1 - ... - a_10 z^-10 into the symmetric and antisymmetric polynomials

    P(z) = A(z) + z^-11 A(z^-1)
    Q(z) = A(z) - z^-11 A(z^-1)

we get roots on the unit circle whose angles w_1, ..., w_10 are called the LSP parameters.
• LSP are ordered and bounded: 0 < w_1 < w_2 < ... < w_10 < pi.
• LSP are more correlated from one frame to the next than LPC, which makes them more amenable to quantization.
• The frame size is 20 msec. There are 50 frames/sec, so 2400 bps is equivalent to 48 bits/frame. These bits are allocated as follows: (allocation table not reproduced)
• The 34 bits for the LSP are allocated as follows: (allocation table not reproduced)

• The gain G is encoded using a 7-bit non-uniform scalar quantizer (a 1-dimensional vector quantizer).
• The pitch period T ranges from 20 to 146. For voiced speech, the values of T and the voicing decision are jointly encoded.

4.8 kbps CELP Coder

• CELP = Code-Excited Linear Prediction.
• The principle is similar to the LPC Vocoder except:
  o Frame size is 30 msec (240 samples)
  o The innovation is coded directly
  o More bits are needed
  o Computationally more complex
  o A pitch prediction filter is included
  o The vector quantization concept is used
• A block diagram of the CELP encoder is shown below: (block diagram not reproduced)

• The pitch prediction filter is given (in one common form) by:

    P(z) = 1 / (1 - b z^-T)

where the lag T could be an integer or a fraction thereof.
• The perceptual weighting filter is given by:

    W(z) = A(z/γ1) / A(z/γ2)

where the constants γ1 and γ2 have been determined to be good choices.
• Each frame is divided into 4 subframes. In each subframe, the codebook contains 512 codevectors. The gain is quantized using 5 bits per subframe.
• The LSP parameters are quantized using 34 bits, similar to the LPC Vocoder.
• At 30 msec per frame, 4.8 kbps is equivalent to 144 bits/frame. These 144 bits are allocated as follows: (allocation table not reproduced)

8.0 kbps CS-ACELP

• CS-ACELP = Conjugate-Structured Algebraic CELP.
• The principle is similar to the 4.8 kbps CELP Coder except:
  o Frame size is 10 msec (80 samples)
  o There are only two subframes, each of which is 5 msec (40 samples)
  o The LSP parameters are encoded using two-stage vector quantization
  o The gains are also encoded using vector quantization
• At 10 msec per frame, 8 kbps is equivalent to 80 bits/frame. These 80 bits are allocated as follows: (allocation table not reproduced)

Adaptive filter

An adaptive filter is a filter that self-adjusts its transfer function according to an optimizing algorithm. Because of the complexity of the optimizing algorithms, most adaptive filters are digital filters that perform digital signal processing and adapt their performance based on the input signal. By way of contrast, a non-adaptive filter has static filter coefficients (which collectively form the transfer function). For some applications, adaptive coefficients are required since some parameters of the desired processing operation (for instance, the properties of some noise signal) are not known in advance. In these situations it is common to employ an adaptive filter, which uses feedback to refine the values of the filter coefficients and hence its frequency response.

Generally speaking, the adapting process involves the use of a cost function, which is a criterion for the optimum performance of the filter (for example, minimizing the noise component of the input), to feed an algorithm which determines how to modify the filter coefficients to minimize the cost on the next iteration. As the power of digital signal processors has increased, adaptive filters have become much more common and are now routinely used in devices such as mobile phones and other communication devices, camcorders and digital cameras, and medical monitoring equipment.

Example

Suppose a hospital is recording a heart beat (an ECG) which is being corrupted by a 50 Hz noise (the frequency coming from the power supply in many countries). One way to remove the noise is to filter the signal with a notch filter at 50 Hz. However, due to slight variations in the power supply to the hospital, the exact frequency of the power supply might (hypothetically) wander between 47 Hz and 53 Hz. A static filter would need to remove all the frequencies between 47 and 53 Hz, which could excessively degrade the quality of the ECG since the heart beat would also likely have frequency components in the rejected range.

To circumvent this potential loss of information, an adaptive filter could be used. The adaptive filter would take input both from the patient and from the power supply directly, and would thus be able to track the actual frequency of the noise as it fluctuates. Such an adaptive technique generally allows for a filter with a smaller rejection range, which means, in our case, that the quality of the output signal is more accurate for medical diagnoses.

Block diagram

The block diagram, shown in the following figure, serves as a foundation for particular adaptive filter realisations, such as Least Mean Squares (LMS) and Recursive Least Squares (RLS). The idea behind the block diagram is that a variable filter extracts an estimate of the desired signal. To start the discussion of the block diagram we take the following assumptions:

• The input signal is the sum of a desired signal d(n) and interfering noise v(n):

    x(n) = d(n) + v(n)

• The variable filter has a Finite Impulse Response (FIR) structure. For such structures the impulse response is equal to the filter coefficients. The coefficients for a filter of order p are defined as

    w_n = [w_n(0), w_n(1), ..., w_n(p)]^T

• The error signal or cost function is the difference between the desired and the estimated signal:

    e(n) = d(n) - d̂(n)

The variable filter estimates the desired signal by convolving the input signal with the impulse response. In vector notation this is expressed as

    d̂(n) = w_n^T x(n)

where

    x(n) = [x(n), x(n-1), ..., x(n-p)]^T

is an input signal vector. Moreover, the variable filter updates the filter coefficients at every time instant:

    w_{n+1} = w_n + Δw_n

where Δw_n is a correction factor for the filter coefficients. The adaptive algorithm generates this correction factor based on the input and error signals. LMS and RLS define two different coefficient update algorithms.

Applications of adaptive filters

• Noise cancellation
• Signal prediction
• Adaptive feedback cancellation
• Echo cancellation

Filter implementations

• Least mean squares filter
• Recursive least squares filter

Noise cancellation

Active noise control (ANC) (also known as noise cancellation, active noise reduction (ANR) or antinoise) is a method for reducing unwanted sound. The active methods differ from passive noise control methods (soundproofing) in that a powered system is involved, rather than unpowered methods such as insulation, sound-absorbing ceiling tiles or mufflers.

Sound is a pressure wave, which consists of a compression phase and a rarefaction phase. A noise-cancellation speaker emits a sound wave with the same amplitude but with inverted phase (also known as antiphase) to the original sound. The waves combine to form a new wave, in a process called interference, and effectively cancel each other out, an effect which is called phase cancellation. Depending on the circumstances and the method used, the resulting soundwave may be so faint as to be inaudible to human ears.

Modern active noise control is achieved through the use of a computer, which analyzes the waveform of the background aural or nonaural noise, then generates a signal with a reversed waveform to cancel it out by interference. This waveform has identical or directly proportional amplitude to the waveform of the original noise, but its signal is inverted. This creates the destructive interference that reduces the amplitude of the perceived noise.

A noise-cancellation speaker may be co-located with the sound source to be attenuated. In this case it must have the same audio power level as the source of the unwanted sound. Alternatively, the transducer emitting the cancellation signal may be located at the location where sound attenuation is wanted (e.g. the user's ear). This requires a much lower power level for cancellation but is effective only for a single user. Noise cancellation at other locations is more difficult, as the three-dimensional wavefronts of the unwanted sound and the cancellation signal could match and create alternating zones of constructive and destructive interference. In small enclosed spaces (e.g. the passenger compartment of a car) such global cancellation can be achieved via multiple speakers and feedback microphones, and measurement of the modal responses of the enclosure.

The advantages of active noise control methods compared to passive ones are that they are generally:
• More effective at low frequencies.
• Less bulky.
• Able to block noise selectively.

Linear prediction

Linear prediction is a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples. In digital signal processing, linear prediction is often called linear predictive coding (LPC) and can thus be viewed as a subset of filter theory. In system analysis (a subfield of mathematics), linear prediction can be viewed as a part of mathematical modelling or optimization.

The prediction model

The most common representation is

    x̂(n) = Σ_{i=1}^{p} a_i x(n-i)

where x̂(n) is the predicted signal value, x(n-i) the previous observed values, and a_i the predictor coefficients. The error generated by this estimate is

    e(n) = x(n) - x̂(n)

where x(n) is the true signal value. These equations are valid for all types of (one-dimensional) linear prediction. The differences are found in the way the parameters a_i are chosen. For multi-dimensional signals the error metric is often defined as

    e(n) = ||x(n) - x̂(n)||

where ||.|| is a suitably chosen vector norm.

Adaptive feedback cancellation

Adaptive feedback cancellation is a common method of cancelling audio feedback in a variety of electro-acoustic systems such as digital hearing aids. There is a difference between system identification and feedback cancellation. In feedback cancellation, the error between the desired and the actual output is taken and given as feedback to the adaptive processor for adjusting its coefficients to minimize the error. The time-varying acoustic feedback leakage paths can only be eliminated with adaptive feedback cancellation. When an electroacoustic system with an adaptive feedback canceller is presented with a correlated input signal, entrainment, a recurrent distortion artifact, is generated. Adaptive feedback cancellation has its application in echo cancellation.[1][2]

Echo cancellation

The term echo cancellation is used in telephony to describe the process of removing echo from a voice communication in order to improve voice quality on a telephone call. In addition to improving subjective quality, this process increases the capacity achieved through silence suppression by preventing echo from traveling across a network.

Two sources of echo have primary relevance in telephony: acoustic echo and hybrid echo. Echo cancellation involves first recognizing the originally transmitted signal that re-appears, with some delay, in the transmitted or received signal. Once the echo is recognized, it can be removed by 'subtracting' it from the transmitted or received signal. This technique is generally implemented using a digital signal processor (DSP), but can also be implemented in software. Echo cancellation is done using either echo suppressors or echo cancellers, or in some cases both.

The Acoustic Echo Cancellation (AEC) process works as follows:
1. A far-end signal is delivered to the system.
2. The far-end signal is reproduced by the speaker in the room.
3. A microphone also in the room picks up the resulting direct path sound, and consequent reverberant sound, as a near-end signal.
4. The far-end signal is filtered and delayed to resemble the near-end signal.
5. The filtered far-end signal is subtracted from the near-end signal.
6. The resultant signal represents sounds present in the room excluding any direct or reverberated sound produced by the speaker.
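As a concrete illustration, here is a toy acoustic echo canceller. It uses the normalized LMS (NLMS) update, a step-size-robust variant of the LMS filter described next, and a synthetic echo path of my own choosing:

```python
import numpy as np

def cancel_echo(far, mic, taps=128, mu=0.5, eps=1e-8):
    """Toy AEC: adaptively filter the far-end signal to resemble its echo
    in the microphone signal (step 4), then subtract it (step 5)."""
    w = np.zeros(taps)                        # estimated echo path
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far[n - taps:n][::-1]             # recent far-end samples
        echo_hat = w @ x                      # filtered far-end signal
        out[n] = mic[n] - echo_hat            # residual after subtraction
        w += mu * out[n] * x / (x @ x + eps)  # NLMS coefficient update
    return out

# Synthetic demo: the mic hears a delayed, attenuated copy of the far end
far = np.random.randn(8000)
mic = 0.6 * np.concatenate((np.zeros(10), far[:-10]))
residual = cancel_echo(far, mic)              # residual echo decays to ~0
```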

. and it is minimized by the LMS. This cost function (C(n)) is the mean square error.Most linear adaptive filtering problems can be formulated using the block diagram above. but y(n). v(n) and h(n) are not directly observable. That is. definition of symbols d(n) = y(n) + ν(n) Idea The idea behind LMS filters is to use steepest descent to find filter weights minimize a cost function.} denotes the expected value. Its solution is closely related to the Wiener filter. d(n) and e(n). Applying steepest descent means to take the partial derivatives with respect to the individual entries of the filter coefficient (weight) vector where is the gradient operator. We start by defining the cost function as which where e(n) is the error at the current sample 'n' and E{. while using only observable signals x(n). an unknown system is to be identified and the adaptive filter attempts to adapt the filter to make it as close as possible to . This is where the LMS gets its name.

To the minimum of the cost function we need to take a step in the opposite direction of express that in mathematical terms where is the step size(adaptation constant). To find . Instead. Generally. Simplifications For most systems the expectation function done with the following unbiased estimator must be approximated. this algorithm is not realizable until we know . The simplest case is N = 1 For that simple case the update algorithm follows as Indeed this constitutes the update algorithm for the LMS filter. That means we have found a sequential update algorithm which minimizes the cost function. Unfortunately. This can be where N indicates the number of samples we use for that estimate. is a vector which points towards the steepest ascent of the cost function. the expectation above is not computed. to run the LMS in an online (updating after each new sample is received) environment.Now. we use an instantaneous estimate of that expectation. See below. LMS algorithm summary The LMS algorithm for a pth order algorithm can be summarized as Parameters p = filter order : .

μ = step size Initialisatio n: Computati For n = 0. This in contrast to other algorithms such as the least mean squares (LMS) that aim to reduce the mean square error.1. However. the RLS can be used to solve any problem that can be solved by adaptive filters. Compared to most of its competitors. this benefit comes at the cost of high computational complexity. : . Recursive least squares filter The Recursive least squares (RLS) adaptive filter is an algorithm which recursively finds the filter coefficients that minimize a weighted linear least squares cost function relating to the input signals. and potentially poor tracking performance when the filter to be estimated (the "true system") changes. on: where denotes the Hermitian transpose of . We will attempt to recover the desired signal d(n) by use of a p-tap FIR filter. In the derivation of the RLS. while for the LMS and similar algorithm they are considered stochastic.. Motivation In general.2. suppose that a signal d(n) is transmitted over an echoey. For example.. the input signals are considered deterministic. the RLS exhibits extremely fast convergence. noisy channel that causes it to be received as where v(n) represents additive noise..
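For real-valued signals the conjugates drop out, and the summary transcribes directly into NumPy. A sketch:

```python
import numpy as np

def lms_filter(x, d, p=8, mu=0.05):
    """p-th order LMS filter: returns coefficient estimates h and errors e.
    For real signals e*(n) = e(n), so the update is h += mu * e * x_vec."""
    h = np.zeros(p + 1)                     # initialisation: h(0) = 0
    e = np.zeros(len(x))
    for n in range(p, len(x)):
        x_vec = x[n - p:n + 1][::-1]        # [x(n), x(n-1), ..., x(n-p)]
        e[n] = d[n] - h @ x_vec             # error against desired signal
        h += mu * e[n] * x_vec              # steepest-descent update
    return h, e

# Identify an unknown 3-tap system from noisy observations:
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
d = np.convolve(x, [0.5, -0.3, 0.1])[:len(x)] + 0.01 * rng.standard_normal(len(x))
h, e = lms_filter(x, d)                     # h -> [0.5, -0.3, 0.1, 0, ...]
```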

Recursive least squares filter

The Recursive least squares (RLS) adaptive filter is an algorithm which recursively finds the filter coefficients that minimize a weighted linear least squares cost function relating to the input signals. This is in contrast to other algorithms, such as the least mean squares (LMS), that aim to reduce the mean square error. In the derivation of the RLS, the input signals are considered deterministic, while for the LMS and similar algorithms they are considered stochastic. Compared to most of its competitors, the RLS exhibits extremely fast convergence. However, this benefit comes at the cost of high computational complexity, and potentially poor tracking performance when the filter to be estimated (the "true system") changes.

Motivation

In general, the RLS can be used to solve any problem that can be solved by adaptive filters. For example, suppose that a signal d(n) is transmitted over an echoey, noisy channel that causes it to be received as

    x(n) = Σ_k b_n(k) d(n-k) + v(n)

where v(n) represents additive noise. We will attempt to recover the desired signal d(n) by use of a p-tap FIR filter w:

    d̂(n) = Σ_{k=0}^{p} w(k) x(n-k) = w^T x(n)

where x(n) = [x(n), x(n-1), ..., x(n-p)]^T is the vector containing the p most recent samples of x(n). Our goal is to estimate the parameters of the filter w, and at each time n we refer to the new least squares estimate by w_n. As time evolves, we would like to avoid completely redoing the least squares algorithm to find the new estimate w_n in terms of w_{n-1}.

The benefit of the RLS algorithm is that there is no need to invert matrices, thereby saving computational power. Another advantage is that it provides intuition behind such results as the Kalman filter.

Discussion

The idea behind RLS filters is to minimize a cost function C by appropriately selecting the filter coefficients w_n, updating the filter as new data arrives. The error signal e(n) and desired signal d(n) are defined in the negative feedback diagram below: (diagram not reproduced)

The error implicitly depends on the filter coefficients through the estimate d̂(n):

    e(n) = d(n) - d̂(n)

The weighted least squares error function C (the cost function we desire to minimize), being a function of e(n), is therefore also dependent on the filter coefficients:

    C(w_n) = Σ_{i=0}^{n} λ^(n-i) e^2(i)

where 0 < λ ≤ 1 is the "forgetting factor" which gives exponentially less weight to older error samples.

The cost function is minimized by taking the partial derivatives for all entries k of the coefficient vector and setting the results to zero:

    ∂C(w_n)/∂w_n(k) = -2 Σ_{i=0}^{n} λ^(n-i) e(i) x(i-k) = 0,   k = 0, 1, ..., p

Next, replace e(n) with the definition of the error signal:

    Σ_{i=0}^{n} λ^(n-i) [ d(i) - Σ_{l=0}^{p} w_n(l) x(i-l) ] x(i-k) = 0

Rearranging the equation yields

    Σ_{l=0}^{p} w_n(l) [ Σ_{i=0}^{n} λ^(n-i) x(i-l) x(i-k) ] = Σ_{i=0}^{n} λ^(n-i) d(i) x(i-k)

This form can be expressed in terms of matrices as

    R_x(n) w_n = r_dx(n)

where R_x(n) is the weighted sample correlation matrix for x(n), and r_dx(n) is the equivalent estimate for the cross-correlation between d(n) and x(n). Based on this expression we find the coefficients which minimize the cost function as

    w_n = R_x^{-1}(n) r_dx(n)

This is the main result of the discussion.

Choosing λ

The smaller λ is, the smaller the contribution of previous samples. This makes the filter more sensitive to recent samples, which means more fluctuations in the filter coefficients. The λ = 1 case is referred to as the growing window RLS algorithm.

Recursive algorithm

The discussion resulted in a single equation to determine the coefficient vector which minimizes the cost function. In this section we want to derive a recursive solution of the form

    w_n = w_{n-1} + Δw_{n-1}

where Δw_{n-1} is a correction factor at time n-1. We start the derivation of the recursive algorithm by expressing the cross-correlation r_dx(n) in terms of r_dx(n-1):

    r_dx(n) = λ r_dx(n-1) + d(n) x(n)

where x(n) is the (p+1)-dimensional data vector

    x(n) = [x(n), x(n-1), ..., x(n-p)]^T

Similarly we express R_x(n) in terms of R_x(n-1) by

    R_x(n) = λ R_x(n-1) + x(n) x^T(n)

In order to generate the coefficient vector we are interested in the inverse of the deterministic autocorrelation matrix. For that task the Woodbury matrix identity comes in handy. With

    A = λ R_x(n-1)    which is (p+1)-by-(p+1)
    U = x(n)          which is (p+1)-by-1
    V = x^T(n)        which is 1-by-(p+1)
    C = I_1           the 1-by-1 identity matrix

the Woodbury matrix identity follows:

    R_x^{-1}(n) = λ^{-1} R_x^{-1}(n-1)
                  - λ^{-1} R_x^{-1}(n-1) x(n) { 1 + x^T(n) λ^{-1} R_x^{-1}(n-1) x(n) }^{-1} x^T(n) λ^{-1} R_x^{-1}(n-1)

To come in line with the standard literature, we define

    P(n) = R_x^{-1}(n)
         = λ^{-1} P(n-1) - g(n) x^T(n) λ^{-1} P(n-1)

where the gain vector g(n) is

    g(n) = λ^{-1} P(n-1) x(n) { 1 + x^T(n) λ^{-1} P(n-1) x(n) }^{-1}

Before we move on, it is necessary to bring g(n) into another form:

    g(n) { 1 + x^T(n) λ^{-1} P(n-1) x(n) } = λ^{-1} P(n-1) x(n)
    g(n) + g(n) x^T(n) λ^{-1} P(n-1) x(n) = λ^{-1} P(n-1) x(n)

Subtracting the second term on the left side yields

    g(n) = λ^{-1} P(n-1) x(n) - g(n) x^T(n) λ^{-1} P(n-1) x(n)
         = λ^{-1} [ P(n-1) - g(n) x^T(n) P(n-1) ] x(n)

With the recursive definition of P(n) the desired form follows:

    g(n) = P(n) x(n)

Now we are ready to complete the recursion. As discussed,

    w_n = P(n) r_dx(n) = λ P(n) r_dx(n-1) + d(n) P(n) x(n)

The second step follows from the recursive definition of r_dx(n). Next we incorporate the recursive definition of P(n) together with the alternate form of g(n), and get

    w_n = w_{n-1} + g(n) [ d(n) - x^T(n) w_{n-1} ]

With α(n) = d(n) - x^T(n) w_{n-1} we arrive at the update equation

    w_n = w_{n-1} + α(n) g(n)

where α(n) is the a priori error. Compare this with the a posteriori error, the error calculated after the filter is updated:

    e(n) = d(n) - x^T(n) w_n

That means we found the correction factor

    Δw_{n-1} = α(n) g(n)

This intuitively satisfying result indicates that the correction factor is directly proportional to both the error and the gain vector, which controls how much sensitivity is desired, through the weighting factor λ.

RLS algorithm summary

The RLS algorithm for a p-th order RLS filter can be summarized as:

    Parameters:     p = filter order, λ = forgetting factor, δ = value to initialize P(0)
    Initialization: w(0) = 0, P(0) = δ^{-1} I, where I is the (p+1)-by-(p+1) identity matrix
    Computation:    for n = 1, 2, ...
                        x(n) = [x(n), x(n-1), ..., x(n-p)]^T
                        α(n) = d(n) - w^T(n-1) x(n)
                        g(n) = P(n-1) x(n) { λ + x^T(n) P(n-1) x(n) }^{-1}
                        P(n) = λ^{-1} P(n-1) - g(n) x^T(n) λ^{-1} P(n-1)
                        w(n) = w(n-1) + α(n) g(n)

Note that the recursion for P follows a Riccati equation and thus draws parallels to the Kalman filter.
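A direct NumPy transcription of this summary (the λ and δ values are illustrative):

```python
import numpy as np

def rls_filter(x, d, p=8, lam=0.99, delta=0.01):
    """p-th order RLS filter following the summary above:
    P(0) = delta^{-1} I, w(0) = 0, then update g, P and w per sample."""
    w = np.zeros(p + 1)
    P = np.eye(p + 1) / delta
    for n in range(p, len(x)):
        x_vec = x[n - p:n + 1][::-1]            # [x(n), ..., x(n-p)]
        alpha = d[n] - w @ x_vec                # a priori error
        Px = P @ x_vec
        g = Px / (lam + x_vec @ Px)             # gain vector
        P = (P - np.outer(g, x_vec) @ P) / lam  # Riccati-style update of P
        w = w + alpha * g                       # coefficient update
    return w
```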

. That some sounds are intrinsically musical. is an oversimplification. From the tinkle of a bell to the slam of a door. and cultural considerations.λ = forgetting factor δ = value to initialize Initializatio n: where I is the (p + 1)-by-(p + 1) identity matrix Computati For on: . This article will analyze those involved in Western musical traditions. any tone with characteristics such as controlled pitch and timbre. while others are not. Note that the recursion for P follows a Riccati equation and thus draws parallels to the Kalman filter musical sound processing musical sound. any sound is a potential ingredient for the kinds of sound organization called music. aesthetic. The choices of sounds for music making have been severely limited in all places and periods by a diversity of physical. The sounds are produced by instruments in which the periodic vibrations can be controlled by the performer.

The fundamental distinction usually made has been between tone and noise, a distinction best clarified by referring to the physical characteristics of sound. Noises are most readily identified, not by their character but by their sources: the noise of the dripping faucet, the grating chalk, or the squeaking gate. Although tones too are commonly linked with their sources (violin tone, flute tone, etc.), they more readily achieve autonomy because they possess controlled pitch, loudness, timbre, and duration, attributes that make them amenable to musical organization. Tone differs from noise mainly in that it possesses features that enable it to be regarded as autonomous.

Instruments that yield musical sounds, or tones, are those that produce periodic vibrations. Their periodicity is their controllable (i.e. musical) basis. The strings of the violin, the lips of the trumpet player, the reed of a saxophone, and the wooden slabs of a xylophone are all, in their unique ways, producers of periodic vibrations. The pattern may be visualized as an elastic reed, like that of a clarinet, fixed at one end, moving like a pendulum in a to-and-fro pattern when set into motion (see illustration). Once moving, it will oscillate until friction and its own inertia cause it to return to its original state of rest. Its arc of movement will be lesser or greater depending upon the degree of pressure used to set it into motion; clearly, this reed's motion will be in proportion to the applied force. As it moves through its arc the reed passes through a periodic number of cycles per time unit, although its speed is not constant. If this vibratory motion were audible, its motion through time could be charted by placing a carbon stylus on its moving head, then pulling a strip of paper beneath it at a uniform rate. It could be described as follows: it grows weaker from the beginning (diminishing loudness) until it becomes inaudible; the reed's displacement to-and-fro diminishes in a smooth fashion as time passes (decreasing intensity); it remains at a stable level of highness (steady pitch); each cycle of its arc is equally spaced (uniform frequency); each period of the motion forms the same arc pattern (uniform wave content); and it is of unvarying tonal quality (uniform timbre).

The pitch, or high-low aspect, created by each of these vibrating bodies is most directly a product of vibrational frequency. Loudness is a product of the intensity of that motion. Timbre (tone colour) is a product of the total complement of simultaneous motions enacted by any medium during its vibration. Duration is the length of time that a tone persists. Each of these attributes is revealed in the wave form of a tone. If the reed were a part of a clarinet and the player continued blowing it with unvaried pressure, pitch, loudness, and timbre would appear as constants.

Tone

Most musical tones differ from the demonstration tone (above) in that they consist of more than a single wave form. Any material undergoing vibratory motion imposes its own characteristic oscillations on the fundamental vibration. The reed probably would vibrate in parts as well as a whole, thus creating partial wave forms in addition to the fundamental wave form. These partials are not fortuitous. They bear harmonic relationships to the fundamental motion that are expressible as frequency ratios of 1:2, 3:4, etc. This means that the reed (or string or air column as well) is vibrating in halves and thirds and fourths as well as a whole. Another way of expressing this is that half the body is vibrating at a frequency twice as great as the whole, a third is vibrating at a frequency three times greater, and so on.

These numerical relationships also are expressible by pitch relationships as the harmonic, or overtone, series (see illustration), which is merely a representation of numerical ratios in terms of pitch equivalents. Although pure tones, or tones lacking other than a fundamental frequency, sometimes occur in music, most musical tones are composites: they consist of partial vibrations of the vibrating body as well as the vibrations of the whole mass. Depending upon its shape and substance, a vibrating mass, whether it be the mass of a string, reed, or air column, performs motions that are the equivalents of these partial vibrations. Although one can develop the acuity required to hear some of these overtones within a musical tone, the ear normally ignores them as separate parts, recognizing only a more or less rich tone quality within the fundamental pitch. It is the presence or absence of overtones and their relative intensities that determine the timbre of any tone. A typical violin tone is relatively rich in overtones while a flute tone sometimes approaches a pure tone. What the listener recognizes as "a violin tone" or "a trumpet tone" also is a product of the noise content that accompanies the articulation of any sound on the particular instrument. The friction of the bow as it is set into motion across the string, the eddies of air pressure within a horn's mouthpiece, or the hammer's impact on a piano string all add an extra dimension, a significant "noise factor," to any manually produced tone. The violin and flute tones are distinguishable because their articulatory "noises" are quite different and their overtone contents are dissimilar, even when they produce the same pitch.

Musical tones of determined harmonic content can be produced by electronic vacuum tubes or transistors as well as by traditional manual instruments. Some electronic organs, for example, use single vacuum tubes whose frequency output can be varied through control of an adjustable transformer. Through ingenious mixing circuits a compound tone consisting of any predetermined overtone content can be produced, thereby imitating the sound of any traditional instrument. Electronic computers are capable of complete imitation of such sounds: after articulation, the tone is broken down into its component parts, then synthesized through an auditory output circuit. Composers of electronic music have utilized this capability to synthesize tones quite different from any available on traditional instruments, as well as tones similar to natural sounds.

Movement

Once an audible oscillation is produced by a vibrating body, it moves away from its source as a spherical pressure wave. Sound waves move as a succession of compressions through the air. The rate of passage through any medium is determined by the medium's density and elasticity: the denser the medium, the slower the transmission; the greater the elasticity, the faster. In air at around 60° F, sound moves at approximately 1,120 feet per second, the rate increasing by 1.1 feet per second per degree of rise in temperature. The wavelength is determined by frequency: the higher the pitch, the shorter the wavelength. A pitch of 263 cycles per second (middle C of the piano) is borne as a wavelength of around 4.3 feet (speed of sound ÷ frequency = wavelength).

By the time a wave has moved some distance, it has changed in some of its characteristics. The journey has robbed it of intensity, which is inversely proportional to the square of the distance. Its timbre has been altered slightly by objects within its path that disrupted an equitable distribution of frequencies, particularly the high-frequency waves.

The area within which a sound occurs can have considerable effect upon what is heard. Just as a string or reed or air column has a natural resonance period (or rate of vibration), any enclosure, whether an audio speaker cabinet or the nave of a cathedral, imposes its resonance characteristics on a sound wave within it. Any tone that approximates in frequency the characteristic resonance period, or natural resonance, of the air within the enclosure will be reinforced through the sympathetic response, making it appear richer and more powerful. This means that tones of frequencies differing from the resonance of the enclosure will be less intense than those that agree. (Bathroom singers revel in this phenomenon because the band of resonance sometimes lies close enough to the pitches of the male voice to support it.) Fortunately, most rooms where music is performed are large enough (wall lengths greater than about 30 feet) so that their natural resonance periods are too slow to fall within the range of pitches of the lowest musical tones (usually no lower than 27 cycles per second, although some organs have pipes that extend to 15 cycles per second). Smaller rooms can produce disturbing sympathetic resonance unless obstructions or absorbent materials are added to minimize that effect.

In addition to resonance, any enclosure possesses a reverberation period: a unit of time measured from the instant a sound fills the enclosure (steady state) until that sound has decayed to one-millionth of its initial intensity. Anyone who has spoken or clapped his hands inside a large, empty room has experienced prolonged reverberation. There are two reasons for such protracted reverberation: first, the space between the surfaces of the enclosure is so great that reflected sound waves travel extended distances before decaying; second, the absence of highly absorbent materials precludes appreciable loss of intensity of the wave during its movement. The reverberation period is a crucial factor in rooms where sounds must be heard with considerable fidelity. If the period is too long in a room where speech must be understood, spoken syllables will blend into each other and the words will be mumbled confusion. If, on the other hand, the reverberation period is too brief in a room where human "presence" and music each contribute to the acoustics, only a "cold" and "dull" feeling will persist, because no reverberative support of the prevailing sounds can be provided by the enclosure itself. (See also acoustics: Architectural acoustics.)

Although all sound waves, regardless of their pitch, travel at the same rate of speed through a particular medium, low tones mushroom out in a broad trajectory while high tones move in straight paths. Seats far to the side at the front of an auditorium therefore offer occupants a potentially distorted version of sound from its source, and listeners in any room should be within a direct path of sound propagation.

Thus the high-frequency speakers (tweeters) in good audio reproduction systems are angled toward the sides of the room, diffusing the high frequencies equally over the path of propagation and ensuring wider coverage for high-frequency components of all sounds.

Sites of musical performance in the open demand quite different acoustical arrangements, since sound reflection from ceilings and walls cannot occur and reverberation cannot provide the desirable support that would be available within a room. A reflective shell placed behind the sound source can provide a boost in transmission of sounds toward listeners. Such a reflector must be designed so that relatively uniform wave propagation will reach all locations where listening will occur, or only frustration will result. The shell form serves that purpose admirably: with its curved shape, sound waves are reflected more uniformly over a wide area than with any other shape, and it avoids the right angles that might set up continuous reflections, or echoing. (The needs here are similar to those of the photographer who wishes to flood a scene uniformly with flat light rather than focus with a spotlight on a small area.)

Pitch and timbre

Just as various denominations of coins combine to form the larger units of a monetary system, so musical tones combine to form larger units of musical experience. Pitch, loudness, duration, and timbre act as four-fold coordinates in the structuring of these units. Especially since the Renaissance, when instruments emerged as the principal vehicles of the musical impulse, pitch has been favoured as the dominating attribute by most Western theorists, leaving loudness and timbre more as the "understood" parameters of the musical palette. The history of music theory has to a great degree consisted of a commentary on the ways pitches are combined to make musical patterns, and Western theory has been occupied with this task from as early as Aristoxenus (4th century bc).

Music terminology, for example, recognizes loudnesses in music in terms of an eight-level continuum of nuances from "extremely soft" (ppp, or pianississimo) to "extremely loud" (fff, or fortississimo). (The musical dominance of Italy from the late 16th to the 18th century, when these Italian terms first were applied, explains their retention today.) The timbres of music enjoy an even less explicit and formalized ranking; there is no standard taxonomy of tone quality, other than the vague classifications "shrill," "mellow," "full," and so on. Musicians for the most part are content to denote a particular timbre by the name of the instrument that produced it.

Pitch is another matter. A highly developed musical culture demands a precise standardization of pitch, and problems of pitch location (tuning) and representation (notation) have challenged the practicing musician: when at least two instrumentalists sit down to play a duet, there must be some agreement about pitch. Although the standardization of the pitch name a′ (within the middle of the piano keyboard) at 440 cycles per second has been adopted by most of the professional music world, there was a day, even during the mid-18th century of Bach, when pitch uniformity was unknown.

Division of the pitch spectrum

Man's perception of pitch is confined within a span of roughly 15 to 18,000 cycles per second. This upper limit varies with the age and ear structure of the individual, the upper limit normally attenuating with advancing age. The pitch spectrum is divided into octaves, a name derived from the scale theories of earlier times when only eight (Latin octo) notes within this breadth were codified. Today the octave is considered in Western music to define the boundaries for the pitches of the chromatic scale. The piano keyboard is a useful visual representation of this 12-unit division of the octave: beginning on any key, there are 12 different keys (and thus 12 different pitches), counting the beginning key, before a key occupying the same position in the pattern recurs. The semitone is the smallest acknowledged interval of the Western pitch system. The sizes of all remaining intervals can be calculated by determining how many semitones each contains; the names of these intervals are derived from musical notation through a simple counting of lines and spaces of the staff (see illustration).

Western music history is dotted with systems formulated for the precise tuning of pitches within the octave. From a modern viewpoint all suffer from one of two mutually exclusive faults: either they lack relationships (intervals) of uniform size, or they are incapable of providing chords that are acceptable to the ear. Pythagorean tuning provides uniformity but not the chords. Just tuning, based on the simpler ratios of the overtone series, provides the chords but suffers from inequality of intervals. Meantone tuning provides equal intervals but gives rise to several objectionable chords. All three of these systems fail to provide the pitch wherewithal for the 12 musical keys found in the standard repertoire.

The compromise tuning system most widely accepted since the mid-19th century is called "equal temperament." Based on the division of the octave into 12 equal half-steps, or semitones, this method provides precisely equal intervals and a full set of chords that, although not as euphonious as those of the overtone series, are not offensive to the listener. One must keep in mind that the chromatic scale, within the various octave registers of man's hearing, is merely a conventional standard of pitch tuning. Performers like singers, trombone and string players, who can alter the pitches they produce, frequently make use of pitches that do not correspond precisely to this set of norms. Furthermore, some contemporary music makes use of pitch placements that divide the octave into units smaller than the half-step. This music, called microtonal, has not become standard fare in Western cultures, in spite of its advocates (Alois Hába, Julian Carillo, Karlheinz Stockhausen) and even its special instruments that provide a means for consistent performance. The music of many non-Western cultures also utilizes distinct divisions of the octave.

Just as the overtone content of a single tone determines timbre, the relationship of the constituent pitches of an interval determines its quality, or sonance.

Consonance and dissonance

Until the 20th century, music theorists were prone to concoct tables that showed an "objective" classification of intervals into the two opposing camps of consonant and dissonant. Many attempts have been made to link consonant with pleasant, smooth, stable, beautiful, and dissonant with unpleasant, grating, unstable, and ugly, but only the person who utters these terms can know with assurance what he means by them. These adjectives may be reasonably meaningful in musical contexts, but difficulty arises if one attempts to pin a singular evaluation on a particular interval per se. Theorists have noted that the character of an interval is altered considerably by the sounds that surround it: the naked interval that sounds "grating," "unstable," and lacking in fusion might within a particular context create an altogether different effect. Recognition of the power of context in shaping a response to the individual pitch interval has led some music theorists to think more in terms of a continuum of sonance that extends from more consonant to more dissonant, tearing down the artificial fence once presumed to separate the two in experience. Music in which a high degree of dissonance occurs has rekindled interest in this old problem of psychoacoustics. There is a long history of speculations in this area, but the subjectivity of the data indicates that little verifiable fact can be sorted from it.

An initial theory was based on the notion that dissonance is a product of beats, which result from simultaneous tones or their upper overtones of slightly differing frequencies. Another explanation, offered later by Helmholtz, held that two tones are consonant if they have one or more overtones (excluding the seventh and ninth) in common (see illustration); the explanation of consonance and dissonance offered by Hermann von Helmholtz in On the Sensations of Tone (1863) is perhaps as helpful as any. The German composer Paul Hindemith (1895-1963) provided one explanation of harmonic tension and relaxation that depends upon the intervals found within chords. According to his view a chord is more dissonant than another if it contains a greater number of intervals that, as separate entities, are dissonant. Although Hindemith's reasonings and conclusions have not been widely accepted, the absence of any more convincing explanation and classification often leads musicians to use his ideas implicitly.

Scales and modes

Although the complete pitch spectrum can be tuned in a way that provides 12 pitches per octave (as the chromatic scale), pitch organization in music usually is discussed in terms of less inclusive kinds of scale patterns. The most important scales in traditional Western theory are seven-tone (heptatonic), which, like the chromatic, operate within the octave. These scales are different from one another only in the intervals formed by their constituent pitches. The major scale, for instance, consists of seven pitches arranged in the intervallic order: tone-tone-semitone-tone-tone-tone-semitone.

Called major because of the large (or major) third that separates the first and third pitches, this scale differs from the minor scale mainly in that the latter contains a small (or minor) third in this location. Since three variants of the minor scale are recognized in the music of the Western repertoire, it is important to note that they share this small interval between their first and third pitches. Other scales, called modes, possess greater representational power for music of earlier times and for much of the repertoire of Western folk music. These too are heptatonic patterns, their uniqueness produced solely by the differing pitch relationships formed by their members; each of the modes can most easily be reproduced by playing successive white keys at the piano. The music of several Eastern cultures, a number of children's songs, and occasional Western folk songs incorporate pitch materials best classified as pentatonic (a five-pitch scale), though they do not utilize the total complement of 12 chromatic pitches per octave.

Major and minor scales formed the primary pitch ingredients of music written between 1650 and 1900. The modes and the major and minor scales best represent the pitch structure of Western music, although this is a sweeping generalization for which exceptions are not rare. They are abstractions that are meaningful for tonal music, i.e. music in which a particular pitch acts as a focal point of perception, establishing a sense of repose or tonality to which the remaining six pitches relate. Major and minor scale tonality was basic to Western music until it began to disintegrate in the art music of the late 19th century. It was replaced in part by the methods of Arnold Schoenberg (1874-1951), which used all 12 notes as basic material. Since that revolution of the early 1920s, the raw pitch materials of Western music have frequently been drawn from the complete chromatic potential.

Image enhancement

The aim of image enhancement is to improve the interpretability or perception of information in images for human viewers, or to provide `better' input for other automated image processing techniques. Unfortunately, there is no general theory for determining what is `good' image enhancement when it comes to human perception: if it looks good, it is good! However, when image enhancement techniques are used as pre-processing tools for other image processing techniques, quantitative measures can determine which techniques are most appropriate.

Image enhancement techniques can be divided into two broad categories:
1. Spatial domain methods, which operate directly on pixels, and
2. frequency domain methods, which operate on the Fourier transform of an image.

Spatial domain methods

The value of a pixel with coordinates (x,y) in the enhanced image is the result of performing some operation on the pixels in a neighbourhood of (x,y) in the input image. Neighbourhoods can be any shape, but usually they are rectangular.

Grey scale manipulation

The simplest form of operation is when the operator T only acts on a 1x1 pixel neighbourhood in the input image, that is, the output value only depends on the value of F at (x,y). This is a grey scale transformation or mapping. The simplest case is thresholding, where the intensity profile is replaced by a step function, active at a chosen threshold value. In this case any pixel with a grey level below the threshold in the input image gets mapped to 0 in the output image. Other pixels are mapped to 255. Other grey scale transformations are outlined in figure 1 below.

Figure 1: Tone-scale adjustments.
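The thresholding mapping in code (the threshold of 128 is an arbitrary example):

```python
import numpy as np

def threshold(image, t=128):
    """Grey scale thresholding: pixels below t -> 0, all others -> 255."""
    return np.where(image < t, 0, 255).astype(np.uint8)

# Example on a tiny synthetic image:
img = np.array([[10, 200], [130, 90]], dtype=np.uint8)
print(threshold(img))   # [[  0 255] [255   0]]
```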

Histogram Equalization

Histogram equalization is a common technique for enhancing the appearance of images. Suppose we have an image which is predominantly dark. Then its histogram would be skewed towards the lower end of the grey scale, and all the image detail would be compressed into the dark end of the histogram. If we could `stretch out' the grey levels at the dark end to produce a more uniformly distributed histogram, then the image would become much clearer.

Figure 2: The original image and its histogram, and the equalized versions. Both images are quantized to 64 grey levels.

Histogram equalization involves finding a grey scale transformation function that creates an output image with a uniform histogram (or nearly so).

How do we determine this grey scale transformation function? Assume our grey levels are continuous and have been normalized to lie between 0 and 1. We must find a transformation T that maps grey values r in the input image F to grey values s = T(r) in the transformed image. It is assumed that
• T is single valued and monotonically increasing, and
• 0 ≤ T(r) ≤ 1 for 0 ≤ r ≤ 1.

The inverse transformation from s to r is given by r = T^-1(s). If one takes the histogram for the input image and normalizes it so that the area under the histogram is 1, we have a probability distribution for grey levels in the input image, Pr(r). If we transform the input image to get s = T(r), what is the probability distribution Ps(s)? From probability theory it turns out that

    Ps(s) = Pr(r) dr/ds,   where r = T^-1(s).

Consider the transformation

    s = T(r) = integral from 0 to r of Pr(w) dw

This is the cumulative distribution function of r. Using this definition of T we see that the derivative of s with respect to r is

    ds/dr = Pr(r)

Substituting this back into the expression for Ps, we get

    Ps(s) = Pr(r) · (1 / Pr(r)) = 1   for all s, 0 ≤ s ≤ 1.

Thus, Ps(s) is now a uniform distribution function, which is what we want.

Discrete Formulation

We first need to determine the probability distribution of grey levels in the input image:

    Pr(k) = nk / N

where nk is the number of pixels having grey level k, and N is the total number of pixels in the image. The transformation now becomes

    sk = T(k) = Σ_{i=0}^{k} ni / N

Note that 0 ≤ sk ≤ 1 for the index k = 0, 1, ..., 255, and that sk is monotonically increasing. The values of sk will have to be scaled up by 255 and rounded to the nearest integer so that the output values of this transformation will range from 0 to 255. Thus the discretization and rounding of sk to the nearest integer will mean that the transformed image will not have a perfectly uniform histogram.
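This discrete recipe translates directly into NumPy. A sketch for 8-bit greyscale images:

```python
import numpy as np

def equalize(image):
    """Histogram equalization for an 8-bit greyscale image:
    s_k = round(255 * cumulative histogram up to level k)."""
    hist = np.bincount(image.ravel(), minlength=256)  # n_k for k = 0..255
    cdf = np.cumsum(hist) / image.size                # s_k = sum n_i / N
    lut = np.round(255 * cdf).astype(np.uint8)        # scale and round
    return lut[image]                                 # map each pixel
```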

Image Smoothing

The aim of image smoothing is to diminish the effects of camera noise, spurious pixel values, missing pixel values etc. There are many different techniques for image smoothing; we will consider neighbourhood averaging and edge-preserving smoothing.

Neighbourhood Averaging

Each point in the smoothed image is obtained from the average pixel value in a neighbourhood of (x,y) in the input image. For example, if we use a 3x3 neighbourhood around each pixel we would use the mask

    1/9  1/9  1/9
    1/9  1/9  1/9
    1/9  1/9  1/9

Each pixel value is multiplied by 1/9, summed, and then the result is placed in the output image. This mask is successively moved across the image until every pixel has been covered. That is, the image is convolved with this smoothing mask (also known as a spatial filter or kernel).

However, one usually expects the value of a pixel to be more closely related to the values of pixels close to it than to those further away. This is because most points in an image are spatially coherent with their neighbours; indeed it is generally only at edge or feature points where this hypothesis is not valid. Accordingly it is usual to weight the pixels near the centre of the mask more strongly than those at the edge. Some common weighting functions include the rectangular weighting function above (which just takes the average over the window), a triangular weighting function, or a Gaussian. In practice one doesn't notice much difference between different weighting functions, although Gaussian smoothing is the most commonly used. Gaussian smoothing has the attribute that the frequency components of the image are modified in a smooth manner; mask shapes other than the Gaussian can do odd things to the frequency spectrum. Smoothing reduces or attenuates the higher frequencies in the image, but as far as the appearance of the image is concerned we usually don't notice much.

Edge preserving smoothing

Neighbourhood averaging or Gaussian smoothing will tend to blur edges because the high frequencies in the image are attenuated. An alternative approach is to use median filtering. Here we set the grey level to be the median of the pixel values in the neighbourhood of that pixel. The median m of a set of values is such that half the values in the set are less than m and half are greater. For example, suppose the pixel values in a neighbourhood are (10, 20, 20, 15, 20, 20, 20, 20, 25, 100). If we sort the values we get (10, 15, 20, 20, 20, |20|, 20, 20, 25, 100) and the median here is 20.

The outcome of median filtering is that pixels with outlying values are forced to become more like their neighbours, but at the same time edges are preserved. Of course, median filters are nonlinear.

Figure 3: Image of Genevieve with salt and pepper noise, the result of averaging, and the result of median filtering.

Median filtering is in fact a morphological operation. When we erode an image, pixel values are replaced with the smallest value in the neighbourhood. Dilating an image corresponds to replacing pixel values with the largest value in the neighbourhood. Median filtering replaces pixels with the median value in the neighbourhood. It is the rank of the value of the pixel used in the neighbourhood that determines the type of morphological operation.
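The comparison in figure 3 is easy to reproduce with SciPy's stock filters; the noise fraction and window size below are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

rng = np.random.default_rng(0)
image = np.full((64, 64), 100.0)
image[16:48, 16:48] = 200.0                   # bright square with sharp edges

# Add salt and pepper noise to 5% of the pixels
noisy = image.copy()
mask = rng.random(image.shape) < 0.05
noisy[mask] = rng.choice([0.0, 255.0], size=mask.sum())

averaged = uniform_filter(noisy, size=3)      # 3x3 neighbourhood averaging
medianed = median_filter(noisy, size=3)       # 3x3 median filtering

# The median filter removes the outliers while keeping the square's edges
# sharp; the averaging filter smears both the noise and the edges.
```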

Image sharpening

The main aim in image sharpening is to highlight fine detail in the image, or to enhance detail that has been blurred (perhaps due to noise or other effects, such as motion). With image sharpening, we want to enhance the high-frequency components; this implies a spatial filter shape that has a high positive component at the centre (see figure 4 below).

Figure 4: Frequency domain filters (top) and their corresponding spatial domain counterparts (bottom).

A simple spatial filter that achieves image sharpening is given by

    -1/9  -1/9  -1/9
    -1/9   8/9  -1/9
    -1/9  -1/9  -1/9

We simply compute the Fourier transform of the image to be enhanced. if we multiply the original image by an amplification factor A before subtracting the low pass image. that is. and take the inverse transform to produce the enhanced image.-1/9 1/9 1/9 . Thus.Since the sum of all the weights is zero. High boost filtering We can think of high pass filtering in terms of subtracting a low pass image from the original image. if A = 1 we have a simple high pass filter. the average signal value. Frequency domain methods Image enhancement in the frequency domain is straightforward. For display purposes. we will get a high boost or high frequency emphasis filter. When A > 1 part of the original image is retained in the output. multiply the result by a filter (rather than convolve in the spatial domain). Now. . the resulting signal will have a zero DC value (that is.-1/9 1/9 1/9 1/9 /9 1/9 . However.Low pass. we also want to retain some of the low frequency components to aid in the interpretation of the image. Thus. or the coefficient of the zero frequency term in the Fourier expansion). A simple filter for high boost filtering is given by . in many cases where a high pass image is required. High pass = Original . we might want to add an offset to keep the result in the range.

Frequency domain methods

Image enhancement in the frequency domain is straightforward. We simply compute the Fourier transform of the image to be enhanced, multiply the result by a filter (rather than convolve in the spatial domain), and take the inverse transform to produce the enhanced image. The idea of blurring an image by reducing its high frequency components, or sharpening an image by increasing the magnitude of its high frequency components, is intuitively easy to understand. However, computationally, it is often more efficient to implement these operations as convolutions by small spatial filters in the spatial domain. Understanding frequency domain concepts is important nonetheless, and leads to enhancement techniques that might not have been thought of by restricting attention to the spatial domain.

Filtering

Low pass filtering involves the elimination of the high frequency components in the image. It results in blurring of the image (and thus a reduction in sharp transitions associated with noise). An ideal low pass filter would retain all the low frequency components, and eliminate all the high frequency components. However, ideal filters suffer from two problems: blurring and ringing. These problems are caused by the shape of the associated spatial domain filter, which has a large number of undulations. Smoother transitions in the frequency domain filter, such as the Butterworth filter, achieve much better results.

Figure 5: Transfer function for an ideal low pass filter.
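The transform-multiply-invert pipeline in code, using a Butterworth-style low pass transfer function (the cutoff and order values are arbitrary choices):

```python
import numpy as np

def butterworth_lowpass(image, cutoff=0.1, order=2):
    """Frequency domain filtering: FFT the image, multiply by a
    Butterworth low pass transfer function, inverse FFT."""
    F = np.fft.fft2(image)
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    d = np.sqrt(fx ** 2 + fy ** 2)            # distance from zero frequency
    H = 1.0 / (1.0 + (d / cutoff) ** (2 * order))
    return np.real(np.fft.ifft2(F * H))
```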

Homomorphic filtering

Images normally consist of light reflected from objects. The basic nature of the image F(x,y) may be characterized by two components: (1) the amount of source light incident on the scene being viewed, and (2) the amount of light reflected by the objects in the scene. These portions of light are called the illumination and reflectance components, and are denoted i(x,y) and r(x,y) respectively. The functions i and r combine multiplicatively to give the image function F:

    F(x,y) = i(x,y) r(x,y)

where 0 < i(x,y) < infinity and 0 < r(x,y) < 1. We cannot easily use the above product to operate separately on the frequency components of illumination and reflection, because the Fourier transform of the product of two functions is not separable. Suppose, however, that we define

    z(x,y) = ln F(x,y) = ln i(x,y) + ln r(x,y)

Then

    F{ z(x,y) } = F{ ln i(x,y) } + F{ ln r(x,y) }

or

    Z(u,v) = I(u,v) + R(u,v)

where Z, I and R are the Fourier transforms of z, ln i and ln r respectively. The function Z represents the Fourier transform of the sum of two images: a low frequency illumination image and a high frequency reflectance image.

Figure 6: Transfer function for homomorphic filtering.

If we now apply a filter with a transfer function that suppresses low frequency components and enhances high frequency components, then we can suppress the illumination component and enhance the reflectance component. Thus

    S(u,v) = H(u,v) Z(u,v) = H(u,v) I(u,v) + H(u,v) R(u,v)

where S is the Fourier transform of the result. In the spatial domain

    s(x,y) = F^-1{ S(u,v) }

By letting

    i'(x,y) = F^-1{ H(u,v) I(u,v) }   and   r'(x,y) = F^-1{ H(u,v) R(u,v) }

we get

    s(x,y) = i'(x,y) + r'(x,y)

Finally, as z was obtained by taking the logarithm of the original image F, the inverse yields the desired enhanced image:

    F̂(x,y) = exp( s(x,y) ) = exp( i'(x,y) ) exp( r'(x,y) )

Thus, the process of homomorphic filtering can be summarized by the following diagram:

Figure 7: The process of homomorphic filtering.
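The whole pipeline as a NumPy sketch. The exponential transfer function and its low/high gains are illustrative choices of mine, not from the source:

```python
import numpy as np

def homomorphic(image, cutoff=0.05, gamma_low=0.5, gamma_high=2.0):
    """Homomorphic filtering: log -> FFT -> high-emphasis filter ->
    inverse FFT -> exp, suppressing illumination, boosting reflectance."""
    z = np.log1p(image.astype(float))          # z = ln F (log1p avoids ln 0)
    Z = np.fft.fft2(z)
    fy = np.fft.fftfreq(z.shape[0])[:, None]
    fx = np.fft.fftfreq(z.shape[1])[None, :]
    d2 = fx ** 2 + fy ** 2
    # Transfer function rising from gamma_low at low frequencies
    # to gamma_high at high frequencies:
    H = gamma_low + (gamma_high - gamma_low) * (1 - np.exp(-d2 / cutoff ** 2))
    s = np.real(np.fft.ifft2(H * Z))
    return np.expm1(s)                         # invert the log
```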