
Wideband Speech and Audio Coding

Powerful algorithms and standards are now available to enhance services in communication-based and storage-based audio-only and audiovisual applications.
Peter Noll

Although high bit rate channels and networks have become more easily accessible, low bit rate coding of speech and audio signals has retained its importance. The main motivations for low bit rate coding are the need to minimize transmission costs or provide cost-efficient storage, and the demand to transmit over channels of limited capacity such as mobile radio channels. In addition, there may be a need to share capacity for different services such as voice, audio, data, graphics, and images in integrated services networks, and to support variable-rate coding in packet-oriented networks. Basic requirements in the design of low bit rate coders are: 1) high quality of reconstructed speech or audio signals with robustness to variations in spectra and levels; 2) robustness to random and bursty channel bit errors and packet losses; 3) low complexity and power consumption of the coders (which are highly relevant). For example, the complexity of audio decoders should be low enough to support low-cost solutions because broadcast and playback applications play a dominant role in wideband audio. Additional network-related requirements are low encoder/decoder delays, robust tandeming of codecs, transcodeability, and a graceful degradation of quality with increasing bit error rates in mobile radio and broadcast applications. All these partly conflicting factors have to be carefully considered in selecting a wideband speech or audio coding algorithm for a given application. Earlier proposals to reduce the PCM rates have followed those for narrowband speech coding. However, differences between audio and speech signals are manifold, since audio coding implies higher values of sampling rate, amplitude resolution, and dynamic range, larger variations in power density spectra, differences in human perception, and higher listener expectations of quality. Speech coding can be so efficient because speech signals have an underlying vocal tract production model, whereas for audio, in general, this is not the case. There has been rapid progress in coding of

PETER NOLL is a professor at the Technical University of Berlin and chairs the Audio Subgroup within ISO/MPEG.

speech and audio signals. Linear prediction, subband coding, transform coding, and various forms of vector quantization and entropy coding techniques have been used to design efficient coding algorithms that can achieve substantially more compression than was thought possible only a few years ago. Recent results indicate that good to excellent coding quality can or will soon be obtained with bit rates of 1 b/sample for speech and wideband speech, and 2 b/sample for audio. Rate reductions to 0.5 and 1 b/sample, respectively, can be expected over the next decade. These high reductions are achieved by employing perceptual coding techniques, so that only those details of the signal that are perceptible by ear will be transmitted. Applying knowledge of auditory perception leads to hearing-specific coders that perform remarkably well. It is important to note that bit rate reduced digital representations of source signals can be much more robust to channel impairments than analog techniques if source and channel coding are implemented efficiently. In addition, with today's data compression and multilevel signaling techniques, one can actually reduce the bandwidth by going digital; bandwidth expansion is no longer the price to be paid for digital coding and transmission. In this article, we will describe typical parameters of wideband speech and audio signals, including digitized versions of each; potential applications; and available transmission media. We will briefly introduce facts about human auditory perception that are exploited in audio coding, and quality measures that play an important role in coder evaluations and designs. Then we will describe techniques of efficient coding of wideband speech and audio signals, with an emphasis on existing standards.
The recent ISO/MPEG audio coding standard is covered in some detail, since it will be used in many application areas, including digital storage, transmission, and broadcasting of audio-only signals and audiovisual applications such as video-telephony, video-conferencing, and TV broadcasting. Finally, ongoing research and standardization work will be outlined.


0163-6804/93/$03.00 © 1993 IEEE

IEEE Communications Magazine

November 1993

Signals and Signal Delivery


Signals
Telephone speech, wideband speech, and wideband audio signals differ not only in bandwidth and dynamic range, but also in listener expectation of offered quality. The conventional digital format is PCM, with typical sampling rates and amplitude resolutions (PCM bits per sample) as shown in Table 1.

Wideband Speech - Higher bandwidths than the 300 to 3400 Hz telephone bandwidth result in major subjective improvements in represented speech quality. A bandwidth of 50 to 7000 Hz not only improves the intelligibility and naturalness of speech, but also adds a feeling of transparent communication and eases speaker recognition. Applications of high relevance are loudspeaker telephony, ISDN conferencing systems, multipoint interactive audiovisual communications, and the use of commentary channels for broadcasting. In 1986, CCITT recommended a 64 kb/s wideband speech coder developed primarily for transmission over the ISDN basic rate (B) channel [1]. This G.722 wideband speech coder will be described later in this article. Current activities in wideband speech coding concentrate on coding at 32 kb/s and below, with the 64 kb/s CCITT standard serving as reference.
Audio - The compact disc (CD) has made digital audio popular; its 16 b PCM format is an accepted audio representation standard. In audio production, resolutions up to 24 b PCM are in use. On a compact disc, signals of 20 kHz bandwidth and 44.1 kHz sampling rate are stored with a resolution of 16 b/sample. Hence the resulting net bit rate is 44.1 x 16 = 705.6 kb/s per monophonic channel. A significant overhead is needed for a line code that maps 8 information bits into 14 bits, for synchronization, and for error correction. Bit error rates for clean discs after error correction are around 10⁻¹¹, but handling (fingerprints) may bring this value down. Table 2 lists some parameters of the CD and the digital audio tape (DAT). The Moving Pictures Expert Group within the International Organization for Standardization (ISO/MPEG) has been developing a series of audiovisual standards. Its recent audio coding standard is the first international standard in the field of high quality digital audio compression and is about to become a standard in many other application areas, both for consumer and professional audio [2, 3]. Decoder chips are already available; a first consumer product, Philips' Digital Compact Cassette (DCC), makes use of Layer I of the ISO/MPEG coder. Typical application areas for digital audio are in the fields of audio production, program distribution and exchange, digital sound broadcasting (DSB)¹, and digital storage (archives, studios, consumer electronics). Digital audio is also useful for interpersonal communications such as video-conferencing and multimedia applications, and for enhanced quality TV systems.
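The CD rates quoted above follow from simple arithmetic; the sketch below (plain Python, with variable names chosen here purely for illustration) reproduces the numbers in the text:

```python
# CD audio rate arithmetic as quoted in the text.
sample_rate_hz = 44_100
bits_per_sample = 16

net_rate_mono = sample_rate_hz * bits_per_sample   # b/s per monophonic channel
print(net_rate_mono / 1000)                        # 705.6 kb/s

# The 8-to-14 line code alone expands the channel bit rate by 14/8,
# before synchronization and error-correction overhead are added.
channel_expansion = 14 / 8
print(channel_expansion)                           # 1.75
```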

Table 1. Typical values of basic parameters of three classes of acoustic signals.

Table 2. Formats of CD and DAT storage.


Multichannel Stereophony - As a logical further step in digital audio, a universal loudspeaker reproduction standard must be defined to provide an improved stereophonic image for audio-only applications, including teleconferencing, and for improved television systems. Loudspeaker arrangements referred to as 3/2-stereo, with a left and a right channel (L and R), an additional center channel C, and two side/rear surround channels (Ls and Rs), improve the presentation performance, both in audio-only applications and in audio-with-picture reproduction, where directional imaging distortions, i.e., angular displacements between visual and auditory images, play a significant role. In particular, the three front loudspeakers ensure a sufficient directional stability and clarity of the frontal sound image (i.e., a stable middle and an enlarged listening area). An additional pair of surround loudspeakers² may be useful for collective viewing situations, e.g., HDTV-cinema, but a 3/2-stereo format is considered to be a good compromise both for production and transmission [4]. Examples of digital multichannel surround systems are the upcoming ISO/MPEG 3/2-stereo coding standard and Dolby's Stereo SR-D system based on its AC-3 audio coding algorithm. Both systems offer an additional optional low frequency effect (subwoofer) channel, to reproduce frequencies below around 120 Hz with one or more loudspeakers which can be positioned freely in the listening room. The overall bit rate for a 3/2-stereo system will possibly fit into the 384 kb/s H0 channel of the ISDN hierarchy (see below).

Signal Delivery

Delivery of digital speech and audio signals is possible over terrestrial and satellite-based digital broadcast and transmission systems such as subscriber lines, program exchange links, cellular mobile radio networks, cable-TV networks, etc.

ISDN - With ISDN, customers have physical access to a number of relatively low-cost dial-up digital telecommunications channels. For the transmission of wideband speech and audio signals the basic-rate interface is of interest; it consists of two 64-kb/s

¹ The European term is Digital Audio Broadcasting (DAB).

² A 3/2-stereo format does not imply two additional signals; one or two surround signals may feed two or four side/rear loudspeakers.



Perception and Quality Measures


Perception
Auditory perception is based on critical band analysis in the inner ear, where a frequency-to-place transformation occurs along the basilar membrane. The power spectra are not represented on a linear frequency scale but on limited frequency bands called critical bands [5]. The auditory system can be described as a bandpass filterbank, consisting of strongly overlapping bandpass filters with bandwidths in the order of 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at high frequencies. Up to 24,000 Hz, 26 critical bands have to be taken into account.
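The critical-band analysis just described is often approximated analytically. The sketch below uses the widely quoted Zwicker-Terhardt fits for the critical-band rate (Bark scale) and critical bandwidth; the coefficients come from the psychoacoustics literature, not from this article:

```python
import math

def bark(f_hz):
    # Zwicker-Terhardt approximation of the critical-band rate (Bark).
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

def critical_bandwidth_hz(f_hz):
    # Approximate critical bandwidth: roughly 100 Hz below 500 Hz,
    # growing to several kHz at high frequencies.
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69
```

For example, `critical_bandwidth_hz(200)` is close to 100 Hz, while `critical_bandwidth_hz(10000)` exceeds 2 kHz, matching the ranges quoted above.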

Figure 1. Threshold in quiet and masking threshold (after [6]). Acoustical events in the hatched areas will not be audible.


B channels and one 16 kb/s D channel (which supports signaling but can also carry user information). The primary-rate interface is either a 23 B + D configuration (North America and Japan) or a 30 B + D configuration (Europe), where the D channels operate at 64 kb/s. Both configurations also support 384 kb/s H0 channels and various combinations. From these numbers it is clear that ISDN offers useful channels for a practical distribution of stereophonic and multichannel audio signals.
IVDLAN - The Integrated Voice/Data Local Area Network (IVDLAN) (IEEE 802.9) can cope with real-time constraints. It provides a high bandwidth packet service (P channel) and a number of full-duplex isochronous digital channels (B, C, and D channels), similar to ISDN channels. There are two 64 kb/s B channels, a 16 kb/s or 64 kb/s packet channel, and an m x 64 kb/s broadband channel, similar to ISDN H channels.

DSB - Satellite-based or terrestrial digital audio broadcasting is a complex task, in particular if listeners use mobile and portable receivers. Multipath interference and selective fading are the main impairments to be expected. In addition, a broadcast network chain can include sections with different quality requirements, ranging from production quality, which supports editing, cutting, and postprocessing, to commentary grade quality, which is capable of delivering speech of excellent quality and musical program material at a reduced level of performance to the listener. The CCIR has made extensive tests to define audio coders to be used for DSB. In its ongoing work CCIR is preparing a draft recommendation that the ISO/MPEG Layer II audio coding algorithm, with an independent coding of the left and right information, should be used for digital audio contribution links (which support exchange of programs) at 180 + 12 kb/s (the latter rate is for data), for distribution links (which transmit the sound to the emitters) at 120 + 8 kb/s, and for emission at 128 kb/s. The ISO/MPEG Layer III coder is recommended for commentary links at a rate of 60 + 4 kb/s.

Simultaneous Masking - Simultaneous masking is a frequency domain phenomenon where a low-level signal, e.g., a pure tone (the maskee), can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker), e.g., narrowband noise, if masker and maskee are close enough to each other in frequency [6]. A masking threshold can be measured below which any signal will not be audible. The masking threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of masker and maskee. For example, with the masking threshold for the SPL = 60 dB masker in Fig. 1 at around 1 kHz, the SPL of the maskee can be surprisingly high; it will be masked as long as its SPL is below the masking threshold. The slope of the masking threshold is steeper towards lower frequencies, i.e., higher frequencies are more easily masked. It should be noted that the distance between masking level and masking threshold is smaller in noise-masks-tone experiments than in tone-masks-noise experiments. Noise and low-level signal contributions are masked inside and outside the particular critical band if their SPL is below the masking threshold. Noise contributions can be coding noise, aliasing distortions, and transmission errors. Without a masker, a signal is inaudible if its SPL is below the threshold of quiet, which depends on frequency and covers a dynamic range of more than 60 dB, as shown in the lower curve of Fig. 1. The qualitative sketch of Fig. 2 gives more details about the masking threshold: the distance between the level of the masker (shown as a tone in Fig. 2) and the masking threshold is called the signal-to-mask ratio (SMR). Its maximum value is at the left border of the critical band (point A). Within a critical band, coding noise will not be audible as long as its signal-to-noise ratio (SNR) is higher than its SMR.
Let SNR(m) be the signal-to-noise ratio resulting from an m-bit quantization; the perceivable distortion in a given subband is then measured by the noise-to-mask ratio NMR(m) = SMR - SNR(m) (in dB). The noise-to-mask ratio NMR(m) describes the difference between the coding noise in a given subband and the level where a distortion may just become audible; its value in dB should be negative.
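This NMR bookkeeping can be sketched as follows, using the common rule of thumb of roughly 6 dB of SNR per quantizer bit (an assumption of this illustration, not a figure from the article):

```python
def snr_db(m):
    # Rule-of-thumb SNR of an m-bit uniform quantizer (~6 dB per bit);
    # the exact figure depends on signal statistics and loading.
    return 6.02 * m

def nmr_db(smr_db, m):
    # NMR(m) = SMR - SNR(m); a negative value means the coding noise
    # in the subband stays below the masking threshold (inaudible).
    return smr_db - snr_db(m)

def bits_needed(smr_db):
    # Smallest quantizer resolution m with NMR(m) < 0.
    m = 0
    while nmr_db(smr_db, m) >= 0:
        m += 1
    return m
```

For a subband with an SMR of 20 dB, for example, four bits suffice under this rule (24.08 dB of SNR), while three bits (18.06 dB) would leave audible noise.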


We have just described masking by only one masker. If the source signal consists of many simultaneous maskers, a global masking threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The calculation of the global masking threshold is based on the high resolution short-term amplitude spectrum of the audio or speech signal, sufficient for critical-band-based analyses, and is determined in audio coding via a 512- or 1024-point FFT. In a first step all individual masking thresholds are determined, depending on signal level, type of masker (noise or tone), and frequency range. Next, the global masking threshold is determined by adding all individual masking thresholds and the threshold in quiet. (Adding this latter threshold ensures that the computed global masking threshold is not below the threshold in quiet.) The effects of masking reaching over critical band bounds must be included in the calculation. Finally, the global signal-to-mask ratio (SMR) is determined as the ratio of the maximum of the signal power and the global masking threshold (or as the difference of the corresponding levels in dB), as shown in Fig. 2.
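A drastically simplified version of this computation might look as follows. The spreading model and offsets here are placeholders (real MPEG psychoacoustic models distinguish tonal from noise-like maskers and spread thresholds along the Bark scale); only the threshold-in-quiet fit is a standard approximation from the psychoacoustics literature (Terhardt):

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    # Terhardt's approximation of the absolute hearing threshold (dB SPL).
    fk = np.maximum(f_hz, 20.0) / 1000.0
    return (3.64 * fk ** -0.8
            - 6.5 * np.exp(-0.6 * (fk - 3.3) ** 2)
            + 1e-3 * fk ** 4)

def global_masking_threshold_db(frame, fs, n_fft=512,
                                offset_db=14.0, slope_db_per_bin=12.0):
    # 1) high-resolution short-term spectrum via FFT
    win = np.hanning(n_fft)
    level_db = 10 * np.log10(
        np.abs(np.fft.rfft(frame[:n_fft] * win)) ** 2 + 1e-12)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # 2) crude individual thresholds: each spectral component masks its
    #    neighbours, falling off linearly in dB with bin distance
    bins = np.arange(len(level_db))
    thr = np.full(len(level_db), -300.0)
    for k, lk in enumerate(level_db):
        thr = np.maximum(
            thr, lk - offset_db - slope_db_per_bin * np.abs(bins - k))
    # 3) add the threshold in quiet by power addition, so the global
    #    threshold can never fall below the threshold in quiet
    quiet = threshold_in_quiet_db(freqs)
    return 10 * np.log10(10 ** (thr / 10) + 10 ** (quiet / 10))
```

The global SMR then follows as the difference (in dB) between the peak signal level and the minimum of this curve, as in Fig. 2.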

Figure 2. Masking threshold and SMR. Acoustical events in the hatched areas will not be audible.

In practical designs of perceptual coding we cannot go to the limits of masking or just noticeable distortion, since postprocessing of the acoustic signal (e.g., filtering in equalizers) by the end-user and multiple encoding/decoding processes may demask the noise. In addition, since our current knowledge about auditory masking is very limited, the hearing model used in the design of a particular perception-based audio coder may not be accurate enough. Therefore, as an additional requirement, we need a sufficient safety margin in practical designs of coders.

Temporal Masking - In addition to simultaneous masking, two time domain phenomena also play an important role in human auditory perception: pre-masking and post-masking. The temporal masking effects occur before and after a masking signal has been switched on and off, respectively (Fig. 3). The duration for which pre-masking applies is less than (or, as newer results indicate, significantly less than) one-tenth that of the post-masking, which is in the order of 50 to 200 ms. Both pre- and post-masking are being exploited in the ISO/MPEG audio coding algorithm.

Figure 3. Temporal masking (after [6]). Acoustical events in the hatched area will not be audible.

Perception-based Coding - In perception-based coders the encoding process is controlled by the global signal-to-mask ratio vs. frequency curve. If the bit rate necessary for a complete masking of distortions is available, the coding scheme will be transparent, i.e., the decoded signal is indistinguishable from the source signal. If the necessary bit rate is not available, the global masking threshold serves as a spectral error weighting function: the resulting error spectrum will have the shape of the global masking threshold.

Quality Measures

Digital representations of analog waveforms introduce some kind of distortions that can be specified by subjective criteria, such as the mean opinion score (MOS) as a measure of perceptual similarity; by simple objective criteria, such as the signal-to-noise ratio as a measure of the waveform similarity between source and reconstructed signal; or by complex criteria serving as objective measures of perceptual similarity, which take into account facts about human auditory perception.



Figure 4. Structure of CCITT G.722 wideband speech coder.

The most popular subjective assessment method is mean opinion scoring, where subjects classify the quality of coders on an N-point quality scale. The final result of such tests is an averaged judgement, the MOS. Two 5-point adjectival grading scales are in use, one for signal quality and the other for signal impairment, with an associated numbering [7]. In the 5-point CCIR impairment scale a MOS value of 5 refers to an imperceptible impairment, whereas a MOS value of 4 is defined by a perceptible, but not annoying, impairment, etc. The impairment scale is extremely useful if coders with only small impairments have to be graded. The ISO/MPEG tests have shown that triple stimulus/hidden reference/double blind tests, based on such MOS evaluations, can lead to very reliable results; in addition, small differences in quality become detectable. In these tests the subject is offered three signals, A, B, and C (triple stimulus). A is always the unprocessed source signal (the reference). B and C, or C and B, are the reference and the system under test (hidden reference). The selection is known neither to the subjects nor to the conductor(s) of the test (double blind test). The subjects have to decide if B or C is the reference and have to grade the remaining one. The advantage of MOS values is that different impairment factors can be assessed simultaneously, and that even small impairments can be graded. However, on the negative side, MOS values vary with time and from listener panel to listener panel, and it seems to be very difficult to duplicate test results at a different test site. In the case of audio signals, MOS values also depend strongly on the selected test items, and there are significant differences between MOS values obtained with average audio material and those obtained with the most critical test items. Finally, quality scale and impairment scale MOS values are not comparable.
For all these reasons, care is needed when comparing results between different experiments.

On the other hand, it should be pointed out that the MPEG and CCIR listening tests, carried out under very similar and carefully defined conditions with experienced listeners, have shown very similar and stable evaluation results. Objective assessments should correlate with human perception for all distortions likely to be found in the coding algorithm and transmission or storage system. Perception-based measures make use of masking thresholds derived from the input signal in order to compare them with the actual coding noise of the coder. Recent results have shown that such measures can give high correlations between subjective MOS scores and objective scores. For example, the perceptual audio quality measure has been applied to audio signals in the CCIR Digital Sound Broadcasting tests and has given a correlation of 0.98 with a standard deviation of 0.17 [8]. Another set of parameters, including local noise-to-mask ratios and averages over all critical bands, has proven to be easily implementable and to be accurate enough to be useful in coder design and evaluation [9]. In the CCIR audio coding tests the correlation was 0.94 with a standard deviation of 0.27.

Wideband Speech Coding


The CClTT Standard
Speech coding with a bandwidth wider than that offered in telephony results in major improvements in represented speech quality [1, 10]. The CCITT G.722 wideband speech coding algorithm supports bit rates of 64, 56, and 48 kb/s. The codec can be integrated on one chip and its overall delay is around 3 ms, small enough to cause no echo problems in telecommunication networks. In the CCITT wideband speech coder a subband splitting, based on two identical quadrature mirror (bandpass) filters (QMF), divides the 16 kHz sampled 14 b PCM representation of the wideband input signal into two critically subsampled


(8 kHz sampled) components, called low subband and high subband (Fig. 4). The filters overlap, and aliasing will occur because of the subsampling of each of the components; the synthesis QMF filterbank at the receiver ensures that aliasing products are canceled. However, quantization error components in the two subbands will not be eliminated. Therefore, 24-tap QMF filters with a stop-band attenuation of 60 dB are employed. The coding of the subband signals is based on a modified version of the 32 kb/s CCITT G.721 ADPCM speech coder. Input samples are adaptively predicted; the prediction error (or difference) signal is quantized and transmitted. The predictor is backward-adaptive, i.e., the predictor coefficients are updated sample-wise under the control of the already coded difference signal that is also available at the decoder. The predictor uses a pole-zero structure with six zeroes and two poles. It combines good prediction gain (with an equivalent gain in overall SNR) and simple stability control. The quantizer is also backward-adaptive and can rapidly adapt itself to the changing statistics of speech signals. After transmission errors, both predictor and quantizer converge (in the long term) to identical values once no more transmission errors are observed. High quality coding with the G.722 wideband speech coder is provided by a fixed bit allocation, where the low and high subband ADPCM coders use a 6 b/sample and a 2 b/sample quantizer, respectively. In the low subband the signal resembles the narrowband speech signal in most of its properties. A reduction of the quantizer resolution to 5 or 4 b/sample is possible to support transmission at lower rates or of auxiliary data at rates of 8 and 16 kb/s. Embedded encoding is used in the low subband ADPCM coding, i.e., the adaptations of predictor and quantizer are always based only on the four most significant bits of each ADPCM codeword.
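The two-band analysis/synthesis principle can be illustrated with the shortest possible QMF pair. Note that the 2-tap (Haar) filters below are purely didactic stand-ins for G.722's 24-tap, 60 dB stop-band filters, but they exhibit the same structure: critical subsampling in the analysis bank and aliasing cancellation in the synthesis bank:

```python
import numpy as np

def qmf_analysis(x):
    # 2-tap (Haar) quadrature mirror filter pair; the high-pass is
    # the frequency-mirrored low-pass, h1[n] = (-1)^n * h0[n].
    h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
    h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass
    low = np.convolve(x, h0)[1::2]    # filter, then critically subsample
    high = np.convolve(x, h1)[1::2]
    return low, high

def qmf_synthesis(low, high):
    g0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
    g1 = np.array([-1.0, 1.0]) / np.sqrt(2.0)  # sign flip cancels aliasing
    up_low = np.zeros(2 * len(low))            # upsample by zero-stuffing
    up_high = np.zeros(2 * len(high))
    up_low[::2] = low
    up_high[::2] = high
    return np.convolve(up_low, g0) + np.convolve(up_high, g1)
```

In the absence of quantization, the synthesis bank reconstructs the input exactly; once the subband signals are quantized (as in the ADPCM stage above), the quantization error components are not cancelled, which is why G.722 needs its much sharper filters.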
Hence a stripping of one or two least significant bits from the ADPCM codewords does not affect the adaptation processes and cannot lead to mistracking effects otherwise caused by different decoding processes in transmitter and receiver. Averaged MOS values from tests conducted worldwide in seven laboratories, with loudspeaker representation of wideband speech in seven different languages, are shown in Table 3. The uncoded source has the same MOS rating as the 64 kb/s version, implying no measurable difference in subjective quality. We notice a graceful degradation of subjective quality with decreasing bit rate and increasing bit error rate. The coder is error robust at the two rates it had been tested for; its subjective performance at a bit error rate of 0.001 still has a fair rating, close to the score of 64-kb/s PCM.

Beyond the Wideband Speech Coding Standard - With the introduction of narrowband ISDN low-cost video telephony, bit rates lower than 48 kb/s (the minimum bit rate of G.722) are needed for wideband speech. One approach has been to use 32 kb/s adaptive transform coding with Huffman coding, perceptually-based noise shaping, and dynamic bit allocation [11]. In another approach a speech analysis-by-synthesis technique has been followed: a low-delay 32 kb/s code-excited linear

Table 3. MOS values of CCITT G.722 wideband speech coder [1]. The MOS scores result from loudspeaker presentations.

Table 4. MOS values of 16 kb/s wideband speech coders [13].

predictive (CELP) coder, similar to the recent CCITT G.728 coder for narrowband speech but with an LPC filter order of only 32 (instead of 50), has shown an average rating in subjective tests essentially equal to that of the 64 kb/s G.722 coder, suggesting that its MOS value is above 4.0 [12]. Recently, most efforts have gone into reducing the bit rate even further, to 16 kb/s and below, at the expense of higher delays. The values in Table 4 result from extensive subjective tests for loudspeaker listening (the values for high-quality handset listening are slightly lower, 0.3 points on the average) [13]. The results show that male voices obtain a subjective score at a rate of 16 kb/s which is close to that of the CCITT G.722 performance at 56 kb/s, whereas the performance with female voices is below the CCITT G.722 performance at 48 kb/s. Finally, it should be noted that MPEG, in its current phase 2, has also addressed coding at lower sampling rates (see below).

Coding of Audio Signals


First steps to reduce audio bit rates have been based on techniques of instantaneous companding (e.g., a conversion of uniform 14-bit PCM into an 11-bit nonuniform PCM representation), and on various forms of block companding such as 16 to 14 bit scaling in digital satellite broadcasting systems (CCITT Rec. J.41, J.42). The BBC has used the near-instantaneously companded audio multiplex (NICAM) technique for the transmission of sound in broadcast television networks. Such coders provide a sufficient dynamic range for audio coding, but they do not reduce bit rates efficiently, since they exploit neither statistical dependencies between samples nor auditory masking effects. Bit rate reductions by fairly simple means are achieved in the interactive CD (CD-I), which supports 16 bit PCM at a sampling rate of 44.1 kHz and allows for three levels of Adaptive Differential PCM (ADPCM) with resolutions of 37.8 kHz/8 bit, 37.8 kHz/4 bit, and 18.9 kHz/4 bit. A good to excellent audio coding performance has been obtained more recently with various frequency domain coders, both in the classes of subband coding (SBC) and adaptive transform coding (ATC). The differences between these proposed



coders are in the number of spectral components and in the strategies for an efficient quantization of spectral components and masking of the resulting coding errors. Frequency-domain coding offers a more direct way than predictive coding for noise shaping and suppression of frequency components that need not be transmitted. In these coders the source spectrum is split into frequency bands, and each frequency component is quantized separately. Therefore, the quantization noise associated with a particular band is contained within that band. The number of bits used to encode each frequency component varies: components that are subjectively more important are quantized more finely, while components that are subjectively less important have fewer bits allocated, or may not be encoded at all. A dynamic bit allocation has to be employed that is controlled by the spectral short-term envelope of the source signal, and therefore bit allocation information has to be transmitted to the decoder efficiently as side information. One frequency domain coder that has been applied successfully to coding of audio signals is the already described subband-based CCITT G.722 wideband speech coder. It provides a good to fair quality and has been used with sampling rates of both 16 and 32 kHz [1, 14]. More recently, transform-based audio coding schemes have been proposed and tested. One example is Dolby's 128 kb/s AC-2 coder [15], a modified version of which has been evaluated in the CCIR process of digital audio broadcast standardization and has been shown to be close in performance to the ISO/MPEG Layer II audio coding algorithm at its fixed bit rate. A second example is AT&T's Perceptual Audio Coder (PAC), which extends the idea of perceptual coding to stereo pairs. It uses both L/R (left/right) and M/S (sum/difference) coding, switched in both frequency and time in a signal-dependent fashion [16]. Subjective tests have shown an average increase in MOS score of 0.6
over the dual monophonic mode for the same bit rate. Sony's Adaptive Transform Acoustic Coder (ATRAC) has been developed for portable digital audio, specifically for Sony's magneto-optical MiniDisc (MD) [17]. The coder uses a hybrid frequency mapping employing a signal splitting into three subbands (0-5.5, 5.5-11.0, and 11.0-22.0 kHz) followed by suitable dynamically windowed MDCT transforms (such a technique, described below, is also applied in Layer III of the ISO/MPEG Audio coder).
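The dynamic bit allocation idea described above can be sketched as a greedy loop that always spends the next bit where the noise-to-mask ratio is currently worst. The ~6 dB-per-bit SNR gain is a rule-of-thumb assumption of this sketch; the actual allocation procedures of the coders discussed here are more elaborate:

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    # Greedy perceptual bit allocation: repeatedly give one more bit
    # to the subband with the largest (worst) noise-to-mask ratio,
    # assuming ~6 dB of SNR gain per allocated bit.
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        nmr = [s - 6.02 * b if b < max_bits else float("-inf")
               for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] == float("-inf"):
            break  # every subband saturated
        bits[worst] += 1
    return bits
```

For example, with subband SMRs of 24, 6, and 0 dB and a budget of 5 bits, the loop gives four bits to the loud first subband and one to the second, leaving the perceptually irrelevant third subband uncoded.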

The Basics of the ISO/MPEG Audio Coder
Operating Modes - The ISO/MPEG Audio coding standard consists of three layers (codes) of increasing complexity and improving subjective performance. It supports sampling rates of 32, 44.1, and 48 kHz, and bit rates per monophonic channel between 32 and 192 kb/s, or per stereophonic channel between 128 and 384 kb/s. The standard offers the single channel mode, the stereo mode, the dual channel mode (to provide bilingual audio programs), and the optional joint stereo mode. In this latter mode the two coders for the left and right channel can support each other by exploiting statistical dependencies and irrelevancies between these channels to compress the audio bit rate to an even higher degree than is possible in monophonic transmission. A so-called intensity stereo mode is optional in the ISO/MPEG standard; if it is used, it will only be effective if the required bit rate exceeds the available bit rate, and it will only be applied to subbands corresponding to high frequencies.

Frequency Mapping - In the encoder the 16-bit PCM format audio signal is windowed and converted into spectral subband components via a polyphase filterbank consisting of 32 equally spaced bandpass filters (Fig. 5). Such filterbanks perfectly cancel the aliasing of adjacent overlapping bands in the absence of quantization errors; they are computationally very efficient, since an FFT can be used in the filtering process; and they are of moderate complexity and low delay. On the negative side, the filters are equally spaced, and therefore the frequency bands do not correspond well to the critical bands at low frequencies. The filters used are of order 511, which implies an impulse response of 5.33 ms length (at 48 kHz); they are designed for a high side-lobe attenuation exceeding 96 dB, which is necessary for sufficient cancellation of aliasing distortion caused by quantization noise.
The shape of the filter impulse response supports temporal masking of pre-echoes in case of an attack signal (as described below). The filtered bandpass output signals are critically subsampled (decimated), i.e., they are sampled at a rate that is twice the nominal bandwidth of the bandpass filters. At a 48 kHz sampling rate, each band has a width of 750 Hz and the sampling rate of each decimated subband is 48/32 = 1.5 kHz. Therefore we have as many frequency domain samples per second as time domain samples. In the receiver, the sampling rate of each subband is increased to that of the source signal by filling in the appropriate number of zero samples, and interpolated subband signals appear at the bandpass outputs of the synthesis filter bank. In Layer III a higher frequency resolution closer to critical band partitions is achieved by subdividing the 32 subband signals further in frequency content by applying a 6-point or 18-point modified Discrete Cosine Transform (MDCT) with 50 percent overlap to each of the subbands (Fig. 7). The maximum number of frequency components in Layer III is therefore 32 x 18 = 576, each representing a bandwidth of 24000/576 = 41.67 Hz.
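The numbers above follow from simple arithmetic; the short sketch below (plain Python, variable names ours) reproduces the subband width, the critically decimated subband rate, and the Layer III spectral line resolution from the text.

```python
fs = 48_000                       # source sampling rate in Hz
n_bands = 32                      # polyphase filterbank bands

band_width = (fs / 2) / n_bands   # each subband covers 750 Hz
subband_rate = fs / n_bands       # critically decimated rate: 1.5 kHz
total_rate = subband_rate * n_bands  # frequency-domain sample rate = time-domain rate

# Layer III hybrid mapping: an 18-point MDCT applied per subband
n_lines = n_bands * 18            # 576 spectral lines
line_width = (fs / 2) / n_lines   # 24000/576 = 41.67 Hz per line
```

Note that `total_rate` equals `fs`: critical sampling preserves the overall sample count, so the mapping itself adds no redundancy.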

ISO/MPEG Standardization
The recent ISO/MPEG Audio coding standard combines features of MUSICAM [18] and ASPEC [19]. Its main features will be described in the following, leaving out such important issues as framing, synchronization, and error protection. The standardization process included extensive subjective tests and objective evaluations of parameters such as complexity and overall delay. The total number of subjects (expert listeners) was around 60, approximately ten test sequences were used, and the sessions were performed in stereo with both loudspeakers and headphones [20, 21]. It should be noted that critical test items were chosen in the tests to evaluate the coders by their worst-case (not average) performance.


IEEE Communications Magazine

November 1993

Figure 5. Block structure of the ISO/MPEG Audio encoder and decoder, Layers I and II.

Quantization and Bit Allocation - In each of the 27 lowest subbands used in the encoding, blocks of 12 decimated samples are formed and block-companded, i.e., divided by a scalefactor such that the sample of largest magnitude is unity. Each block corresponds to 12 x 32 = 384 input samples, i.e., 8 ms of audio at a sampling rate of 48 kHz. The choice of the blocklength is affected by two conflicting requirements: on one hand, longer blocks reduce the side information bit rate; on the other hand, pre-masking is only effective with short blocks (see below). Each spectral component is quantized, whereby the number of quantizer levels for each component is obtained from a dynamic bit allocation rule that is controlled by a psychoacoustic model. The model computes the signal-to-mask ratio SMR, based on the global masking threshold, for each 12-sample block via an FFT (Fig. 2). The bit allocation algorithm then selects one uniform midtread quantizer out of a set of available quantizers such that both the bit rate requirement and the masking requirement are met. The procedure starts with the number of bits for the samples and scalefactors set to zero. In each iteration step, the signal-to-noise ratio SNR(m) of that subband quantizer is increased that contributes most to an improved performance. For that purpose, the noise-to-mask ratio NMR(m) = SMR - SNR(m) is calculated as the difference (in dB) between the actual quantization noise level and the minimum global masking threshold (Fig. 2). The quantized spectral subband components are then transmitted to the receiver together with scalefactor and bit allocation information. Note that the psychoacoustic model is only needed in the encoder, which makes the decoder less complex, a desirable feature for audio playback and audio broadcasting applications. Two psychoacoustic models, both based on auditory masking as described above, are given in the informative part of the standard; better or simpler models may be used instead.
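The iterative allocation described above can be sketched as a greedy loop; the version below is our simplification (the SNR-per-step table and the names are illustrative, not from the standard), but it follows the stated rule: repeatedly refine the subband whose noise-to-mask ratio is worst until the bit budget is spent or all noise is masked.

```python
import numpy as np

def greedy_bit_allocation(smr_db, snr_table, total_bits, bits_per_step):
    """Sketch of NMR-driven allocation: smr_db holds the per-subband
    signal-to-mask ratios (dB); snr_table[k] is the quantizer SNR (dB)
    achieved after k refinement steps (an assumed, illustrative table)."""
    steps = np.zeros(len(smr_db), dtype=int)
    bits_left = total_bits
    while bits_left >= bits_per_step:
        nmr = smr_db - snr_table[steps]      # NMR(m) = SMR(m) - SNR(m)
        m = int(np.argmax(nmr))              # subband that helps most
        if nmr[m] <= 0 or steps[m] + 1 >= len(snr_table):
            break                            # all noise masked (or table exhausted)
        steps[m] += 1                        # finer quantizer for subband m
        bits_left -= bits_per_step
    return steps
```

With a loud subband (SMR 20 dB) and a quiet one (SMR 5 dB) and roughly 6 dB of SNR per step, the loud subband is refined first, as expected.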

Pre-Echo Control - A crucial part of frequency domain coding of audio signals is the appearance of pre-echoes. Consider the case that a silent period is followed by a percussive sound, such as from castanets or triangles, within the same coding block. Such an attack will cause comparably large instantaneous quantization errors. In ATC, the inverse transform of the decoder will distribute such errors over the block; similarly, in SBC, the decoder bandpass filters will spread such errors. In both mappings, pre-echoes occur and can become distinctly audible, especially at low bit rates with comparably high error contributions. By the time domain effect of pre-masking, pre-echoes can be masked if the time spread is of short length (on the order of a few milliseconds).
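A toy experiment makes the effect visible (ours, using a plain DFT as a stand-in for the MDCT or subband mapping): coarsely quantizing the spectrum of a block that starts silent and ends with an attack leaks error into the silent samples before the attack.

```python
import numpy as np

n = 64
x = np.zeros(n)
x[48:] = 1.0                         # silence followed by an attack

X = np.fft.rfft(x)                   # block transform (illustrative stand-in)
step = 2.0                           # deliberately coarse quantizer
Xq = step * np.round(X / step)       # quantize real and imaginary parts
xq = np.fft.irfft(Xq, n)             # decoder: inverse transform

pre_echo = np.max(np.abs(xq[:48]))   # error spread into the silent region
```

The reconstructed block is nonzero before sample 48: the quantization error of the attack has been smeared across the whole block, which is exactly the pre-echo the window-switching machinery described later tries to confine.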

ISO/MPEG Layers

Layer I - Fig. 5 has already shown the block structure of the ISO/MPEG Audio encoder and decoder for Layers I and II. The Layer I coder uses fixed subband blocks containing 12 decimated samples. Each scalefactor is represented by 6 b and is transmitted for each subband block unless the bit allocation rule indicates that the subband block and its scalefactor need not be transmitted at all. For each 12-sample block the SMR is calculated via a 512-point FFT. For each subband the bit allocation selects one uniform midtread quantizer out of a set of 15 quantizers with M = 2^m - 1 levels (m = 0 or m = 2 ... 15 b). 4 b are needed per block for the bit allocation information. The decoding is straightforward: the subband sequences are reconstructed on the basis of 12-sample subband blocks, taking into account the decoded scalefactor and bit allocation information. Each time the subband samples of all 32 subbands have been calculated, they are applied to the synthesis filterbank, which also includes interpolation and windowing operations, and 32 consecutive 16 b PCM format audio samples are



Figure 6. MOS results of the ISO/MPEG Audio coder, Layer II, at a rate of 128 kb/s per monophonic channel. The MOS values are shown as bar plots with three lines on top: the middle one indicates the mean value, and the two remaining ones represent the mean plus/minus the 95% confidence interval. Subjective tests included 10 test items and 58 subjects. The item ALL is the average over all assessments. All values are averages over loudspeaker and headphone presentations [20].

calculated. In the ISO/MPEG subjective tests this Layer I codec had a mean MOS value (over 10 test items) of around 4.7 at a rate of 192 kb/s per monophonic channel, with a worst-case mean value for one item still above 4.4.
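The Layer I quantization path can be sketched as follows. This is our simplification: the standard uses a tabulated 6 b logarithmic scalefactor rather than the raw block maximum, but the block companding and the uniform midtread quantizer with an odd number of levels are as described in the text.

```python
import numpy as np

def quantize_subband_block(block, m):
    """Block-compand 12 decimated subband samples, then apply a uniform
    midtread quantizer with M = 2**m - 1 levels (odd count, so zero is a
    reconstruction level). Returns (reconstruction, scalefactor)."""
    scale = np.max(np.abs(block))
    if scale == 0.0:
        return np.zeros_like(block), 0.0   # block need not be transmitted
    x = block / scale                      # largest magnitude is now 1
    levels = 2 ** m - 1
    step = 2.0 / (levels - 1)              # uniform step over [-1, 1]
    q = np.round(x / step)                 # transmitted quantizer indices
    return q * step * scale, scale         # decoder: descale with scalefactor
```

The midtread property matters for audio: low-level (near-silent) samples quantize to exactly zero instead of toggling between two nonzero levels.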

Layer II - The ISO/MPEG Audio Layer II coder is basically similar to the Layer I coder but has a higher complexity and achieves a better performance owing to three modifications. First, the input to the psychoacoustic model is a 1024-point FFT, leading to a finer frequency resolution for the calculation of the global signal-to-mask ratio. Second, the overall scalefactor side information is reduced by a factor of around two: in each subband, blocks of 12 samples are formed and the scalefactors of three adjacent 12-sample blocks are calculated (which implies that 3 x 12 x 32 = 1152 input samples are taken into account). Depending on their relative values, only one, two, or all three scalefactors are transmitted. Only one scalefactor has to be transmitted if the differences are relatively small, and only the first of two adjacent scalefactors has to be transmitted if the second one has a smaller value, such that post-masking can be exploited. In case of large dynamic changes, all scalefactors may have to be used. The selected scalefactor or scalefactors are again represented by 6 b each. The pattern of the transmitted scalefactors is coded as 2 b/subband of side information called scalefactor select information. Third, a finer quantization with up to 16 b amplitude resolution is provided (which reduces the coding noise). On the other hand, the number of available quantizers decreases with increasing subband index (which keeps the side information small). The decoding follows that of Layer I. Due to the scalefactor selection process, the descaling has to be based on 3 x 12 = 36 subband samples, hence introducing additional delay. The total delay (without processing delay) of the Layer II codec is 45 ms at a 48 kHz sampling rate. Fig. 6 shows MOS values of the Layer II codec, as measured in the ISO/MPEG subjective tests, at a rate of 128 kb/s per monophonic channel. The mean MOS value (over all items) is around 4.8; the worst-case mean value is around 4.6.
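The scalefactor-select decision can be rendered qualitatively as below. This is only a sketch of the rule described above: the actual standard compares tabulated scalefactor indices and encodes one of several fixed transmission patterns, and the dB threshold here is invented for illustration.

```python
def n_scalefactors_to_send(sf, small_db=2.0):
    """sf: scalefactors (in dB) of three adjacent 12-sample blocks in one
    subband. Returns how many of them would need transmission under a
    simplified, illustrative reading of the Layer II rule."""
    d1 = abs(sf[0] - sf[1])
    d2 = abs(sf[1] - sf[2])
    if d1 <= small_db and d2 <= small_db:
        return 1   # nearly constant level: one scalefactor covers all three
    if sf[1] < sf[0] or sf[2] < sf[1]:
        return 2   # a decaying later block can ride on post-masking
    return 3       # large dynamic changes: transmit all three
```

On average this is where the roughly factor-of-two side-information saving quoted in the text comes from: steady passages collapse to one scalefactor per three blocks.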

Layer III - Figure 7 shows the block structure of the ISO/MPEG Layer III Audio coder, which introduces many new features that are not part of Layers I and II. The coder achieves a better performance, especially at low bit rates (64 kb/s per monophonic channel), due to an improved time-to-frequency mapping, an analysis-by-synthesis approach for the noise allocation, an advanced pre-echo control, and finally nonuniform quantization with entropy coding. A higher frequency resolution is achieved by employing a hybrid filterbank, a cascade of the polyphase filterbank and a dynamically windowed MDCT transform. The dynamic window switching [14, 22] allows switching from a higher frequency resolution (18-point MDCT, corresponding to 12 ms of audio) to a lower frequency resolution (6-point MDCT, corresponding to 4 ms of audio) for subbands above a chosen index when a higher time resolution is necessary in order to control time artifacts (pre-echoes) during nonstationary periods of the signal. In principle, a pre-echo is assumed when an instantaneous demand for a high number of bits occurs.

The MDCT output samples are nonuniformly quantized, thus providing both smaller mean-squared errors and masking, i.e., errors can be larger if the samples to be quantized are large. Huffman coding based on 32 tabulated code tables is applied to represent the quantizer indices in an efficient way. In addition, run-length coding of zero-value sequences increases the efficiency. A buffer maps the variable-wordlength codewords of the Huffman code tables into a constant bit rate; a buffer feedback is used to prevent the buffer from overflowing. In order to keep the quantization noise in all critical bands below the global masking threshold (noise allocation), an iterative analysis-by-synthesis method is employed, whereby the process of scaling, bit allocation, quantization, and coding of spectral data is carried out within two nested iteration loops.
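The nonuniform quantization can be illustrated with the power-law companding Layer III applies to its MDCT lines. The 3/4-power characteristic is that of the standard; the step-size handling is simplified here and the function names are ours.

```python
import numpy as np

def mdct_line_quantize(x, step):
    # compress with a 3/4 power law before rounding: larger spectral lines
    # see a coarser effective step, so the error grows with the signal
    return np.sign(x) * np.round((np.abs(x) / step) ** 0.75)

def mdct_line_dequantize(q, step):
    # decoder side: expand the indices with the inverse 4/3 power law
    return np.sign(q) * (np.abs(q) ** (4.0 / 3.0)) * step
```

This is the "errors can be larger if the samples are large" behavior in code: the quantization error scales with the line magnitude, which matches the way a strong spectral component masks noise in its neighborhood.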
The decoding follows that of the encoding process. At a rate of 64 kb/s per monophonic channel, the mean MOS values (over all items) for Layers II and III, as measured in the ISO/MPEG subjective tests, are around 3.1 and 3.7, respectively. Obviously the higher complexity of the Layer III


coder pays off at low bit rates. At a 128 kb/s joint stereo bit rate, seven of eight test items had a MOS value of 4 and above.

ISO/MPEG Phase 2

The ISO/MPEG standardization process is now in its Phase 2; an ISO committee draft will be completed in November 1993. Emphasis in the audio coding part of the new activity is on multichannel and multilingual audio and on an extension of the existing standard to lower sampling frequencies (16, 22.05, 24 kHz) and lower bit rates. The work on 3/2-stereo (multichannel coding) is carried out in close cooperation with CCIR. Backwards compatibility with the two-channel (2/0-) MPEG Phase 1 Audio will be taken into account, so that that decoder will deliver a correct front left and right channel downmix from the multichannel bitstream. In addition, a downmix to 3/0- and 2/2-stereo formats, and even to a 1/0-monophonic channel, should be possible.

Beyond the Existing Audio Coding Standard

The ISO/MPEG standard is the first standard in audio coding (besides the quasi-standard of the CD). It is worthwhile to note that its normative part describes the decoder and the meaning of the encoded bitstream, but that the encoder is not defined, thus leaving room for an evolutionary improvement of the encoder. In particular, different psychoacoustic models can be used, ranging from very simple ones (including none at all) to very complex ones, based on quality and implementability requirements (the standard gives two examples of such models). Therefore we will see different solutions for encoding in the future. In addition, a better understanding of binaural perception and of stereo and multichannel

presentation will lead to new proposals for a better use of the joint stereo mode provided by the standard. Several activities are already underway in high-quality coding at lower bit rates. Phase 2 and, in particular, the future Phase 4 of the ISO/MPEG standardization process address such codings at low and very low bit rates. A very attractive application is the transmission of a stereo audio signal over a 64 kb/s basic rate channel, or even over transmission channels within the CCITT fast modem project, which will support digital transmission at bit rates up to 24 kb/s over the existing analog telephone network. We can also expect further activities in the field of digital multichannel surround systems. Ongoing research will result in enhanced stereophonic representations by making use of interchannel correlations and interchannel masking effects; stereo-irrelevant components of the multichannel signal may be identified and reproduced in a monophonic format to bring the bit rates further down. In addition, the potential of such systems to deliver natural three-dimensional spatial acoustical images will lead to new proposals. We can also expect solutions for special presentations for people with impairments of hearing or vision that can make use of the multichannel configurations in various ways.

Conclusions

Powerful algorithms and standards are now available for the efficient coding of wideband speech and audio signals to enhance the quality of services in communication-based and storage-based audio-only and audiovisual applications. In particular, the ISO/MPEG audio coding standard


(Layer II) achieves subjective quality comparable to that of the CD at a rate of 256 kb/s for a stereophonic signal with an independent coding of the left and right channel. Digital sound broadcasting will make use of this Layer II standard for the distribution and emission of the audio material. At 128 kb/s, the Layer III standard provides the best audio quality for a stereophonic signal, with a quality that is quite close to that of the CD for many, but not all, audio test signals. Much lower bit rates will be covered by the recently inaugurated ISO/MPEG Phase 4, targeted at very low bit rates of tens of kb/s and below for the coding of audiovisual information. New algorithms, rather than extensions of existing algorithms, will be needed to meet this ambitious goal.

Acknowledgment
The author would like to thank the following people for their invaluable contributions to an earlier version of this paper: B. Bochow, K. Brandenburg, H. Fuchs, N. S. Jayant, L. van de Kerkhof, and A. Sugiyama.

References
[1] P. Mermelstein, "G.722, A New CCITT Coding Standard for Digital Transmission of Wideband Audio Signals," IEEE Commun. Mag., pp. 8-15, Jan. 1988.
[2] ISO/IEC JTC1/SC29, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s - IS 11172 (Part 3, Audio)."
[3] K. Brandenburg and G. Stoll, "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio," 92nd AES Convention, Vienna, March 1992, preprint no. 3336.
[4] G. Theile, "The New Sound Format '3/2-Stereo'," 94th AES Convention, Berlin, 1993.
[5] B. Scharf, "Critical Bands," in Foundations of Modern Auditory Theory, New York, pp. 159-202.
[6] E. Zwicker and R. Feldtkeller, Das Ohr als Nachrichtenempfänger, Stuttgart: S. Hirzel Verlag, 1967.
[7] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice Hall, 1984.
[8] J. G. Beerends and J. A. Stemerdink, "A Perceptual Audio Quality Measure," 92nd AES Convention, Vienna, March 1992, preprint 3311.
[9] K. Brandenburg and T. Sporer, "'NMR' and 'Masking Flag': Evaluation of Quality Using Perceptual Criteria," 11th AES International Conference, pp. 169-179, Portland, 1992.
[10] N. S. Jayant, J. D. Johnston, and Y. Shoham, "Coding of Wideband Speech," Speech Communication 11 (1992), pp. 127-138.
[11] S. R. Quackenbush, "A 7 kHz Bandwidth, 32 kbps Speech Coder for ISDN," Proc. Internat. Conf. Acoust. Speech Signal Process., paper S1.1, 1991.
[12] E. Ordentlich and Y. Shoham, "Low-Delay Code-Excited Linear-Predictive Coding of Wideband Speech at 32 kbps," Proc. Internat. Conf. Acoust. Speech Signal Process. 1991, paper S1.3.
[13] A. Fuldseth et al., "Wideband Speech Coding at 16 kbit/s for a Videophone Application," Speech Communication 11, pp. 139-148, 1992.
[14] M. Iwadare, A. Sugiyama, F. Hazu, A. Hirano, and T. Nishitani, "A 128 kb/s Hi-Fi Audio CODEC Based on Adaptive Transform Coding with Adaptive Block Size," IEEE J. on Sel. Areas in Commun., vol. 10, no. 1, pp. 138-144, Jan. 1992.
[15] G. Davidson, L. Fielder, and M. Antill, "High Quality Audio Transform Coding at 128 kbits/s," Proc. of the ICASSP 1990.
[16] J. D. Johnston and A. J. Ferreira, "Sum-Difference Stereo Transform Coding," Proc. ICASSP '92, pp. II-569 - II-572.
[17] K. Tsutsui et al., "Adaptive Transform Acoustic Coding for MiniDisc," 93rd AES Convention, San Francisco, 1992, preprint 3456.
[18] G. Stoll and Y. F. Dehery, "High Quality Audio Bit-Rate Reduction System Family for Different Applications," Proc. IEEE Intern. Conf. on Commun. ICC '90, Rec. Vol. 3, No. 322.3, pp. 937-941, 1990.
[19] K. Brandenburg, J. Herre, J. D. Johnston, Y. Mahieux, and E. F. Schroeder, "ASPEC: Adaptive Spectral Perceptual Entropy Coding of High Quality Music Signals," 90th AES Convention, Paris, 1991, preprint 3011.
[20] T. Ryden, C. Grewin, and S. Bergman, "The SR Report on the MPEG/Audio Subjective Listening Tests in Stockholm April/May 1991," ISO/IEC JTC1/SC29/WG11: Doc. No. MPEG 91/010, May 1991.
[21] H. Fuchs, "Report on the MPEG/Audio Subjective Listening Tests in Hannover," ISO/IEC JTC1/SC29/WG11: Doc. No. MPEG 91/331, Nov. 1991.
[22] B. Edler, "Coding of Audio Signals with Overlapping Block Transform and Adaptive Window Functions" (in German), Frequenz, vol. 43, pp. 252-256, 1989.

Biography
PETER NOLL [F '82] is a professor of telecommunications at the Technical University of Berlin. His research interests are in waveform coding and communication theory. He has authored many technical papers in those fields and is coauthor of the book Digital Coding of Waveforms: Principles and Applications to Speech and Video. He was a recipient of the 1976 NTG Award (Germany) and of the 1977 IEEE ASSP Senior Award (IEEE Acoustics, Speech, and Signal Processing Society). Since 1991 he has chaired the Audio Subgroup within ISO/MPEG.

