Stereo Coding for Audio Compression

Rui Wang, Harold Nyikal, James Yu
March 7, 2005

Abstract

A perceptual audio coder with stereo coding is implemented and reviewed. Mid/Side (M/S) stereo coding is incorporated into the baseline perceptual coder. The coder was tested at 128 kbps with various audio files ranging from music to speech. Stereo coding is shown to reduce much of the redundancy in stereo signals.



Audio compression has become an increasingly important technique in audio transmission and storage. Perceptually lossless compression is achieved by exploiting psychoacoustic models that discard information the human auditory system cannot perceive. This yields much higher compression ratios than entropy coding alone. Usually, this is performed in a transform domain, where information is discarded by quantizing transform bins more coarsely. Joint stereo coding is an extension to the nominal block floating point quantization scheme. The main assumption in stereo coding is that the left and right channels of audio are highly correlated. This is usually the case for speech and music, since the two microphones are spatially close and used at the same time to record the same sounds. This strong correlation suggests that there is redundancy in the stereo signal. First, we review the perceptual model that is at the heart of the compression scheme. Then, the stereo coder is detailed and analyzed. Finally, compression quality results are provided.


Overview of Perceptual Model

The perceptual model is based on the characteristics of the human auditory system. The model ultimately dictates the number of bits assigned to each line in the frequency domain. By assigning different numbers of bits, the coder quantizes each frequency line with a different level of coarseness based on its relative importance. The steps of the perceptual analysis are as follows:

1. The signal is divided into blocks of size N.
2. The FFT is applied to each block.
3. The masking model is used to determine the SMR for each Bark subband.
4. A waterfilling algorithm assigns the number of bits to be consumed for each subband.
5. The block is quantized using block floating point quantization.

The quantization actually occurs in the MDCT domain: the signal is analyzed in the FFT domain and processed in the MDCT domain. The MDCT is usually the preferred transform since it has favorable properties in terms of implementation, windowing, and audio quality.



The main mechanism behind the perceptual audio coding model is masking. Masking occurs when one frequency component of a signal reduces the perceived strength of another frequency component. For each frequency line in the signal, a masking curve is derived in the Bark domain using

$$ F(dz, L_M) = \left(-27 + 0.37 \max(L_M - 40, 0)\, u(dz)\right) |dz| \tag{1} $$

where L_M is the masker’s sound pressure level (SPL) in dB, u(dz) is the unit step function, and dz is the distance to the masker in Barks. This shape looks like a triangle with a constant slope on the left and a slope on the right that becomes shallower as the masker’s SPL increases. This function is calculated for every frequency line, and the final masking curve is the point-by-point maximum of all the curves and the threshold-of-hearing curve. The relative importance of each frequency line is determined by the signal-to-mask ratio (SMR), measured in dB. The actual bit allocation is performed on each subband, where the subbands are determined by the standard Bark scale. Therefore, a subband with a higher maximum SMR is allocated more bits than a subband with a lower maximum SMR. Signals that are below the masking threshold will be masked (or hidden) by another, usually stronger, signal component. Thus, we consider these signals to be less important in reconstructing the signal.
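The masking curve construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: dz is taken as the Bark position of the evaluated line minus that of the masker, the masker's own SPL is used as the peak of its curve (any additional downward offset a full model might apply is omitted), and the names spreading_db and masked_threshold_db are hypothetical.

```python
def spreading_db(dz, masker_spl_db):
    """Equation (1): level change in dB at a Bark distance dz from the masker."""
    # u(dz): the level-dependent term only applies above the masker (dz >= 0).
    step = 1.0 if dz >= 0 else 0.0
    slope = -27.0 + 0.37 * max(masker_spl_db - 40.0, 0.0) * step
    return slope * abs(dz)


def masked_threshold_db(z, maskers, quiet_db):
    """Point-by-point maximum of all masker curves and the threshold in quiet.

    z        -- Bark position at which the threshold is evaluated
    maskers  -- list of (bark_position, spl_db) pairs, one per frequency line
    quiet_db -- threshold of hearing at z, in dB (assumed to be given)
    """
    levels = [spl + spreading_db(z - zm, spl) for zm, spl in maskers]
    return max(levels + [quiet_db])
```

For an 80 dB masker, the curve falls by 27 dB per Bark below the masker but only by 12.2 dB per Bark above it, which is the asymmetric triangle described in the text.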


Bit Allocation

The bit allocation scheme is based on block floating point. Each subband is assigned a scale factor, which applies to all the lines in that subband. Each line in the subband is then associated with a mantissa. The perceptual coder determines how many mantissa bits, R_b, a subband will receive. A waterfilling algorithm is used to assign bits to each subband based on the SMR. The algorithm is as follows:

1. Determine the number of bits in the bit pool P, based on the desired bit rate, block size, and sampling frequency.
2. Sort the subbands by SMR.
3. Add one bit to the subband with the highest SMR, or two bits if this is the first time that subband is allocated bits.
4. Decrement that subband's SMR by 6 dB for each bit allocated.
5. Decrement the bit pool.
6. Go to step 2, and repeat until the bit pool is emptied.

Once all the bits have been used, the block is quantized using the block floating point quantization scheme and efficiently packed into the compressed file.
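The waterfilling steps can be sketched as follows. The sketch makes several assumptions that the text does not state: each mantissa bit given to a subband costs one bit from the pool per line in that subband, a subband is dropped from consideration once the pool cannot pay for its next step, and mantissas are capped at a hypothetical max_bits. Taking the maximum on each pass replaces the explicit sort in step 2.

```python
def waterfill(smr_db, lines_per_band, pool_bits, max_bits=16):
    """Greedy mantissa-bit allocation per subband, driven by SMR.

    smr_db         -- SMR of each subband in dB
    lines_per_band -- number of frequency lines in each subband
    pool_bits      -- size of the bit pool P
    max_bits       -- assumed cap on mantissa bits per line
    """
    smr = list(smr_db)
    bits = [0] * len(smr)
    active = list(range(len(smr)))
    while active:
        # Steps 2-3: pick the subband that currently has the highest SMR.
        band = max(active, key=lambda i: smr[i])
        step = 2 if bits[band] == 0 else 1      # two bits on first allocation
        cost = step * lines_per_band[band]
        if cost > pool_bits or bits[band] + step > max_bits:
            active.remove(band)                  # this subband cannot grow further
            continue
        bits[band] += step
        smr[band] -= 6.0 * step                  # step 4: about 6 dB per bit
        pool_bits -= cost                        # step 5
    return bits
```

For example, two single-line subbands with SMRs of 20 dB and 5 dB and a pool of 5 bits end up with 3 and 2 mantissa bits respectively.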



The decoding algorithm is much less complex than the encoding scheme. This is desirable, since in many cases the decoder has less processing power than the encoder. No psychoacoustic analysis needs to be performed in the decoder; only the R_b values are needed to correctly dequantize the frequency lines. After that, the inverse MDCT is applied block by block, and the resulting blocks are overlap-added to produce the reconstructed signal.


Stereo Coding

Joint stereo coding is an extension of the psychoacoustic model that takes advantage of the typically high correlation between the signal power spectra of the left and right channels in stereo audio to improve coding gains. There are various ways of achieving this in practice. We chose to implement mid/side (M/S) stereo coding as outlined in [1]. In this method, instead of transmitting the left and right channels, the normalized sum (mid) and difference (side) signals are transmitted. Also, the two channels share a common bit pool. Depending on the signal, this can reduce the data rate of the signal by up to 50%. For example, consider a stereo signal with identical left and right channels. The difference of the two channels will be zero, and thus the side information can be transmitted with a single bit indicating that it is all zeros. This frees up bits, allowing the mid information to be transmitted with twice as many bits; hence, the coding gain is roughly 50%. In any case, the left and right signals can be completely reconstructed at the decoder. Though stereo signals are seldom like the example above, the side information is usually smaller in value than either the left or right channel, suggesting a reduction in the number of bits. However, cross-channel psychoacoustics play a big role in the perception of stereo sound, and because the mid and the side both contain information on the left and right channels, we must perform bit allocation for the mid and side information based on a cross-channel psychoacoustic model. Here, we detail the different components in stereo coding. The flow chart of the encoding algorithm can be seen in Figure 1, and the decoding algorithm is shown in Figure 2.


M/S Decision

The first step in stereo coding is to decide whether to transmit data as left/right or mid/side. There are cases where there are no significant gains in transmitting mid/side information over left/right in certain subbands. In cases like these, the left/right information is transmitted. Our decision for M/S is applied for each subband of the signal. The decision thresholds are
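
$$ \sum_{k=f_{lower}}^{f_{higher}} \left(l_k^2 - r_k^2\right) < 0.8 \sum_{k=f_{lower}}^{f_{higher}} \left(l_k^2 + r_k^2\right) \tag{2} $$

$$ \sum_{k=f_{lower}}^{f_{higher}} \left(l_k^2 - r_k^2\right) > 0.8 \sum_{k=f_{lower}}^{f_{higher}} \left(l_k^2 + r_k^2\right) \tag{3} $$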


where l_k and r_k correspond to the FFT spectral line amplitudes computed in the psychoacoustic model, and f_lower and f_higher correspond to the lower and upper lines within a subband. If either of these conditions is met, M/S is transmitted; otherwise, L/R is transmitted. This condition allows M/S transmission in cases where the mid and the side differ in energy by a certain threshold (in this case, 80%).


Figure 1: Flow chart of the stereo encoding algorithm.


Figure 2: Flow chart of the stereo decoding algorithm.

The values of M and S are calculated as follows:

$$ M = \frac{L + R}{2} \tag{4} $$

$$ S = \frac{L - R}{2} \tag{5} $$

where L and R are the filter bank spectral line amplitudes. We can see that no actual information is lost in the transformation to M/S. Both the L and R channels can be easily recovered from the M and S channels.
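A minimal sketch of equations (4) and (5) and their inverse, assuming each channel is given as a plain list of spectral line amplitudes; the function names are hypothetical. The inverse follows directly from adding and subtracting the two equations: L = M + S and R = M - S.

```python
def to_mid_side(left, right):
    """Equations (4) and (5): normalized sum and difference per spectral line."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Inverse transform used by the decoder: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are identical, every side value is zero, which is the case where all bits can be spent on the mid signal. Any loss in the coder comes from quantizing M and S, not from the transform itself.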


Masking in Stereo

Next, the masking thresholds for M and S need to be calculated. This is a step-wise process. First, equation (1) is applied to each M and S frequency line in the same manner as in the baseline perceptual model to calculate the basic masking thresholds, denoted BTHR_m and BTHR_s [1]. To calculate the stereo masking contributions of the M and S channels, an additional factor, the masking level difference (MLD) factor, is calculated at each frequency line and multiplied by each of the M and S basic masking thresholds to obtain the masking level differences, denoted MLD_m and MLD_s. The MLD provides a second level of detectability of noise in the M and S channels based on the masking level differences between the channels [1]. Essentially, the MLD is a measure of how detectable a masked signal in the M channel is in the S channel and vice versa. The equation used to calculate the MLD factor is as follows [1]:

$$ MLD = 10^{1.25\left(1 - \cos\left(\pi \frac{\min(z, 15.5)}{15.5}\right)\right) - 2.5} \tag{6} $$

where z is the frequency in Barks. Figure 3 shows what the MLD curve looks like.

Figure 3: The MLD factor that is applied to the masking model (MLD factor, from 0 to 1, plotted against frequency in Barks, from 0 to 25).

Now, the masking level differences can be calculated as:

$$ MLD_m = MLD \times BTHR_m \tag{7} $$

$$ MLD_s = MLD \times BTHR_s \tag{8} $$

The actual thresholds for M and S are calculated as follows:

$$ THR_m = \max(BTHR_m, \min(BTHR_s, MLD_s)) \tag{9} $$

$$ THR_s = \max(BTHR_s, \min(BTHR_m, MLD_m)) \tag{10} $$

The MLD signal essentially substitutes for the BTHR signal in cases where there is a chance of stereo unmasking [1].
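The threshold calculation for a single frequency line can be sketched as below. Two readings are assumed here and should be checked against [1]: the MLD factor is taken as 10 raised to the power 1.25(1 - cos(pi min(z, 15.5)/15.5)) - 2.5, and MLD_s is formed from BTHR_s (with MLD_m formed from BTHR_m). Thresholds are treated as linear power values, and the function names are hypothetical.

```python
import math


def mld_factor(z):
    """MLD factor at Bark frequency z; rises from 10**-2.5 to 1 at 15.5 Barks."""
    exponent = 1.25 * (1.0 - math.cos(math.pi * min(z, 15.5) / 15.5)) - 2.5
    return 10.0 ** exponent


def stereo_thresholds(bthr_m, bthr_s, z):
    """Return (THR_m, THR_s) for one line from the basic thresholds."""
    mld = mld_factor(z)
    mld_m = mld * bthr_m            # assumed reading of equation (7)
    mld_s = mld * bthr_s            # assumed reading of equation (8)
    thr_m = max(bthr_m, min(bthr_s, mld_s))
    thr_s = max(bthr_s, min(bthr_m, mld_m))
    return thr_m, thr_s
```

At low Bark values the factor is small, so each channel keeps its own basic threshold; near and above 15.5 Barks the factor approaches 1, and the larger of the two basic thresholds can apply to both channels.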


Bit Allocation

The bit allocation scheme is exactly the same as in the baseline coder. The main difference is that both channels now share a common bit pool P and use the SMRs obtained from the masking curves calculated with the additional MLD factor. The waterfilling algorithm is now applied to all the frequency lines of both channels together, which allows it to assign bits to lines regardless of which channel they are in. Essentially, this is where the coding gains are achieved. If one channel has a much higher SMR than the other (as can happen in the M/S representation), then more bits will be allocated to that channel than to the other.


Packing and Decoding

The only extra information needed for the joint stereo decoder is an additional bit that tells whether each subband is in an L/R or M/S representation. This information is passed alongside the usual bit allocation bits. The decoder parses this information and will convert any M/S representation into L/R for audio playback.



We applied the joint stereo coding algorithm to various audio signals. One of the more important tests was to make sure that the stereo image quality is not affected by coding the signal as M/S. This was tested by modulating a signal’s channels by sinusoids with different phase offsets. Specifically, we modulated the left channel using a slowly varying cosine and the right channel with a slowly varying sine. This results in the stereo image weaving from left to right. During these tests we did not notice any degradation in the stereo image or the audio quality. For the rest of the quality tests, we used a constant bitrate of 128 kbps on various audio signals. The results were on par with the baseline coder (i.e., the quality was very good). Table 1 shows the listening test results as subjective difference grades (SDG).
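The weaving test signal can be generated along the following lines. This is a sketch only: the modulation rate pan_hz and the function name are hypothetical, since the text gives neither the rate nor the implementation used in the tests.

```python
import math


def weave(left, right, sample_rate, pan_hz=0.25):
    """Modulate the left channel by a slow cosine and the right by a slow sine."""
    out_left, out_right = [], []
    for n, (l, r) in enumerate(zip(left, right)):
        phase = 2.0 * math.pi * pan_hz * n / sample_rate
        out_left.append(l * math.cos(phase))     # left gain: slowly varying cosine
        out_right.append(r * math.sin(phase))    # right gain: slowly varying sine
    return out_left, out_right
```

Because the two gains are a quarter cycle apart, one channel is at full level while the other passes through zero, so the image moves between the speakers while the summed power of the two gains stays constant.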



Audio Name      SDG
Castanets       -0.5
Rock Music       0
Pop Music        0
Harpsichord     -0.25
Glockenspiel    -0.1
Bass Singer      0

Table 1: Listening test results for the audio signals compressed with the stereo coder. All results use the SDG scale.

Stereo coding has been shown to be very useful in reducing the redundancy in stereo audio signals. One can achieve significant gains in stereo coding, which can be utilized to either boost the quality of the reconstructed signal or to lower the bitrate while keeping the signal quality constant with respect to the original coder. This is due to the fact that the M/S representation of the signal is essentially lossless. Moreover, stereo coding does not hurt the stereo image when correctly utilizing the stereo masking model and shared bit pool methods. The tests show that the overall quality remains the same or is better. Stereo coding is also popular with standard audio compression techniques, including MP3.


Future Work

There are many possible extensions to stereo coding. One of these is intensity coding. The main concept is that since much of the signal can be redundant in both channels, we can code some parts of it as a mono signal and expand it back to a stereo signal at the decoder. This is usually done for higher frequencies, where the ear is less sensitive to the stereo image. Using this idea gives a gain of 50% within those particular frequency bands. Another possible extension is to use a variable bit pool that saves bits for later use. The savings will be most dramatic in cases where the M/S representation is severely skewed to one side. Theoretically, we would only need half the number of bits to represent such a signal while keeping the quality constant. The bits saved may then be used for more bit-starved blocks later in the signal. For example, this may alleviate pre-echo effects due to powerful transients.

[1] Johnston and Ferreira, Sum-Difference Stereo Transform Coding, Proc. ICASSP, pp. 569-571, May 1992.


[2] Bosi and Goldberg, Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.