You are on page 1of 4

204 IEEE SIGNAL PROCESSING LETTERS, VOL. 10, NO.

7, JULY 2003

Speech Probability Distribution


Saeed Gazor, Senior Member, IEEE, and Wei Zhang

Abstract—It is demonstrated that the distribution of speech


samples is well described by Laplacian distribution (LD). The
widely known speech distributions, i.e., LD, Gaussian distribution
(GD), generalized GD, and gamma distribution, are tested as four
hypotheses, and it is proved that speech samples during voice
activity intervals are Laplacian random variables. A decorrelation
transformation is then applied to speech samples to approximate
their multivariate distribution. To do this, speech is decomposed Fig. 1. PDF estimate of a typical speech signal.
using an adaptive Karhunen–Loève transform or a discrete
cosine transform. Then, the distributions of speech components in
decorrelated domains are investigated. Experimental evaluations increases, the pdf of coefficients is believed to converge to GD.
prove that the statistics of speech signals are like a multivariate In practice, the GD assumption for speech is not an accurate
LD. In brief, all marginal distributions of speech are accurately approximation, not only because of the length of data but also
described by LD in decorrelated domains. While the energies of
because the necessary conditions for the central limit theorem
speech components are time-varying, their distribution shape
remains Laplacian. (even in its asymptotical weak form) are not satisfied.
The underlying assumption in most speech processing
Index Terms—Speech coding, speech processing.
schemes is that the properties of the speech signal change
relatively slowly with time. This assumption leads to a variety
I. INTRODUCTION of processing methods involving intervals of about 1–60 ms, in
which frames of the speech signal are isolated and processed as
M ODELING of speech signals is motivated by many ap-
plications of speech processing, such as speech coding
[3], speech recognition [11], speech enhancement [9], and inde-
if they had the same fixed properties [9]. The aim of this letter
is to investigate the distribution of speech samples in these
kinds of time intervals, analogous to investigation of natural
pendent component analysis and computational auditory scene
images in the DCT domain [7], [8].
analysis (e.g., see [6] and [12]). In order to design better speech
processing systems, it is important to have a reasonable approx-
imation for the probability density function (pdf) of speech and II. TIME DOMAIN MODELING
noise. Various statistics of speech signals such as speech kur- In this section, samples of a speech signal are considered as
tosis have been studied in many scenarios and have found a wide samples of a random variable, and the speech signal’s pdf in
range of applications. the time domain is investigated. This is in contrast to the fol-
In the early 1950s, the pdf of speech signals in the time do- lowing sections, where the signal is first transformed into the
main was investigated by Davenport [1]. Early results show that Karhunen–Loève transform (KLT) and DCT domains Subse-
a good approximation is the gamma distribution ( -D), while a quently, the joint distribution of the transformed components
poorer and simpler approximation is reported to be the Lapla- is investigated. The modeling in the time domain is simple to
cian distribution (LD) [2]–[4]. In many applications, the speech understand and to use, provides a useful basis for further inves-
signal is assumed to be Gaussian in order to simplify the deriva- tigations, and motivates us to model the speech signal in other
tion of speech processing algorithms. More recently, it has been domains.
shown that samples of a bandlimited speech signal can be rep- Fig. 1 gives the amplitude distribution of a typical speech
resented by a multivariate Gaussian distribution (GD) with a signal (speech samples are taken from http://www.dailywav.com
slowly time-varying power if samples are taken within short with an SNR 15 dB and about 50% of silence duration) where
time intervals of less than 5 ms [4], [5]. The power variation the axis is depicted using a logarithmic scale, and the Lapla-
makes the long-term distribution a -D or LD. cian curve is plotted after a curve fitting. From the fact that
As each coefficient of a discrete-time Fourier transform the experimental curve has a linear tail and is symmetrical, it
(DFT) is a linear weighted sum of speech samples as random could be concluded that the pdf of this signal follows an LD
variables, a GD is justified by the central limit theorem for as indicated by Davenport [1] and Richards [2]. The portions
speech signals in the DFT domain. This assumption has been of the curve with least probability cannot be estimated accu-
used in several applications, e.g., [10]. As the frame length rately and produce noisy “tails” in the distribution graph of
Fig. 1. The difference between the experimental distribution
Manuscript received November 18, 2001; revised September 30, 2002. The and LD in the central region corresponds to the silence inter-
associate editor coordinating the review of this manuscript and approving it for vals in the speech signal. A generalized Gamma pdf
publication was Dr. Mark Hasegawa-Johnson. (with and ), which includes
The authors are with the Department of Electrical and Computer Engineering,
Queen’s University, Kingston, ON, K7L 3N6 Canada. the LD (i.e., ) and for , commonly referred to as
Digital Object Identifier 10.1109/LSP.2003.813679 Gamma density, can fit better in the central part of the curve for
1070-9908/03$17.00 © 2003 IEEE
Authorized licensed use limited to: the Leddy Library at the University of Windsor. Downloaded on December 08,2023 at 16:37:02 UTC from IEEE Xplore. Restrictions apply.
GAZOR AND ZHANG: SPEECH PROBABILITY DISTRIBUTION 205

(a) (b) (c) (d)


Fig. 2. Comparison of contours for bivariate pdf of two samples of speech (x(t); x(t +  )). (Solid line) Experimental results estimated by
f^ (x; y)
= 1=N (1=4 ) exp( ( x0j 0 j j 0 j
x(t) + y x(t +  ) )= ) where x(t) is a long speech signal, and  is a small number.
0j j j j
(Dashed-dotted) LD f (x; y ) = (1=4a ) exp( ( x + y )=a). (a)  = 25 ms. (b)  = 2:5 ms. (c)  = 1 ms. (d)  = 0:1 ms.

Fig. 3.  test result over a time interval (200 ms) for four pdf hypotheses. jj
Fig. 4. Moment test ( = E [ x ]= E [x ]) value for a speech signal and the
nominal values for different hypotheses over time intervals of 200 ms.

suggested values of in the range of 0.8 to 0.5 [2]. Using TABLE I


a voice activity detector, we omitted silence intervals from the AVERAGE  VALUE FOR DIFFERENT WINDOW LENGTHS
data and noted that the central part fits the LD well. Previously
published results [1], [2] show that speech has a -D. We con-
clude that samples of speech signals during voice activity in-
tervals are Laplacian, and speech as a mixture of silences and
voice activity intervals (depending on the SNR and the silence
duration percentage: ), results in a -D.
Fig. 2 illustrates the contours of bivariate pdf of two samples
of speech. For time intervals less than 1 ms, it can be seen that
the experimental contours are elliptical; thus, the multivariate
density of speech is spherically invariant [4], [5]. For time in-
tervals over 2 ms, the contours are more like squares; therefore,
we conclude that the bivariate pdf looks like an LD and is not
spherically invariant.
One may ask whether only long data samples of speech ex-
hibit an LD. For time intervals greater than 5 ms, using tests de-
scribed in the Appendixes, our experiments show that among the hypotheses. For each speech frame as used in the test, we
following widely known speech pdfs [1]–[6] the LD is favored: calculated the sample variance and the sample mean of
• Gaussian pdf: ; the absolute value . The test statistic
• Laplacian pdf: ; is then calculated and depicted in Fig. 4. Each distribution has a
• Gamma pdf: , ; different nominal test statistic for ; these are illustrated by four
• GGD pdf: (see Appendix B). straight lines. We can see that the LD is a better approximation
First, the test value (see Appendix A) is calculated for the than the others. Again, Fig. 4 shows that the shape of the
above distributions for each 200–ms time frame (for shorter distribution ( ) is closer to LD during active speech intervals
frames, results are similar) where the overlap between succes- compared to silent intervals. The -D is the next candidate in
sive frames is 100 ms. The test result for a speech signal the frames that are a mixture of silence and voice activity. The
is displayed in Fig. 3. The lower the value the better the moment test favors GD for silent intervals.
corresponding hypothesis fits the data. Thus, from Fig. 3, we Table I illustrates the test value of a 20-s speech signal for
conclude that the best fit is obtained under the hypothesis that different pdf hypotheses and frame lengths. The value is cal-
the speech has an LD. We note that only during silent intervals culated within each frame and is averaged over time. Table I
might the value for the Gaussian hypothesis be slightly less shows that the speech signal has an LD within time frames
than that of the Laplacian hypothesis; that is again to say that longer than 5 ms and a GD for time frames of less than 2.5 ms.
speech has an LD during voice activity time intervals. For time frames longer than 0.5 s, a -D or GGD with
Both GD and LD functions are special cases of the gen- is favored, because long frames usually consist of both voiced
eralized Gaussian distribution (GGD) function [6], [8]. In and silent samples. This behavior is justified by Fig. 5, which
Appendix B, a moment test is developed to evaluate these depicts variations of two moment values of an LD-GD mixture.
Authorized licensed use limited to: the Leddy Library at the University of Windsor. Downloaded on December 08,2023 at 16:37:02 UTC from IEEE Xplore. Restrictions apply.
206 IEEE SIGNAL PROCESSING LETTERS, VOL. 10, NO. 7, JULY 2003

(a)

Fig. 5. Moment test values jj


= E [ x ]= E [x ] and  = E [x ]=E [x ]
of a random selection of v 
LD (with probability p) or u 
GD (with
0  0
probability 1 p), i.e., x (1 p)f (x) + pf (x), SNR = E [v ]=E [u ].

(b)

Fig. 6. PDF of three dominant speech signal components in the adaptive KLT
domain proposed in [9]. (Solid line) Experimental. (Dashed line) LD. (c)
Fig. 7. Short-time (200 ms)  test results for KLT coefficients.

These moments, depending on the SNR and the silence duration


percentage , are similar to those of -D, GGD with
, GD, or LD. This is why in some of the literature -D
is believed to be the best model, while in other references GGDs
with or LD are favored [1]–[6].

III. KLT DOMAIN MODELING (a)


We have seen that time domain samples of a speech signal
are very well described by an LD. This gives a first-order pdf
characterization for speech. In order to obtain a higher order
statistical description, we first transform the speech signal into
uncorrelated components. We then show that these components (b)
themselves are distributed like Laplacian random variables with
different parameters. The speech signal is regarded as a nonsta-
tionary process, and yet we will see that the shapes of the pdf
of its components are almost time-invariant and are LDs with
slowly time-varying parameters.
A KLT is used here to decorrelate the data vector and to sep-
arate the speech components [9], [11], [12]. It is assumed that (c)
successive speech samples could be accurately modeled by the Fig. 8. Moment test results for DCT coefficients of a speech signal.
pdf of the KLT components that are uncorrelated random vari-
ables. This modeling is described by the KLT and the energy (or
coefficients are better approximated by the LD rather than by
mean absolute value) of the KLT components. Speech is often
the others. Hence, the joint pdf of samples of a speech signal
viewed as a nonstationary process; therefore, this transforma-
could be approximated by assuming that speech in the KLT do-
tion varies slowly with time and should be updated with the ar-
main is a multivariate random vector with uncorrelated LD ele-
rival of new samples. In this letter, the KLT is adaptively esti-
ments and different, slowly time-varying parameters.
mated by the algorithm proposed in [9]. We consider the three
most dominant components of a speech signal in the KLT do-
IV. DCT DOMAIN MODELING
main obtained by the algorithm in [9]. Fig. 6 shows that the pdfs
of these components (that are uncorrelated) for a 10-s speech The KLT is time varying, data dependent, and needs to be up-
sample are approximated very well by Laplacian pdfs with dif- dated for each new speech sample; therefore, simple orthogonal
ferent parameters. Note that minor components that have very transformations like FFT are used instead in many applications.
small energies are neglected in practice. Of all discrete orthogonal transforms, the DCT is the most pop-
To further investigate the behavior of speech components, the ular, not only for its nearly optimal performance in whitening
test is similarly applied to these three components. The re- lowpass signals, but also for its computational efficiency.
sults are depicted in Fig. 7, where the parameter for each KLT Speech signal vectors are transformed using DCT. Then, the
component is also computed over every 200-ms time frame with moment test is applied to each DCT coefficient. The results for
100-ms overlaps between frames. Again, we see that the KLT three components are depicted in Fig. 8. Fig. 8(a) shows that
Authorized licensed use limited to: the Leddy Library at the University of Windsor. Downloaded on December 08,2023 at 16:37:02 UTC from IEEE Xplore. Restrictions apply.
GAZOR AND ZHANG: SPEECH PROBABILITY DISTRIBUTION 207

only the first coefficient (i.e., the discrete cosine (DC) compo- where and denotes the
nent) is better approximated by the GD hypothesis. We must Gamma function, and and are positive real parameters.
note that this very small low-frequency component is not rep- The shape parameter describes the exponential rate of
resentative of a speech signal, because the DC component is decay; is the square root of the variance. If , the
usually blocked in signal acquisition and represents just noise above is an LD, and if , it is a GD. The maximum-like-
measurement. We conclude from Fig. 8(b) and (c) (as well as lihood estimation of the GGD parameters can be found in
from all other components) that the DCT coefficients of speech [8]. The shape parameter of a GGD can be estimated by
are better approximated by LD. This is to say that speech in the where , , , and
DCT domain also exhibits an LD, with the result that a multi- [7]. The calculation of
variate LD for speech is more accurate than a Gaussian one. is complex and time consuming in real-time applications such
as speech processing. As we can easily show that is a
monotonic function increasing for for the interval of interest,
V. DISCUSSION AND CONCLUSION
we suggest using instead of as a moment value to perform
In previously published results, the distribution of the speech the test.
signal is claimed to have a -D over long time periods [1]–[3], This test evaluates a sample set and finds its best pdf in the
while in some other papers a GD is used. The speech is a spher- class of GGD functions. Under the hypothesis that the shape
ically invariant random process for time intervals less than 2 ms parameter is close to one, the pdf is Laplacian, and if is
[4]. We first demonstrate that a speech signal during voice ac- close to two, the pdf is Gaussian. Thus, using instead of
tivity intervals may be characterized by LD in the time domain. as a decision parameter, the value of is to be compared with
Then two different tests are used to compare widely known and , to verify whether the sample
speech distributions over short time periods. Both tests favor LD set should be considered an LD or GD. In our simulations, the
during voice activity intervals. Tests performed in the KLT and case is also considered where . The
DCT domains lead us to conclude that all major and meaningful moment value for the Gamma distribution is .
components of voiced speech show LD.
In decorrelated transformed domains such as the KLT and
ACKNOWLEDGMENT
DCT, speech components are uncorrelated. Our experimental
results suggest a multivariate Laplacian pdf for speech during Authors would like to thank Associate Editor, the anonymous
its activity intervals. This pdf is simple to use and fits better than reviewers, and H. Warder for their comments and suggestions.
other widely used pdfs in many speech processing applications.
In contrast, similar tests performed on noise gathered from
different acoustic environments clearly illustrate that most of REFERENCES
them are better described by a GD. During silent intervals, both
criteria (i.e., and ) favor the GD hypothesis. [1] W. B. Davenport, “An experimental study of speech wave probability
distributions,” J. Acoust. Soc. Amer., vol. 24, no. 4, pp. 390–399, July
We also noted that these tests, for those frames containing 1952.
both speech and silence (on the edges), favor -D similar to the [2] D. L. Richards, “Statistical properties of speech signals,” Proc. Inst.
expected results from a random mixture selection of a GD and Elect. Eng., vol. 111, no. 5, pp. 941–949, 1964.
an LD. This shows that the difference between our results and [3] M. D. Paez and T. H. Glisson, “Minimum-mean squared-error quantiza-
tion in speech PCM and DPCM,” IEEE Trans. Commun., vol. COM-20,
previously published results [1]–[6] is caused by the inclusion in pp. 225–230, Apr. 1972.
previous studies of silent samples (i.e., low-power noise, which [4] H. Brehm and W. Stammler, “Description and generation of spherically
is well described by a GD) mixed with active speech (which is invariant speech-model signal,” Signal Process., vol. 12, no. 2, pp.
119–141, Mar. 1987.
described by an LD) (see Fig. 5).
[5] J. P. LeBlanc and P. L. De Leòn, “Speech separation by kurtosis maxi-
APPENDIX A mization,” in Proc. ICASSP, vol. 2, 1998, pp. 1029–1032.
TEST [6] G.-J. Jang, T.-W. Lee, and W.-H. Oh, “Learning statistically efficient
features for speaker recognition,” in Proc. ICASSP, 2001.
The test compares experimental data with some given pdfs [7] R. L. Joshi and T. R. Fischer, “Comparison of generalized Gaussian and
and measures the distortion between the pdfs and data by Laplacian modeling in DCT image coding,” IEEE Signal Processing
where the sample space is partitioned Lett., vol. 2, pp. 81–82, May 1995.
into intervals; is the number of samples that fall into the th [8] F. Müller, “Distribution shape of two-dimensional DCT coefficients of
natural images,” Electron. Lett., vol. 29, no. 22, pp. 1935–1936, Oct.
interval; is the theoretical probability that a sample will fall 1993.
into the same th interval; and is the total number of samples. [9] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech en-
The smaller the value, the better the fit. hancement,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 87–95,
Feb. 2001.
[10] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice ac-
APPENDIX B tivity detection,” IEEE Signal Processing Lett., vol. 6, pp. 1–3, Jan.
MOMENT TEST FOR GGD AND -D 1999.
[11] J. Huang and Y. Zhao, “A DCT-based fast signal subspace technique for
Assume that the pdf of the speech signal is modeled with robust speech recognition,” IEEE Trans. Speech Audio Processing, vol.
the zero-mean GGD function 8, pp. 747–751, Nov. 2000.
[12] J. Bell and T. J. Sejnowski, “An information-maximization approach to
blind separation and blind deconvolution,” Neural Comput., vol. 7, pp.
1129–1159, 1995.

Authorized licensed use limited to: the Leddy Library at the University of Windsor. Downloaded on December 08,2023 at 16:37:02 UTC from IEEE Xplore. Restrictions apply.

You might also like