JOURNAL OF COMPUTING, VOLUME 3, ISSUE 10, OCTOBER 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING WWW.JOURNALOFCOMPUTING.ORG

A comparison of mel-frequency cepstral coefficient (MFCC) calculation techniques
Amelia C. Kelly and Christer Gobl
Abstract—Unit selection speech synthesis involves concatenating segments of speech contained in a large database in such a way as to create novel utterances. The sequence of speech segments is chosen using a cost function. In particular, the join cost determines how well consecutive speech segments fit together by extracting acoustic parameters from frames of speech on either side of a potential join point and calculating the distance between them. The mel-frequency cepstral coefficient (MFCC) is a popular numerical representation of acoustic signals and is widely used in the fields of speech synthesis and recognition. In this paper we investigate some of the parameters that affect the calculation of the MFCC, particularly (i) the window length used to examine the speech segments, (ii) the time-frequency pre-processing performed on the signal, and (iii) the shape of the filters used in the mel filter bank. We show with experimental results that the choices of (i)-(iii) have a direct impact on the MFCC values calculated, and hence on the ability of the distance measure to predict discontinuity, which has a significant impact on the ability of the synthesiser to produce quality speech output. In addition, while previous research tended to focus on sonorant sounds such as vowels, diphthongs and nasals, the speech data used in this study have been classified into the following three groups according to their acoustic characteristics: 1. vowels (non-turbulent, periodic sounds); 2. voiced fricatives (turbulent, periodic sounds); 3. voiceless fricatives (turbulent, non-periodic sounds). The choice of (i) is shown to significantly affect the calculation of MFCC values differently for each sound group. One possible application of these findings is altering the cost function in unit selection speech synthesis so that it accounts for the type of sound being joined.

Index Terms—Feature extraction, signal analysis, mel-frequency cepstral coefficients (MFCC), speech synthesis.
————————————————

 A. C. Kelly and C. Gobl are with the Phonetics and Speech Laboratory, Centre for Language and Communication Studies, SLSCS, Trinity College Dublin, Ireland.

——————————  ——————————

1 INTRODUCTION
The cost function in unit selection speech synthesis [1] is a measure of how well a sequence of candidate units represents the target utterance, which is the input to the system in the form of text. The cost function has two major components, the target cost and the join cost. In this paper we focus on the join cost, particularly on the measurement of spectral discontinuity. Consider the speech segments /k-ae/ and /ae-t/ that make up the word “cat”. Calculating the spectral discontinuity between these two segments requires that the spectral characteristics of each be quantified by numerically coding the sounds as vectors of acoustic measurements. More specifically, the numeric representation is taken only on the portions of each sound that will be coming into contact, i.e. at the end of the first segment /k-ae/ and at the start of the second segment /ae-t/, and so the speech sounds are windowed before the signals are numerically coded. The acoustic measurement used to represent the sound must therefore be one for which perceived changes in the sound are accurately mirrored by numerical changes in the acoustic representation. Here we focus on the MFCC, a non-linear spectral representation based on the Fourier transform. The first step in calculating the MFCC values is to take the Fourier transform (FT) of a window of the speech signal. The FT of a signal $x(t)$ is given by:

$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$
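As a concrete illustration of this first step, the sketch below windows a frame of signal and takes its Fourier transform. It assumes numpy; the 25 ms frame length, the 16 kHz sample rate and the synthetic sinusoid are illustrative choices, not values prescribed at this point in the paper.

```python
import numpy as np

# Window a frame of signal and take its Fourier transform, the first
# step of MFCC extraction. A pure tone stands in for a frame of speech.
fs = 16000                         # sample rate in Hz
frame_ms = 25                      # window length in ms (illustrative)
n = int(fs * frame_ms / 1000)      # samples per frame

t = np.arange(n) / fs
x = np.sin(2 * np.pi * 440 * t)    # stand-in for a speech frame

window = np.hanning(n)             # taper to reduce spectral leakage
spectrum = np.fft.rfft(x * window)
power = np.abs(spectrum) ** 2      # power spectrum P(k)

freqs = np.fft.rfftfreq(n, d=1 / fs)
print(freqs[np.argmax(power)])     # spectral peak lies near 440 Hz
```

The Hanning taper used here anticipates the windowing described in the experimental procedure below.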


The FT is a special case of the more general fractional Fourier transform (FRFT). The FRFT depends on the parameter a, which determines the angle of rotation in the time-frequency plane: a = 0 is the identity transform and a = π/2 is the classical Fourier transform. Angles of a between these two values correspond to a transform that lies somewhere between the time and frequency domains. Numerical representations of speech sounds based on time-frequency analysis have been shown to yield promising results in the area of speech recognition. The p-th order fractional Fourier transform of a signal is given by:

$X_p(u) = \int_{-\infty}^{\infty} K_p(u, u')\, x(u')\, du'$

where the kernel $K_p(u,u')$ is defined as:

$K_p(u,u') = A_\alpha \exp\!\left[\, j\pi \left( u^2 \cot\alpha - 2 u u' \csc\alpha + u'^2 \cot\alpha \right) \right]$

with $\alpha = p\pi/2$ and $A_\alpha = \sqrt{1 - j\cot\alpha}$.
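The paper does not give its FRFT implementation. One simple discrete realisation, sketched below assuming numpy, takes a fractional power of the unitary DFT matrix; this is one of several possible discretisations and does not match the continuous kernel in every detail, but it reproduces the key property that transform orders add.

```python
import numpy as np

def frft(x, p):
    """Discrete fractional Fourier transform of order p (angle a = p*pi/2).
    p = 0 is the identity and p = 1 the ordinary (unitary) DFT. Implemented
    as a fractional matrix power via eigendecomposition of the DFT matrix."""
    n = len(x)
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
    w, V = np.linalg.eig(F)                  # F = V diag(w) V^{-1}
    Fp = V @ np.diag(w ** p) @ np.linalg.inv(V)
    return Fp @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(32)

# Order additivity: two quarter-order transforms compose to a half-order one.
y = frft(frft(x, 0.25), 0.25)
z = frft(x, 0.5)
print(np.allclose(y, z, atol=1e-6))
```

This matrix-based form is O(n^2) per transform; fast FRFT algorithms exist, but the matrix power keeps the sketch short and self-contained.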

Next the energy of the windowed signal is calculated as the power spectrum:

$P(k) = |X(k)|^2$

It is convenient then to convert from the linear frequency scale to the mel frequency scale. The signal (represented on the mel scale) is then filtered by a series of uniformly spaced triangular windows of equal width. Frequency values $f$ in Hertz are transformed to mel frequencies $f_{mel}$ using [2], [3]:

$f_{mel} = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right)$
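The Hz-to-mel conversion and its inverse can be written directly from the formula above; this sketch assumes numpy, and the function names are our own.

```python
import numpy as np

def hz_to_mel(f):
    """Hz -> mel, using the 2595*log10(1 + f/700) formula cited in the text."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel -> Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000))            # close to 1000 mel by construction
print(mel_to_hz(hz_to_mel(440)))  # round trip recovers 440 Hz
```

The scale is roughly linear below 1 kHz and logarithmic above, mimicking the non-linear frequency response of the ear.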


The mel filter bank, $\Phi_n(k)$, is constructed as described in [3], where the frequency response of the $n$th filter is calculated using:

$\Phi_n(k) = \begin{cases} 0, & k < f(n-1) \\ \dfrac{k - f(n-1)}{f(n) - f(n-1)}, & f(n-1) \le k \le f(n) \\ \dfrac{f(n+1) - k}{f(n+1) - f(n)}, & f(n) \le k \le f(n+1) \\ 0, & k > f(n+1) \end{cases}$

where the boundary points $f(n)$ of that filter are given by:

$f(n) = f_{mel}^{-1}\!\left( f_{mel}(f_{low}) + n\, \frac{f_{mel}(f_{high}) - f_{mel}(f_{low})}{N + 1} \right)$

where $f_{high}$ and $f_{low}$ are the highest and lowest frequencies contained in the signal, and $N$ is the number of filters. The log energy output $S(n)$ of a filter in the mel-scale filter bank is calculated by weighting the power value given by $P(k)$ with the frequency response of the appropriate filter $\Phi_n(k)$, such that:

$S(n) = \log\!\left[ \sum_k P(k)\, \Phi_n(k) \right]$

The coefficients can then be extracted by calculating the Discrete Cosine Transform (DCT) of the log of the output energies of the filters, using the following equation, taken from [4]:

$c_m = \sum_{n=1}^{N} S(n) \cos\!\left[ \frac{m \left( n - \tfrac{1}{2} \right) \pi}{N} \right], \quad m = 1, 2, \ldots, M$

where $M$ is the number of MFCCs and $N$ is the number of mel-scaled filters.

Recent studies [5], [6], [7] comparing spectral representations of sound have investigated the effect of changing the width of the windowing function used to derive the acoustic parameterisation. These studies found that windowing functions of particular widths yielded higher correlations between the acoustic representation and human perception of the join. Clearly window length plays a role in calculating acoustic parameters like the MFCC: since the MFCCs are calculated by first taking the FT of the windowed speech segment, the width of the windowing function directly affects the MFCC values extracted from the signal. This in turn affects the join cost penalty incurred by the potential concatenation. Despite this, there is remarkable variation in the choices of window length used in studies seeking the most useful metric, and there has been little theoretical examination of this effect that would allow the role of windowing to be understood more intuitively.

A number of studies [8], [9], [10], [11], [12], [13] investigating which numeric representation of sound best reflects human perception have compared different numeric representations of speech sounds by assessing listeners' ability to detect a join. Although some of these and similar studies conclude that one metric outperforms the others, the highest correlation achieved between such metrics and human perception of discontinuity is roughly 0.7. Furthermore, each study cites a different method as the one that best reflects human detection of joins. For example, Wouters and Macon [8], using a 5 ms frame of speech to compare metrics, concluded that the MFCC outperformed the others, while Chen and Campbell [10] cited the bispectrum as the best metric, using 20 ms frames of speech for their calculations. It is not possible to realistically compare the relative performance of these different metrics unless proper attention is paid to one of the first and most fundamental operations in the signal processing stage, namely the choice of window length that will produce a numerical representation that best predicts discontinuities in speech. In this paper we investigate the impact on MFCC calculation of (i) the window length, (ii) the fractional Fourier transform angle a, and (iii) the shape of the filter used in the mel filter bank (rectangular, triangular or Gaussian), and whether the performance of the distance measure varies for three different types of speech sounds: vowels, voiced fricatives and voiceless fricatives. The Euclidean distance was calculated between vectors of MFCC values extracted from windowed speech segments in each of the three sound categories. The measurement was first calculated for segments of speech that were naturally consecutive, and then between examples of speech from non-matching phonemes from the same sound category, in order to simulate what would be considered a bad join.
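The full MFCC pipeline described above (windowed power spectrum, mel filter bank, log energies, DCT) can be sketched as follows. This is a minimal illustration assuming numpy; the filter-bank construction follows one common reading of the triangular-filter scheme in [3], and the function names and parameter defaults are our own, not the paper's.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=8000.0):
    """Triangular filters with centres spaced uniformly on the mel scale."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for n in range(1, n_filters + 1):
        lo, c, hi = bins[n - 1], bins[n], bins[n + 1]
        for k in range(lo, c):                      # rising slope
            fbank[n - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                      # falling slope
            fbank[n - 1, k] = (hi - k) / max(hi - c, 1)
    return fbank

def mfcc(frame, fs, n_filters=26, n_coeffs=12):
    """Windowed frame -> power spectrum -> mel filter bank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    fbank = mel_filterbank(n_filters, n_fft, fs)
    log_energy = np.log(fbank @ power + 1e-10)      # S(n); guard against log(0)
    # DCT of the log filter energies: c_m = sum_n S(n) cos[m (n - 1/2) pi / N]
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_coeffs + 1), n + 0.5) / n_filters)
    return dct @ log_energy

fs = 16000
t = np.arange(int(0.025 * fs)) / fs
coeffs = mfcc(np.sin(2 * np.pi * 200 * t), fs)
print(coeffs.shape)   # (12,), one vector of 12 MFCCs as used in the experiments
```

Substituting the FRFT for the FFT in the `power` line, or a rectangular/Gaussian shape in `mel_filterbank`, gives the parameter variations (ii) and (iii) studied in this paper.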
We have chosen to perform the experiments in this manner so as to remove the statistical uncertainty normally associated with small numbers of human observers, thus providing experimental evidence, the first to our knowledge, that directly measures the influence of (i)-(iii) on MFCC values for different types of speech sounds. The results show that the metric is significantly sensitive to both the type of sound being measured and the length of the function used to window the signals, with shorter window lengths giving better results for groups 1 and 2, and longer window lengths giving better results for group 3. This in turn suggests that the performance of unit selection speech synthesis systems may be improved if the cost function is designed to account for these dependencies.

2 EXPERIMENTAL PROCEDURE
This experiment was designed to examine the effects of changing the value of (i) the window length used to examine the speech segments, (ii) the time-frequency analysis performed on the signal, and (iii) the shape of the filters used in the mel filter bank, on the calculation of MFCCs, and to investigate which types of speech sounds from the following groups are the most sensitive to these changes.
Group 1: vowels (non-turbulent, periodic sounds).
Group 2: voiced fricatives (turbulent, periodic sounds).
Group 3: voiceless fricatives (turbulent, non-periodic sounds).

The MFCC values were first calculated for consecutive segments of the test stimuli. The speech samples examined were 295 examples of /aa/, 531 examples of /i/ and 224 examples of /u/ for the vowel category; 590 examples of /f/, 915 examples of /s/ and 323 examples of /sh/ for the voiceless fricative category; and 163 examples of /v/ and 326 examples of /dh/ for the voiced fricative category. The samples were halved at the midpoint and the MFCC values were extracted for different values of (i), (ii) and (iii). The Euclidean distances between consecutive vectors of MFCCs were then calculated. If the MFCC is a good numerical representation of speech signals, these values are expected to be low, as the segments of speech are naturally consecutive.

The MFCC values were then calculated for non-consecutive segments of the test stimuli within each sound category, in such a way as to create `bad' joins, i.e. joining spectrally dissimilar speech sounds which still have the same acoustic characteristics as defined by the three sound groups. To create these perceptually `bad' joins, 220 examples of each sound in the category were joined to 220 examples of the other sound in the category. Again, the samples were halved at the midpoint and the MFCC values were extracted for different values of (i), (ii) and (iii). In this case the Euclidean distances between the non-consecutive MFCC vectors were calculated. If the MFCC is a good numerical representation of speech signals, these values are expected to be high, as the segments of speech are non-consecutive.
All speech examples were taken from the RMS (American male) recordings of the CMU Arctic database [14], sampled at 16 kHz. 100 ms of each sound was examined. Each example was windowed using a Hanning windowing function on either side of the midpoint to minimise the spectral leakage introduced by windowing, and the fast fractional Fourier transform (FRFT) was calculated on the windowed segments, with window lengths ranging from 10 ms to 50 ms in steps of 5 ms, for angles of a varying from 0 to π. MFCCs were calculated using a filter bank of triangular filters as described by Memon et al. [3]. The Euclidean distance was then calculated between the two vectors of 12 MFCCs representing the speech sounds on either side of the midpoint. The Euclidean distance is calculated using:

$d(p, q) = \sqrt{ \sum_{i=1}^{M} (p_i - q_i)^2 }$

where p and q are vectors of MFCC values. The Euclidean distance between MFCC values was used as a measure of the efficacy of the MFCC as an objective measurement of spectral discontinuity. The difference, D, was measured: the Euclidean distance calculated between non-consecutive segments minus the Euclidean distance calculated between consecutive segments. This discrimination value D is taken to be a measurement of how well the metric can distinguish between natural and joined speech, and is defined as follows:

$D = d_{non\text{-}consecutive} - d_{consecutive}$

The results were analysed and are presented in the following section.
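The distance and discrimination measures just defined are simple to compute; the sketch below assumes numpy, and the MFCC vectors are illustrative stand-ins rather than real speech measurements.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two vectors of MFCC values."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((p - q) ** 2))

# Discrimination value D: distance across a simulated "bad" join minus the
# distance across a natural (consecutive) join.
consec_a  = np.array([1.0, 2.0, 0.5])   # end of a natural segment
consec_b  = np.array([1.1, 2.1, 0.4])   # its true continuation
foreign_b = np.array([3.0, 0.2, 2.5])   # a spectrally dissimilar segment

D = euclidean(consec_a, foreign_b) - euclidean(consec_a, consec_b)
print(D > 0)   # a useful metric separates the two cases, giving D > 0
```

Larger D means the representation distinguishes natural from artificial joins more clearly, which is how the results below are scored.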

3 RESULTS
In this section the results of the experiment are presented. The general effects of changing parameters (i), (ii) and (iii) are first examined, and then discussed in relation to the particular sound categories. Changes in each of the parameter values affected each sound category to a different degree. MFCC calculation was by far the most sensitive to changes in window length, as evidenced by the effect on D values compared with the effect of a. Changes in a had little or no effect on the MFCC values calculated and did not significantly affect the D values. For vowel sounds, the window size significantly affected the calculated value of D, while changes in a and filter type produced slight changes that were not significant. The MFCC best represents vowel sounds, with D values significantly higher than for the other sound groups. The metric performed best with window sizes of 25 ms and a = 90 degrees (i.e. the classical FT).

For voiced fricative sounds, the window size significantly affected the calculated value of D, but the effect of changing a and filter type was insignificant. The MFCC performed worst for the voiced fricative category compared with vowels and voiceless fricatives, and in this category performed best for window lengths of 10 ms and a = 72 degrees.


For voiceless fricative sounds, the window size significantly affected the calculated value of D, but changes in a and filter type produced slight changes that were not significant. Contrary to vowels and voiced fricatives, the MFCC performed best for voiceless fricatives when a large window length of 45 ms was used, with a = 27 degrees. The metric did not perform as well for this category as it did for vowels, but performed significantly better than for voiced fricatives.

4 DISCUSSION
The MFCC is frequently used as a numerical representation of speech sounds. Representing a speech signal in the cepstral domain essentially performs a source-filter separation, making it possible to represent the spectral characteristics of the speech sound using the first few coefficients. Changes in these values represent changes in the spectral distribution of the signal. Theoretically, this makes the MFCC a good indicator of the similarity of two speech sounds, and provides a method of detecting discontinuities between two concatenated segments of speech. Furthermore, the MFCC is said to be a particularly faithful representation because the frequency scale is warped to a mel scale so that it more closely resembles the non-linear response of the ear. Many studies have tested the efficacy of metrics such as the MFCC, relying heavily on perceptual results as an indicator; these studies are rarely in agreement, however. The calculation of MFCC values depends on a number of fundamental parameters, including (i) the length of the windowing function used to segment the speech signal, (ii) the time-frequency analysis performed on the signal, and (iii) the shape of the filters used in the mel filter bank. In this study the MFCC values were calculated for different values of each of these parameters in order to investigate the extent to which they affect the MFCC calculation. Furthermore, the values were tested on three different types of test stimuli, which differ in their inherent acoustic characteristics. Measuring how the MFCC values are affected by changes in (i), (ii) and (iii) has important implications for unit selection speech synthesis, particularly with a view to optimising the cost function. This study demonstrated the extent to which MFCC values are affected by changes in (i), (ii) and (iii) by showing that certain combinations of these parameters yield a greater distinction between consecutive and non-consecutive speech for each group of sounds, and are therefore more adept at detecting the presence of a large spectral distance between concatenated segments. In a unit selection speech synthesis system, the MFCC values are calculated for the database and a cost is assigned to a sequence of units that reflects how well the segments would join together, based on the difference between these measurements. This study has shown objectively not only the importance of carefully selecting the fundamental parameters used to calculate the MFCCs, but also that the choice of these parameters should differ depending on the type of sound being examined. One application of these findings would be to modify the traditional unit selection algorithm so that it takes into account the type of sounds at a concatenation point. Future work in this area will be to incorporate these findings into a unit selection speech synthesiser and evaluate the resulting synthesised speech.

5 CONCLUSION
This study was designed to investigate the efficacy of the MFCC metric as a means of objectively distinguishing between speech sounds. The MFCC values were calculated by first windowing the speech sounds, and a range of window lengths was tested in order to address the lack of consensus among previous studies investigating the performance of the metric. Furthermore, the MFCC metric was assessed not only for vowel sounds but for voiceless and voiced fricatives as well, and accordingly adds to the literature in this important and growing area. In each sound category a few hundred examples of two phonemes were extracted from the American male RMS recordings of the CMU Arctic database. First the MFCC values were calculated on either side of the midpoint of sounds of the same phonetic label. The Euclidean distance between the vectors of MFCC values was taken as a measure of how well the MFCC can predict that two segments of speech are naturally consecutive. The MFCCs were extracted using a range of different values for (i) and (ii). To measure the ability of the MFCC to predict that segments are non-consecutive, a perceptually “bad” join was modelled by calculating the Euclidean distance between MFCC vectors of a few hundred examples of one phoneme type in the sound category and a few hundred examples of the other phoneme type, a situation that models a perceptually salient artificial join. Again this was calculated for a number of window lengths and values of a. From these results we can conclude the following:
1. The ability of the distance measure to predict discontinuity differs significantly with respect to the width of the windowing function.
2. The ability of the distance measure to predict discontinuity differs with respect to the angle between the time and frequency domains to which the signal is transformed, but not significantly so for these examples.
3. The choice of filter type did not significantly affect the ability of the metric to predict discontinuity.
4. The distance measure was more effective at measuring perceptual discontinuity for vowel sounds than for voiceless fricatives, which in turn performed better than voiced fricatives.
The last result is of particular interest, as the cost functions in unit selection speech synthesis systems generally use one spectral discontinuity measure to calculate the join cost regardless of the speech sound being joined. The results of this paper show that there is scope for further research into detecting discontinuities in voiced and voiceless fricatives. We are currently investigating the use of other linear and bi-linear transforms, such as the fractional Fourier transform, wavelets and the Wigner-Ville distribution, to calculate MFCCs, with the intention of optimising the unit selection cost function for use with different types of speech sounds.

ACKNOWLEDGMENT
This work was supported in part by Foras na Gaeilge.

REFERENCES
[1] A. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 373-376, 1996.
[2] D. O'Shaughnessy, Speech Communication: Human and Machine. Addison-Wesley, p. 150, 1987.
[3] S. Memon, M. Lech, N. Maddage and L. He, “Application of the Vector Quantization Methods and the Fused MFCC-IMFCC Features in the GMM based Speaker Recognition,” in Recent Advances in Signal Processing, A. A. Zaher, Ed. InTech, 2009.
[4] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[5] B. Kirkpatrick, D. O'Brien and R. Scaife, “A comparison of spectral continuity measures as a join cost in concatenative speech synthesis,” Proceedings of the IET Irish Signals and Systems Conference (ISSC), 2006.
[6] A. C. Kelly, “Join Cost Optimisation for Unit Selection Speech Synthesis,” poster, Sao Paulo School of Advanced Studies in Speech Dynamics, Sao Paulo, Brazil, 2010. www.dinafon.iel.unicamp.br/spsassd_files/posterAmeliaKelly.pdf
[7] A. C. Kelly and C. Gobl, “The effects of windowing on the calculation of MFCCs for different types of speech sounds,” Proceedings of the International Conference on Non-Linear Speech Processing, Gran Canaria, 2011.
[8] J. Wouters and M. W. Macon, “Perceptual evaluation of distance measures for concatenative speech synthesis,” Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1998.
[9] E. Klabbers and R. Veldhuis, “On the Reduction of Concatenation Artefacts in Diphone Synthesis,” Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1998.
[10] J. D. Chen and N. Campbell, “Objective distance measures for assessing concatenative speech synthesis,” Proceedings of Eurospeech, 1999.
[11] Y. Stylianou and A. Syrdal, “Perceptual and Objective Detection of Discontinuities in Concatenative Speech Synthesis,” Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2001.
[12] J. Vepa, S. King and P. Taylor, “New objective distance measures for spectral discontinuities in concatenative speech synthesis,” Proceedings of the 2002 IEEE Workshop on Speech Synthesis, 2002.
[13] Y. Pantazis, Y. Stylianou, and E. Klabbers, “Discontinuity Detection in Concatenated Speech Synthesis based on Nonlinear Speech Analysis,” Interspeech, 2005.
[14] J. Kominek and A. Black, “The CMU ARCTIC speech databases for speech synthesis research,” Tech. Report CMU-LTI-03-177, Language Technologies Institute, Carnegie Mellon University, 2003. http://festvox.org/cmu_arctic/

Amelia C. Kelly is a Ph.D. student at Trinity College Dublin. Christer Gobl is Senior Lecturer in Speech Science at Trinity College Dublin.