
International Conference on Computer and Communication Engineering (ICCCE 2012), 3-5 July 2012, Kuala Lumpur, Malaysia

Arabic Speech Transformation Using MFCC in GMM


Rania Elmanfaloty, N. Korany, El-Sayed A. Youssef
Electrical Engineering Department, Faculty of Engineering, Alexandria, Egypt
Rania.elmanfaloty@aiet.edu.eg, nokorany@hotmail.com

Abstract— Voice conversion (VC) is a process which modifies the speech signal produced by one source speaker so that it sounds like another target speaker. In this paper the transformation is determined using parallel Arabic utterances from the source and target speakers. A conversion function based on a Gaussian mixture model (GMM) is used to transform the spectral envelope, described by Mel Frequency Cepstral Coefficients (MFCC). The quality of the transformed utterances is measured using subjective and objective evaluations.

I. INTRODUCTION
Voice conversion is aimed at modifying speech spoken by one speaker (source) to give the impression that it was spoken by another specific speaker (target). The voice conversion process consists of two phases: training and conversion.
In training, a mapping from source features to target features is created based on training data from both speakers. In the conversion phase, any unknown utterance from the source speaker can be converted to sound like the target speaker. The most popular methods for creating the mapping in voice conversion include codebooks and Gaussian mixture models (GMMs).
The most common features used in voice conversion are based on direct use of spectral bands or on the source-filter theory. Examples of such features include MFCCs (Mel Frequency Cepstral Coefficients) and LSFs (Line Spectral Frequencies) [1]. The aim of this paper is to extract the MFCCs and use them with a GMM for voice conversion of Arabic spoken words.

[Figure 1: block diagram of the MFCC process: Voice input → Preemphasis → Windowing → FFT with |.|^2 → Magnitude Spectrum → Mel Filter Bank → Mel Spectrum → DCT → Output.]
Figure 1. MFCC block diagram.
The following section describes the steps of feature extraction using Mel Frequency Cepstral Coefficients (MFCC). Section III briefly describes the voice conversion baseline system. In Section IV the results are evaluated by means of subjective and objective tests.

II. THE MEL CEPSTRAL COEFFICIENTS (MFCC)
The overall MFCC process is shown in Figure 1.

A. Preemphasis
In this step the signal is passed through a filter which emphasizes higher frequencies. This process increases the energy of the signal at higher frequencies [3], as shown in Figure 2:

    s2(n) = s(n) - a * s(n-1)                                  (1)

where s(n) is the speech signal, s2(n) is the output signal, and the value of a is usually between 0.9 and 1.0. The z-transform of the filter is

    H(z) = 1 - a * z^(-1)                                      (2)

B. Windowing
The input speech signal is segmented into frames of 20-40 ms with optional overlap. Each frame is then multiplied by a Hamming window, defined in equation (3):

    w(n) = 0.54 - 0.46 cos(2πn / (N-1)),   0 ≤ n ≤ N-1         (3)

where N is the number of samples in each frame.

C. Fast Fourier Transform
The FFT is used to convert each frame from the time domain to the frequency domain, giving the frequency response of each frame. The magnitude squared is the power spectrum.
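For concreteness, the front end of sections II.A-II.C (equations (1)-(3) and the magnitude-squared FFT stage) can be sketched as follows in NumPy. This is a minimal illustration, not the paper's implementation: only a = 0.95 (Figure 2) and the 32 ms / 16 ms segmentation (Section IV.A) come from the paper; the function name and everything else are assumptions.

    import numpy as np

    def mfcc_frontend(signal, fs, a=0.95, frame_ms=32, hop_ms=16):
        # Equation (1): s2(n) = s(n) - a * s(n-1)
        emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

        frame_len = int(fs * frame_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        n_frames = max(0, 1 + (len(emphasized) - frame_len) // hop)

        # Equation (3): Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
        window = np.hamming(frame_len)

        power_spectra = []
        for i in range(n_frames):
            frame = emphasized[i * hop : i * hop + frame_len] * window
            spectrum = np.fft.rfft(frame)                 # frequency response of the frame
            power_spectra.append(np.abs(spectrum) ** 2)   # |.|^2 gives the power spectrum
        return np.array(power_spectra)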

[Figure 2: time-domain waveforms, amplitude versus sample index. (a) Original wave. (b) The same wave after preemphasis with a = 0.95.]
Figure 2. (a) Original wave. (b) After preemphasis, a = 0.95.
D. Mel Filter Bank
In [2], it has been shown that human perception of the frequency content of speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [2]. The relation between the actual frequency and the mel frequency is given by equation (4):

    mel(f) = 2595 * log10(1 + f/700)                           (4)

To simulate the subjective spectrum, a filter bank spaced uniformly on the mel scale is used. Figure 3 shows an example of a mel-spaced filter bank. Each filter has a triangular band-pass frequency response: its magnitude response is equal to unity at its centre frequency and decreases linearly to zero at the centre frequencies of the two adjacent filters [3].

[Figure 3: triangular filter bank, amplitude (0 to 1) versus frequency (0 to 8000 Hz).]
Figure 3. An example of a mel-spaced filter bank.
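As an illustration of section II.D, a triangular filter bank spaced uniformly on the mel scale of equation (4) can be built as below. This is a sketch only; the 26-filter, 512-point FFT and 16 kHz defaults are our assumptions, not values stated in the paper.

    import numpy as np

    def mel_filterbank(n_filters=26, nfft=512, fs=16000):
        # Equation (4) and its inverse
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

        # Centre frequencies uniform in mel, mapped back to Hz and then to FFT bins
        mel_points = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
        bins = np.floor((nfft + 1) * inv_mel(mel_points) / fs).astype(int)

        fbank = np.zeros((n_filters, nfft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            # Rise linearly to unity at the centre, fall to zero at the neighbours
            for k in range(left, centre):
                fbank[i - 1, k] = (k - left) / max(centre - left, 1)
            for k in range(centre, right):
                fbank[i - 1, k] = (right - k) / max(right - centre, 1)
        return fbank

    # Mel spectrum of one power-spectrum frame: mel_energies = mel_filterbank() @ power_frame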
E. Discrete Cosine Transform (DCT)
The Discrete Cosine Transform (DCT) is used to convert the log mel spectrum back to the time (cepstral) domain. The coefficients resulting from this conversion are called Mel Frequency Cepstral Coefficients [4].
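A short illustration of section II.E, assuming the mel energies from the previous step; the orthonormal type-II DCT from SciPy and the 13-coefficient truncation are our choices, not the paper's.

    import numpy as np
    from scipy.fftpack import dct

    def mel_cepstrum(mel_energies, n_ceps=13):
        # Log-compress the mel spectrum, then DCT back to the cepstral domain
        log_mel = np.log(mel_energies + 1e-10)   # small floor avoids log(0)
        return dct(log_mel, type=2, norm='ortho')[..., :n_ceps]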

III. THE BASELINE CONVERSION SYSTEM

A. Training Phase
The baseline system requires parallel utterances from the source and target speakers for training. Let x = [x1 x2 ... xN] and y = [y1 y2 ... yN] be parallel sequences of feature vectors (MFCC) of the source and target speech, respectively. Then the combination of these sequences, z = [x^T y^T]^T, is used to estimate the parameters (α_q, μ_q, Σ_q) of a Gaussian mixture model with Q components for the joint density p(x, y), where α_q denotes the prior probability of x having been generated by component q, μ_q is the mean vector and Σ_q is the covariance matrix [5][6][7].

B. Conversion Phase
Within this phase each source feature vector x is converted to a target vector y by the conversion function which minimizes the mean squared error between the converted source and the target vectors observed in training [5][6][7]:

    F(x) = Σ_{q=1}^{Q} [ μ_q^y + Σ_q^yx (Σ_q^xx)^(-1) (x - μ_q^x) ] p(q|x)    (5)

where

    Σ_q = [ Σ_q^xx  Σ_q^xy ]
          [ Σ_q^yx  Σ_q^yy ]

is the covariance matrix of each GMM component,

    μ_q = [ μ_q^x ]
          [ μ_q^y ]

is the mean vector of each GMM component, and p(q|x) is the conditional probability of GMM component q given x, which is given by equation (6):

    p(q|x) = α_q N(x; μ_q^x, Σ_q^xx) / Σ_{p=1}^{Q} α_p N(x; μ_p^x, Σ_p^xx)    (6)

where α_q denotes the prior probability of x having been generated by component q and N is the n-dimensional normal distribution.
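A minimal sketch of both phases under stated assumptions: scikit-learn's GaussianMixture stands in for whatever EM implementation the authors used, Q = 8 and all names are illustrative, and the posterior of equation (6) is computed explicitly with SciPy.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(x, y, Q=8):
        # Training phase: fit a Q-component GMM on joint vectors z = [x^T y^T]^T.
        # x, y: parallel (N, d) arrays of source and target MFCC vectors.
        z = np.hstack([x, y])
        return GaussianMixture(n_components=Q, covariance_type='full').fit(z)

    def convert(gmm, x):
        # Conversion phase: apply the conversion function of equation (5)
        # to a single d-dimensional source feature vector x.
        d = x.shape[0]
        mu_x = gmm.means_[:, :d]                # mu_q^x
        mu_y = gmm.means_[:, d:]                # mu_q^y
        cov_xx = gmm.covariances_[:, :d, :d]    # Sigma_q^xx
        cov_yx = gmm.covariances_[:, d:, :d]    # Sigma_q^yx

        # Equation (6): p(q|x) = alpha_q N(x; mu_q^x, Sigma_q^xx) / sum_p alpha_p N(...)
        likes = np.array([gmm.weights_[q] * multivariate_normal.pdf(x, mu_x[q], cov_xx[q])
                          for q in range(gmm.n_components)])
        post = likes / likes.sum()

        # Equation (5): posterior-weighted sum of per-component linear regressions
        y_hat = np.zeros(d)
        for q in range(gmm.n_components):
            y_hat += post[q] * (mu_y[q] + cov_yx[q] @ np.linalg.solve(cov_xx[q], x - mu_x[q]))
        return y_hat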

Finally, the inverse MFCC transform is performed on the transformed feature vectors to obtain the converted speech signal.

IV. EVALUATION

A. Experimental Corpus
The corpus used in this work contains some Arabic words. Each word has the form Consonant-Vowel-Consonant (CVC) and includes one of the three vowels (a, o, e), such as (‫ﻗﺎﻝ ﺑﺎﺏ ﺻﺎﻡ ﻓﻴﻞ ﺩﻳﻦ ﻣﻴﻢ ﻓﻮﻝ ﺳﻮﺭ ﻧﻮﺭ‬ ...) (qal, bab, s'am, fel, den, mem, fol, sor, nor, ...) [8]. Each word is segmented into 32 ms frames with 16 ms of overlap. The functions (melfcc.m, invmelfcc.m) [9] are used within the simulation process. Within these functions, the reconstruction is essentially white noise filtered by the estimated MFCC spectral envelopes, which leads to a "whispered" quality [10].
B. Subjective evaluation
The MFCC feature vector is extracted for both the source and the target, and the speech is then reconstructed from these MFCC feature vectors. The ABX test [5] is applied to the reconstructed source, the reconstructed target and the transformed words. In this test the listeners decide whether the transformed word is closer to the reconstructed source, to the reconstructed target, or to neither.
The evaluation employs the ABX test for the three vowels and all types of conversion: female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f). The results show that most of the listeners confirmed that the converted voice belongs to the reconstructed target. Figure 4 shows the result of the ABX test for all types of conversion.
[Figure 4: bar chart of the percentage of subjects confirming the target, 0% to 100%, for f2f, m2m, m2f and f2m.]
Figure 4. Percentages of subjects confirming for f2f, m2m, m2f and f2m.
C. Objective evaluation
The MFCC feature vector of the source is compared to that of the transformed signal. Moreover, the MFCC feature vector of the target is compared to that of the transformed signal. The mean squared error (MSE) between the two vectors is computed using equation (7) [11]:

    ε(x, y) = Σ_{i=1}^{N} (x_i - y_i)^2                        (7)

where x = {x_i | i = 1, 2, ..., N} and y = {y_i | i = 1, 2, ..., N} are two vectors of length N.
The MSE between the MFCC of the target and the transformed signals, and the MSE between the MFCC of the source and the transformed signals, are shown in Tables I and II respectively. The MSE is calculated for each of the three vowels and for all types of conversion: female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f). Figure 5 summarizes Tables I and II.

TABLE I. RESULTS OF MSE BETWEEN TARGET AND TRANSFORMED MFCC

    Vowel     f2f       m2m       f2m       m2f
    A         0.1987    0.0948    0.0920    0.1052
    O         0.0711    0.0396    0.0423    0.0678
    E         0.0702    0.0413    0.0367    0.0624

TABLE II. RESULTS OF MSE BETWEEN SOURCE AND TRANSFORMED MFCC

    Vowel     f2f       m2m       f2m       m2f
    A         0.2672    0.0684    0.1457    0.1603
    O         0.1887    0.1095    0.1202    0.1100
    E         0.1322    0.1203    0.1223    0.1079

[Figure 5: grouped bar chart comparing MSE (target-transformed) against MSE (source-transformed), 0 to 0.25, for f2f, m2m, f2m and m2f.]
Figure 5. MSE for the three vowels (f2f, m2m, m2f and f2m).

The results show that the MSE between target and transformed words is in most cases lower than the MSE between source and transformed words; this means that the transformed word is closer to the target word than to the source word. It is also concluded that f2f and f2m conversions give better results than m2m and m2f conversions.
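Equation (7) is straightforward to compute; a small sketch (the function name is ours):

    import numpy as np

    def mse(x, y):
        # Equation (7): sum of squared differences between two
        # equal-length MFCC vectors.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return float(np.sum((x - y) ** 2))

    # e.g. compare mse(target_mfcc, transformed_mfcc) with mse(source_mfcc, transformed_mfcc);
    # the smaller value indicates which speaker the transformed word is closer to.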


D. Pitch estimation
For the word (‫ﻓﻴﻞ‬, fel) the pitch period is estimated per frame for the reconstructed source, the reconstructed target and the transformed word. For all types of conversion, female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f), the pitch period is plotted against the frame number, as shown in Figure 6.
Figure 6 shows that the pitch of the transformed word either increases or decreases towards that of the target, such that the transformed word pitch is closer to the target word pitch than to the source word pitch.

[Figure 6: four panels of pitch period versus frame number for the source, target and transformed words.]
Figure 6. Pitch period plot: (a) f2f, (b) m2m, (c) f2m, (d) m2f.
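The paper does not state which pitch estimator was used; the sketch below applies a plain autocorrelation method to 32 ms time-domain frames as one plausible way to reproduce per-frame pitch-period curves like those of Figure 6. The search range of 60-400 Hz is our assumption.

    import numpy as np

    def pitch_per_frame(frames, fs, fmin=60.0, fmax=400.0):
        # Return the pitch period (in samples) of each frame via the
        # location of the autocorrelation peak within the plausible lag range.
        lag_min, lag_max = int(fs / fmax), int(fs / fmin)
        periods = []
        for frame in frames:
            frame = frame - frame.mean()
            ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
            lag = lag_min + np.argmax(ac[lag_min:lag_max])
            periods.append(lag)
        return np.array(periods)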
V. CONCLUSIONS
In this paper voice conversion for some Arabic utterances using MFCC feature vectors and GMM is investigated. The conversion is performed for all types, female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f), and it is evaluated by means of subjective and objective tests. The evaluations show that the transformed word is closer to the target word than to the source word.

REFERENCES
[1] E. Helander, J. Nurminen and M. Gabbouj, "LSF Mapping for Voice Conversion with Very Small Training Sets," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[2] M. N. Do, "Digital Signal Processing Mini-Project: An Automatic Speaker Recognition System," [online], http://www.ifp.uiuc.edu/~minhdo/teaching/speaker_recognition (Accessed: 9 September 2011).
[3] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms Using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, no. 3, March 2010, ISSN 2151-9617.
[4] S. A. Khayam, "The Discrete Cosine Transform (DCT): Theory and Application," ECE 802-602: Information Theory and Coding, March 10, 2003.
[5] A. Kain, "High Resolution Voice Transformation," Ph.D. thesis, Oregon Health and Science University, Portland, USA, 2001.
[6] R. Elmanfaloty, N. Korany and E.-S. A. Youssef, "Quality of Arabic Utterances Transformed Using Different Residual Prediction Techniques," International Conference on Signal and Image Processing (ICSIP), 2010.
[7] D. A. Reynolds, "Gaussian Mixture Models," Encyclopedia of Biometric Recognition, Springer, February 2008.
[8] "Arabic/Arabic sounds," Wikibooks [online] 2012, http://en.wikibooks.org/wiki/Arabic/Arabic_sounds (Accessed: 9 September 2011).
[9] "PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m," LabROSA [online] 2010, http://labrosa.ee.columbia.edu/matlab/rastamat/ (Accessed: 9 September 2011).
[10] "Audio Feature Comparison," MET lab [online], http://labrosa.ee.columbia.edu/matlab/rastamat/ (Accessed: 9 September 2011).
[11] Z. Wang and A. C. Bovik, "Mean Squared Error: Love It or Leave It? A New Look at Signal Fidelity Measures," IEEE Signal Processing Magazine, 2009.
