Abstract— Voice conversion (VC) is a process which modifies the speech signal produced by one source speaker so that it sounds like another target speaker. In this paper the transformation is determined using identical Arabic utterances from the source and target speakers. A conversion function based on a Gaussian mixture model (GMM) is used to transform the spectral envelope described by Mel Frequency Cepstral Coefficients (MFCC). The quality of the transformed utterances is measured using subjective and objective evaluations.

I. INTRODUCTION

Voice conversion aims at modifying speech spoken by one speaker (source) to give the impression that it was spoken by another specific speaker (target). The voice conversion process consists of two phases: training and conversion. In training, a mapping from source features to target features is created based on training data from both speakers. In the conversion phase, any unknown utterance from the source speaker can be converted to sound like the target speaker. The most popular methods for creating the mapping …

[Block diagram of MFCC extraction: Voice input → Preemphasis → Windowing → |·|² → Magnitude spectrum → Triangular filter bank]

The speech signal is first passed through a preemphasis filter:

    s2(n) = s(n) − a · s(n−1)                                (1)

where s(n) is the speech signal, s2(n) is the output signal and the value of a is usually between 0.9 and 1.0. The z-transform of the filter is

    H(z) = 1 − a · z⁻¹                                       (2)
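As an illustration of equations (1) and (2), the preemphasis step can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the paper; the function name is ours, and the default a = 0.95 follows the value used for Figure 2:

```python
import numpy as np

def preemphasis(s, a=0.95):
    """First-order preemphasis filter: s2(n) = s(n) - a * s(n-1)."""
    s = np.asarray(s, dtype=float)
    s2 = np.empty_like(s)
    s2[0] = s[0]                 # no previous sample at n = 0
    s2[1:] = s[1:] - a * s[:-1]  # difference with the one-sample-delayed signal
    return s2
```

Since H(z) = 1 − a·z⁻¹ is an FIR filter with coefficients [1, −a], the same result can also be obtained with `scipy.signal.lfilter([1, -a], [1], s)`.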
[Figure 2. (a) Original wave. (b) After preemphasis, a = 0.95. Amplitude vs. samples (×10⁴).]

D. Mel Filter Bank

In [2], it has been shown that human perception of the frequency content of speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [2]. The relation between the actual frequency and the mel frequency is given by equation (4):

    mel(f) = 2595 · log10(1 + f / 700)                       (4)

[Figure 3. An example of mel-spaced filter bank. Amplitude vs. frequency, 0–8000 Hz.]

III. THE BASELINE CONVERSION SYSTEM

A. Training Phase

The baseline system requires parallel utterances of the source and target speaker for training. Let x = [x1 x2 … xN] and y = [y1 y2 … yN] be parallel sequences of feature vectors (MFCC) of the source and target speech, respectively. Then the combination of these sequences, z = [xᵀ yᵀ]ᵀ, is used to estimate the parameters of a Gaussian mixture model (αq, μq, Σq) with Q components for the joint density p(x, y), where αq denotes the prior probability of x having been generated by component q, μq is the mean vector and Σq is the covariance matrix [5][6][7].

B. Conversion Phase

Within this phase each source feature vector x is converted to a target vector y by the conversion function which minimizes the mean squared error between the converted source and the target vectors observed in training [5][6][7]:

    F(x) = ∑_{q=1}^{Q} p(q|x) · ( μq^y + Σq^yx (Σq^xx)⁻¹ (x − μq^x) )        (5)

where

    Σq = | Σq^xx  Σq^xy |
         | Σq^yx  Σq^yy |

and p(q|x) = αq N(x; μq^x, Σq^xx) / ∑_{p=1}^{Q} αp N(x; μp^x, Σp^xx), where αq denotes the prior probability of x having been generated by component q and N is the n-dimensional normal distribution.
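The conversion function of equation (5) is a posterior-weighted sum of per-component linear regressions. The following is a minimal NumPy sketch under that reading; the function and argument names are illustrative, not from the paper, and the posterior p(q|x) is evaluated in the log domain for numerical stability:

```python
import numpy as np

def gmm_convert(x, alphas, mu_x, mu_y, S_xx, S_yx):
    """Eq. (5): F(x) = sum_q p(q|x) * (mu_q^y + S_q^yx (S_q^xx)^-1 (x - mu_q^x))."""
    Q = len(alphas)
    # log of alpha_q * N(x; mu_q^x, S_q^xx); the (2*pi)^(-n/2) constant cancels
    # in the normalization over q, so it is dropped here
    log_w = np.empty(Q)
    for q in range(Q):
        d = x - mu_x[q]
        _, logdet = np.linalg.slogdet(S_xx[q])
        log_w[q] = np.log(alphas[q]) - 0.5 * (logdet + d @ np.linalg.solve(S_xx[q], d))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # posteriors p(q|x)
    # posterior-weighted sum of the per-component regressions
    y = np.zeros_like(np.asarray(mu_y[0], dtype=float))
    for q in range(Q):
        y += w[q] * (mu_y[q] + S_yx[q] @ np.linalg.solve(S_xx[q], x - mu_x[q]))
    return y
```

In practice the joint-density GMM parameters (αq, μq, Σq) would be fitted to the stacked vectors z with an EM implementation such as `sklearn.mixture.GaussianMixture`, and only the partitioned blocks of μq and Σq are needed at conversion time.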
Finally, the inverse MFCC transform is performed on the transformed feature vectors to obtain the converted speech signal.

IV. EVALUATION

A. Experimental Corpus

The corpus used in this work contains Arabic words. Each word has the form Consonant-Vowel-Consonant (CVC) and includes one of the three vowels (a, o, e), such as (ﻗﺎﻝ ﺑﺎﺏ ﺻﺎﻡ ﻓﻴﻞ ﺩﻳﻦ ﻣﻴﻢ ﻓﻮﻝ ﺳﻮﺭ ﻧﻮﺭ) (qal bab s'am fel den mem fol sor nor …) [8]. Each word is segmented into 32 ms frames with a 16 ms overlap. The functions (melfcc.m, invmelfcc.m) [9] are used within the simulation process. Within these functions, the reconstruction is essentially white noise filtered by the estimated MFCC spectral envelopes, which leads to a "whispered" quality [10].

B. Subjective Evaluation

The MFCC feature vector is extracted for both source and target, and the speech is then reconstructed from the MFCC feature vectors. The ABX test [5] is applied to the reconstructed source, the reconstructed target and the transformed words. In this test the listeners decide whether the transformed word is closer to the reconstructed source, the reconstructed target, or neither. The evaluation employs the ABX test for the three types of vowels and all types of conversion: female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f). The results show that most of the listeners confirmed that the converted voice belongs to the reconstructed target. Figure 4 shows the result of the ABX test for all types of conversion.

C. Objective Evaluation

The mean squared error (MSE) between two feature vectors is defined as

    MSE(x, y) = (1/N) ∑_{i=1}^{N} (xi − yi)²                 (7)

where x = {xi | i = 1, 2, …, N} and y = {yi | i = 1, 2, …, N} are two vectors of length N. The MSE between the MFCC of the target and transformed signals and the MSE between the MFCC of the source and transformed signals are shown in Tables I and II, respectively. The MSE is calculated for each of the three vowels and for all types of conversion: female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f). Figure 5 summarizes Tables I and II.

TABLE I. RESULTS OF MSE BETWEEN TARGET AND TRANSFORMED MFCC

    Vowel |  f2f   |  m2m   |  f2m   |  m2f
    ------+--------+--------+--------+--------
      A   | 0.1987 | 0.0948 | 0.0920 | 0.1052
      O   | 0.0711 | 0.0396 | 0.0423 | 0.0678
      E   | 0.0702 | 0.0413 | 0.0367 | 0.0624

TABLE II. RESULTS OF MSE BETWEEN SOURCE AND TRANSFORMED MFCC

    Vowel |  f2f   |  m2m   |  f2m   |  m2f
    ------+--------+--------+--------+--------
      A   | 0.2672 | 0.0684 | 0.1457 | 0.1603
      O   | 0.1887 | 0.1095 | 0.1202 | 0.1100
      E   | 0.1322 | 0.1203 | 0.1223 | 0.1079
[Figure 4. Result of the ABX test (% subjects confirming) for all types of conversion.]

For all types of conversion, female to female (f2f), male to male (m2m), female to male (f2m) and male to female (m2f), the pitch period was plotted versus frame number, as shown in Figure 6. Figure 6 shows that the pitch of the transformed word either increases or decreases towards that of the target, such that the transformed word pitch is closer to the target word pitch.

[Figure 6. Pitch period plot, pitch vs. frame number (source, target, transformed): (a) f2f, (b) m2m, (c) f2m, (d) m2f.]

V. CONCLUSIONS

REFERENCES

[9] PLP and RASTA (and MFCC, and inversion) in Matlab using melfcc.m and invmelfcc.m, LabROSA [online], 2010, http://labrosa.ee.columbia.edu/matlab/rastamat/ (Accessed: 9 September 2011).
[10] Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, 2009.