Professional Documents
Culture Documents
HARMONIC PARAMETERS
Computer engineering department, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus,
palex@bsuir.by
െܤሺ݊ሻ
Figure 1 - Training scheme ߮ሺ݊ሻ ൌ
ቆ ቇǡ
ܣሺ݊ሻ
Source Converted
speech speech
߮ሺ݊ ͳሻ െ ߮ሺ݊ሻ
݂ሺ݊ሻ ൌ ܨ௦ Ǥ
Source Transformed ʹߨ
harmonic/ harmonic/
noise Envelopes noise where
Speech envelopes transformation envelopes Waveform
parameterization using conversion synthesis ܣሺ݊ሻ ൌ
matrix
ேିଵ
ݏሺ݅ሻܨ௦ ʹߨሺ݊ െ ݅ሻ ʹߨ
ൌ ቆ ܨο ቇ
൬ ߮ ሺ݊ǡ ݅ሻ൰ǡ
Source pitch Statistical
Transformed ʹߨሺ݊ െ ݅ሻܨο ܨ௦ ܨ௦
pitch ୀ
pitch
modification ܤሺ݊ሻ ൌ
ேିଵ
െݏሺ݅ሻܨ௦ ʹߨሺ݊ െ ݅ሻ ʹߨ
Figure 2 - Conversion scheme ൌ ቆ ܨο ቇ ൬ ߮ ሺ݊ǡ ݅ሻ൰Ǥ
ʹߨሺ݊ െ ݅ሻܨο ܨ௦ ܨ௦
ୀ
3. SPEECH PARAMETERIZATION BASED ON
INSTANTANEOUS HARMONIC PARAMETERS
ۓ ܨ ሺ݆ሻ ǡ ݊൏݅
ۖ ୀ
The hybrid deterministic/stochastic model assumes that the ߮ ሺ݊ǡ ݅ሻ ൌ
signal ݏሺ݊ሻ can be expressed as the sum of its periodic and ۔െ ܨ ሺ݆ሻ ǡ ݊ ݅Ǥ
noise parts: ۖ ୀ
ە Ͳǡ ݊ൌ݅
ݏሺ݊ሻ ൌ MAG ሺ݊ሻcos߮ ሺ݊ሻ ݎሺ݊ሻǡ Central frequencies of the filter bands are calculated as the
ୀଵ
instantaneous fundamental frequency multiplied by the
number ݇ of the respective harmonic ܨ ሺ݊ሻ ൌ ݂݇ ሺ݊ሻ.
where MAG ሺ݊ሻ - the instantaneous magnitude of the ݇-th
The procedure goes from the first harmonic to the last,
sinusoidal component, ܭis the number of components,
adjusting fundamental frequency at every step. The
߮ ሺ݊ሻ is the instantaneous phase of the ݇-th component and fundamental frequency recalculation formula can be written
ݎሺ݊ሻ is the stochastic part of the signal. Instantaneous phase as follows:
5141
݂ ሺ݊ሻMAG ሺ݊ሻ
݂ ሺ݊ሻ ൌ Ǥ
ሺ݅ ͳሻ σୀ MAG ሺ݊ሻ
ୀ
The training sets (sequences of ܧ௧ and ܧ௦ ) are estimated where ߤ ௌ and ߪ ௌ are the average pitch and standard
from speech of the source and target speakers using the deviation of pitch of the source speaker respectively, ߤ ் and
technique described above. The envelope vectors consist of ߪ ் are the average pitch and standard deviation of pitch of
concatenated harmonic and noise parts. The training the target speaker respectively, ௧ௌ is a given pitch value of
procedure is iterative and implies dynamic time warping of the source speaker. The parameters ߤ ௌ and ߪ ௌ are estimated
the training sequences at every step. The sequences are during training phase. Harmonic and noise envelopes are
warped in order to minimize the conversion error. The transformed using conversion function and then output
spectral-distortion measure minimized by means of dynamic speech signal is synthesized.
programming is mean log spectral distance between The conversion technique was implemented in real-time.
converted and target sequences. The system was originally designed for mobile
communication application. In order to communicate with
Global System for Mobile Communications (GSM) the
5142
Session Initiation Protocol (SIP) was used. Overall 30ms 6
latency of the analysis-conversion-synthesis sequence was
achieved on a personal computer due to optimized C++ 4 GMM
code. FW
2
LR
5. EVALUATION 0
m1-m2 m1-f1 f1-m1 f1-f2 average
5.1. Conversion systems
(a)
The proposed conversion system was implemented in two 6
versions: for narrow-band (ܨ௦ ൌ ͺkHz) and wide-band
4 GMM
(ܨ௦ ൌ ͶͶǤͳkHz) speech. The narrow-band version was tested
with both uncompressed speech signals and speech signal, FW
2
passed through GSM. LR
0
m1-m2 m1-f1 f1-m1 f1-f2 average
5.2. Database
(b)
Four different voices were used in the experiments – two Figure 4 - Mean opinion scores for wide-band speech
male and two female voices (labeled as m1, m2, f1, f2 (a) – identity, (b) – quality
respectively). The database contained 100 Russian sentences
per speaker. The overall duration of the speech database was 6. CONCLUSIONS
about 3 minutes per speaker. The sentences were uttered
with different intonations, natural for the speakers; however, A system for real-time voice conversion has been proposed.
the text of the sentences was the same in order to create The implemented speech parameterization technique
parallel training sequences. All the sentences were sampled performs deterministic/stochastic decomposition by means
at 44.1kHz, then resampled at 8kHz and transmitted through of instantaneous harmonic analysis. An iterative dynamic
GSM. For training 10 parallel sentences were used for each time warping scheme has been proposed in the training
conversion method. phase that provides accurate alignment of the training
sequences subject to the speaker-depended features of the
5.3. Subjective tests spectral envelopes. The conversion scheme uses only one
linear transformation function, however the method showed
Mean opinion scores tests were carried out in order to good results in comparison with GMM and FW-based
evaluate quality and identity of the reconstructed speech. methods.
The subjective test compared three different methods: the
GMM-based approach (labeled as GMM), frequency 7. ACKNOWLEDGMENTS
warping approach (labeled as FW) and the proposed This work was supported in part by the IT Mobile
technique (labeled as LR). Company (Moscow, Russia).
Ten listeners participated in evaluation experiments.
They were asked to listen converted-target sentence pairs 8. REFERENCES
corresponding to three different bases (uncompressed
44.1kHz, uncompressed 8kHz and GSM 8kHz) and four [1] E. Helander, et. al. “Voice Conversion Using Partial Least
conversion directions. First the listeners rated the sentence Squares Regression,” IEEE Transcations on Audio, Speech, and
pairs on a scale from 1=“different person” to 5=”the same Language Processing, Vol. 18, No. 5, pp. 912-921, July 5 2010.
person” regardless to the speech quality. Then the listeners [2] Y. Stylianou, O. Capp.e, and E. Moulines. Continuous
rated the quality of the reconstructed speech on the scale probabilistic transform for voice conversion. IEEE Trans. Speech
1=”bad – synthetic speech” to 5=”excellent – natural and Audio Processing, Vol. 6, No. 2, pp. 131–142, 1998.
speech” using the same sentence pairs. The listening results [3] D. Erro, A. Moreno and A. Bonafonte “Voice Conversion
for wide-band speech are presented in figure 4. The average Based on Weighted Freqeuncy Warping,” IEEE Transcations on
scores for narrow-band uncompressed speech are listed in Audio, Speech, and Language Processing, Vol. 18, No. 5, pp. 922-
931, July 5 2010.
table 1.
[4] E. Azarov and A. Petrovsky, “Instantaneous harmonic analysis
Table 1 - Average mean opinion scores for narrow-band speech for vocal processing” in Proc. DAFx-09, Como, Italy, September
1-4, 2009.
Identity Quality [5] KY Lee, Y Zhao, Statistical Conversion Algorithms of Pitch
GMM FW LR GMM FW LR Contours Based on Prosodic Phrases. Proceedings of the
Uncompressed 3,1 2,1 3,1 2,5 3,1 3,4 International Conference “Speech Prosody 2004”. (SP 2004)”,
GSM 2,8 2,2 2,9 2,2 2,9 3,0 Nara, Japan March 23-26 2004, CD-ROM.
5143