

Voice morphing: An illusion or reality

Conference Paper · February 2018


DOI: 10.1109/ICACS.2018.8333282



VOICE MORPHING: AN ILLUSION OR REALITY

Ijaz Ahmed
Department of Computer and Info Sciences
DWC, Higher College Technology, UAE
iahmed2@hct.ac.ae

Ayesha Sadiq
Faculty of Information Technology
Monash University, Australia
ayesha.sadiq@monash.edu

Muhammad Atif
Department of Computer Sciences
University of Lahore, Pakistan
matifch@gmail.com

Mudasser Naseer
Department of Computer Sciences
COMSATS Institute of Information Technology, Pakistan
mnaseer@ciitlahore.edu.pk

Muhammad Adnan
Department of Computer Sciences
University of Lahore, Pakistan
muhammad.adnan@cs.uol.edu.pk

Abstract: Voice morphing is an emerging technology that has gained popularity in industry over the last few decades, particularly in the entertainment industry, where it is used to enhance voice quality. However, voice morphing tools are also being used by predators to deceive people, so it is a great challenge for the public to determine whether a voice is morphed or not. Voice morphing is a technique to algorithmically change one person's voice into another person's voice. In this paper, we conduct a case study that evaluates the hypothesis: do voice morphing tools/techniques create a successful illusion in a listener's mind? To evaluate our hypothesis, we propose an elementary voice morphing technique based on a set of mask vectors; then we implement the technique as a voice morphing tool in the visual language Pure Data; and finally we evaluate the technique by designing a test bed. Our results indicate that voice morphing technology may effectively create an illusion in a listener's mind.

Index Terms: Voice morphing, human computer interaction, social aspect of digital computing.

I. INTRODUCTION

Voice morphing, also known as voice conversion, is a technique to convert one person's voice into another person's voice [2]. The original and the converted voices are designated as the source and target voices, respectively. In the early era of computing, voice morphing technology was limited to the dubbing studio, but it has now expanded to a number of other applications such as CENET (Telecommunication Company), TTS (Text to Speech Synthesis), and SST (Speech to Speech Translation) applications [12]. A variety of voice morphing tools are used to morph voices for online games, to make voice parodies and to remix songs. Such tools can also be used with internet and mobile phone calls. These tools can change the characteristics (age, sex, etc.) of the original voice [9]. Predators use these tools to deceive people, which may cause emotional disturbance in human societies. For example, fake voice changers can be used to convert a male voice into a female voice and vice versa, and may help predators conceal their gender.

On the contrary, voice recognition is a primitive phenomenon among human beings; therefore, a listener often identifies a speaker even from a single utterance. The way a child instantly turns towards his/her mother's voice tells us how swiftly the brain recognizes voices [5]. This swift recognition illustrates how challenging it is to create a successful illusion (morphing) of a voice.

There are two types of voice morphing techniques, i.e., cross-speaker and intra-speaker morphing. In the cross-speaker technique, the speakers speak two different accents or languages, whereas in the intra-speaker technique, the speakers speak a native language with the same accent. Previous attempts at the cross-speaker technique have proven less successful than the intra-speaker technique [8]. Moreover, a significant number of voice morphing techniques have been discussed in the literature [11]. These techniques implement inter-lingual and intra-lingual voice conversion, based on different models such as voice based models, mixed voice models and signal based models. The most commonly used techniques are Vector Quantization [17], Dynamic Frequency Warping (DFW) [20], and Joint Estimation Analysis Synthesis (JEAS) [14]. Different objective and subjective evaluations, such as global distance and cepstral distortion measurements, MOS (Mean Opinion Score) methods and ABX tests, are used to check the quality of the morphed voice.

The objective of this research is to evaluate the hypothesis: do voice morphing techniques create a successful illusion in a listener's mind? We had two choices to accomplish this objective: (1) perform a statistical analysis on the results of existing techniques to evaluate our hypothesis; (2) propose our own technique, albeit a basic one, to evaluate the hypothesis. We opted for the second option.

Our proposed technique is based on a set of mask vectors. The focus of our work is on intra-speaker voice morphing; therefore we used the voices of two native English speakers. The contributions of this paper are threefold: (1) pitch shifting and an elementary voice morphing technique based on a set of mask vectors, where a mask vector is a ratio of the amplitudes of the target and source speakers; (2) the implementation of the proposed technique; and (3) an experiment that evaluates the
proposed technique.

The rest of this paper is organized as follows: Section II describes related work. Section III introduces the basics of sound analysis. Section IV describes the proposed technique and the architecture of the implemented tool. Section V discusses a test bed and evaluates the results. Section VI concludes the paper and discusses future work.

II. RELATED WORK

This section describes various existing voice morphing techniques. The focus of the study is on the experiments that were designed to evaluate the efficacy of these morphing techniques. In [19], the authors proposed a Subband technique to morph voice. They used pairs of words such as (wonderful, wonderful) and (wonderful, boring) to evaluate the proposed Subband technique. The first pair used the same word, whereas the second pair used different words. A pair of sound files was generated where the first word of the pair was uttered by a speaker, and the second word of the pair was uttered by the same speaker or a different speaker. If the second word of the pair was uttered by a different speaker, then the uttered voice was morphed into the first speaker's voice. Finally, the participants were asked to determine whether the voices of both words belonged to the same speaker or to different speakers. The experiment used ten words uttered by different speakers of different age and sex. The words used are phonetically rich and balanced. The voices of four male and four female speakers belonging to the same geographical area, in this case Turkey, were used to evaluate the technique. The results were mixed: in some cases listeners detected the morphed voice and in others they did not.

In another experiment [18], the researchers suggested that the selection of source and target speakers influences the results of voice morphing. The experiment concluded that a person A's voice can be morphed effectively into a person B's voice, but not into a person C's voice, even though all speakers belong to the same geographical area and have the same accent. The experiment used a voice database of ten male and ten female speakers. The experiment used regular sentences such as "there is a grey carpet on the floor" to record voices. A subjective listening test was conducted to evaluate the quality of the morphing process.

The authors in [4] concluded that the temporal (accent) features of the source and target speakers play a vital role in voice morphing. The temporal features include word-final stop closure duration, voice onset time, average voice duration and average word duration. The experiment further concluded that all temporal features except voice onset time contribute to determining accent. Additionally, word-final stop closure duration is considered the most significant temporal feature for determining accent. It is also learned that the mid-range frequencies (1500-2500 Hz) contribute more to pronunciation than to accent determination.

Rentzos et al. [16] used the Mean Opinion Score (MOS) to evaluate the similarity of a converted source voice to the target voice. The test focused on pitch features and formant transformation. Five words were selected for evaluating the formant transformation between the source and target speaker. The evaluator was asked to listen to three utterances of each word, i.e., the original, the transformed and the same word uttered by the target. The subject was asked to give a score on a scale from 1 to 10, where 1 represents the least similarity and 10 reflects the highest similarity between the transformed and original version. The result of the experiments showed that, on average, the words sounded more like the target speakers (6.62) and less like the source speakers (3.87), which shows that formants play an important role in voice conversion and in the perception of a speaker.

Rao presented a flexible method to modify the shape of the vocal tract and prosody characteristics, such as intonation patterns and duration, to perform voice morphing on an inter-gender (male to female and female to male) basis [15]. In his experiment, he recorded ten Hindi sentences and computed different parameters, e.g., duration, pitch contour, average pitch, average frame energy and energy contour (gain contour), for each utterance of the sentences by a male and a female speaker. Based on the calculated values, each utterance spoken by the female speaker is converted to its corresponding male utterance, and vice versa. He tested the proposed method by performing listening tests with twenty participants. The MOS (Mean Opinion Score) method was used on a scale of 5 points to represent the distortion level and quality of speech, where 5 stands for Excellent and 1 for Unsatisfactory. The MOS scores indicated that there was no perceivable distortion in the morphed voice and that the desired gender voice was produced successfully.

Nakashika et al. proposed a voice conversion algorithm based on probabilistic models called recurrent temporal restricted Boltzmann machines (RTRBMs). An RTRBM is a non-linear probabilistic model used to capture temporal dependencies in time-series data [13]. The proposed system captures the time-dependent unique information of the speaker as well as the deep latent relationship between the source and target speaker spectral vectors in a single neural network. The authors claimed that the results of the proposed technique are highly satisfactory.

Xu and Yang proposed a residual prediction technique to transform the vocal tract parameters of a source speaker's voice into those of a target speaker [21]. The objective was to ensure the high quality of the synthesized speech based on excitations. The proposed strategy consists of two stages: (a) the training stage and (b) the transformation stage. The experimental evaluation was performed on 250 parallel utterances of Mandarin Chinese spoken by a male and a female speaker. The data was sampled at 16 kHz and quantized at 16 bits per sample. The experiment used 180 utterances for training and the rest for testing. An ABX listening test was conducted, in which 5 listeners were asked to judge whether an utterance X (the converted speech) sounds closer to utterance A (the source speech) or B (the target speech). The results indicate that the proposed approach is highly useful.

Like all these previous studies, we also morph a speaker's voice into a target voice; however, we use a set of mask vectors based on voice amplitude. Furthermore, instead of using a single mask vector, we use a set of mask vectors that gives more freedom to the user of the tool to select the most suitable mask vector.

[Figure: Voice -> Window -> DFT -> Log -> IDFT -> Voice]
Fig. 1. Basic steps of sound analysis
III. PRELIMINARIES

This section describes the basics of sound, sound analysis and pitch shifting.

A. Sound Types

Sound signals can be broadly categorized into two major types, i.e., voiced and unvoiced signals [5]. The sound generated by vowels such as /a/, /e/, /i/, /o/, /u/ is of voiced-type, whereas the sound generated by consonants such as /r/, /s/, /p/, /x/ is of unvoiced-type. Voiced signals have a fundamental frequency f0, which is the lowest frequency in a harmonic sequence. Additionally, the fundamental frequency has two formant frequencies, namely f1 and f2, produced by the excitation signal. The position of the tongue in the mouth determines the formant frequencies. The lower position of the tongue in the mouth produces formant frequency f1, and the forward position of the tongue produces formant frequency f2. Unvoiced sounds have no fundamental frequency of the excitation signal, and this excitation is known as white noise. However, for some voices it is difficult to say whether they belong to the voiced or the unvoiced category [10]. Whispering is considered a special kind of sound that also does not have a fundamental frequency of the excitation signal.

B. Sound Analysis

Sound analysis is the first step in voice morphing. Sound analysis calculates the spectral envelope of a sound, i.e., it measures frequency, pitch and magnitude. The DFT (Discrete Fourier Transform) is widely used in signal processing to measure the spectral envelope. Informally speaking, the DFT works like a prism: just as a prism divides light into red, green and blue components, the DFT divides a voice into different components such as fundamental frequency and amplitude. The speech is first passed to a window function, and the output of the window function is then given to the DFT, as shown in Figure 1. After that, the logarithmic spectrum of the data is calculated, and it is then taken back into the time domain by performing the IDFT (Inverse Discrete Fourier Transform).

C. Pitch Shifting

Pitch shifting is a technique to change the pitch of a sound without altering its duration. Time stretching is the reverse technique: it does not change the actual pitch but only the speed (tempo) of the sound, and it can therefore be used to create the effect of pitch shifting. Time stretching is combined with re-sampling to shift pitch. Re-sampling compresses or expands the time scale of a signal, i.e., a sample of the original voice is stored in memory buffers, and the buffers are then read and re-sampled at different rates. Mousa applied the PSOLA (Pitch Synchronous Overlap and Add) technique to alter pitch [1]. Flanagan and Golden introduced the Phase Vocoder method in 1966 to implement pitch shifting and time stretching [7]. The Phase Vocoder uses the STFT (Short Time Fourier Transform) to calculate the frequency and amplitude relationship and then processes the phase by re-sampling it. Finally, the inverse STFT is used to generate the altered voice.

IV. PROPOSED TECHNIQUE AND TOOL

This section describes the details of the proposed technique. First, we present the formant analysis of the source and target voices; secondly, we introduce the high level architecture of the morphing tool; and finally, we present the proposed morphing technique based on a set of mask vectors.

A. Formant Frequency Analysis

The Praat tool [6] is used to analyse the formant frequencies of the source and target voices. Figure 2 shows the formant analysis of two voices belonging to two different persons. The speakers utter the vowel /a/ with a lengthy articulation of /aaaaaa/. The green speckles show person A's voice and the red speckles represent person B's voice. Repeated formant analysis experiments show that the dominant frequency (magnitude) plays a vital part in accent identification. Figure 2 shows that there is a difference between the dominant frequencies of both speakers, and in particular there is a clear difference above 3000 Hz on the y-axis; this fact helps us to infer that the magnitude of a voice has a vital role in accent identification.

Fig. 2. Formant analysis of source and target voices

B. The Architecture of the Tool

The architecture of the developed voice morphing tool is shown in Figure 3. The voice bank stores samples of the voices of target speakers. Spectral analysis is performed on both the source and target speakers. The tool has two main modules, namely the calibration module and the masking module. The calibration module calculates a set of mask vectors, whereas the masking module uses this set of mask vectors to morph the signal as shown in Figure 4. The synthesizer module uses the morphed (altered) signal to generate the morphed voice.

[Figure: the Voice Bank supplies the target voice; Spectral Analysis yields the imaginary and real parts of the source and target voices; the Calibration module calculates magnitudes and computes the mask vector; the Masking module computes the magnitude and phase of the source voice and performs masking; the Synthesizer turns the morphed signals into the morphed voice]
Fig. 3. Architecture of proposed voice morphing tool
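The data flow of this architecture (spectral analysis, calibration, masking, synthesis) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the Pure Data implementation of the tool: the names `dft`, `idft`, `calibrate` and `apply_mask` are hypothetical, the mask is assumed to be the per-bin ratio of target magnitude to source magnitude, windowing is omitted, and the two "voices" are toy sums of sinusoids.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), enough for a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def calibrate(source_frame, target_frame, eps=1e-12):
    """Calibration step: one mask value per frequency bin.

    Assumption: mask = |T| / |S| (target magnitude over source magnitude).
    eps only guards against division by zero in empty bins.
    """
    S = dft(source_frame)
    T = dft(target_frame)
    return [abs(t) / max(abs(s), eps) for s, t in zip(S, T)]


def apply_mask(source_frame, mask):
    """Masking and synthesis: scale the source magnitude, keep the source phase."""
    S = dft(source_frame)
    morphed = [cmath.rect(abs(s) * m, cmath.phase(s)) for s, m in zip(S, mask)]
    return [v.real for v in idft(morphed)]


# Toy frames: same two frequencies, different amplitudes per "speaker".
n = 64
source = [1.0 * math.sin(2 * math.pi * 3 * t / n)
          + 0.2 * math.sin(2 * math.pi * 7 * t / n) for t in range(n)]
target = [0.3 * math.sin(2 * math.pi * 3 * t / n)
          + 0.9 * math.sin(2 * math.pi * 7 * t / n) for t in range(n)]

mask = calibrate(source, target)
morphed = apply_mask(source, mask)

# After masking, the magnitude spectrum follows the target,
# while the phase still comes from the source.
morphed_mag = [abs(v) for v in dft(morphed)]
target_mag = [abs(v) for v in dft(target)]
print(round(morphed_mag[3], 3), round(target_mag[3], 3))
```

In this toy case the mask simply rescales the two active bins, which mirrors the design choice described in the text: the amplitude carries the target's characteristics, and leaving the phase untouched is intended to keep the source signal's timing structure.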
C. The Calibration Module

The calibration module calculates the mask vector, which is the key to voice morphing, as a poorly defined mask vector results in a poorly morphed voice. The formant frequency analysis of Section IV-A indicates that the magnitude of a voice plays a vital role in accent determination, so we calculate the magnitudes of the source and target voices by using Equation 1. The FFT analysis of a voice produces a complex number that can be represented as (a + ib). Therefore, the magnitudes of the source and target voices can be calculated by using Equation 1, i.e., $|S| = \sqrt{a_s^2 + b_s^2}$ and $|T| = \sqrt{a_t^2 + b_t^2}$, where the subscript s denotes the source and the subscript t denotes the target voice. In Equation 1, |S| represents the magnitude of the source voice, and |T| represents the magnitude of the target voice. Later on, we calculate the first mask vector MV_0 by taking the ratio of the magnitudes of the source and target voices, as shown in Equation 2. Similarly, we calculate the other mask vectors by using Equations 3, 4, 5 and 6. Basically, Equations 3, 4, 5 and 6 add the target's voice characteristics into the source voice incrementally. By following the pattern, one may add more properties of the target's voice into the source voice, as reflected in Equation 7. Thus, the set of mask vectors can be defined as MV = {MV_0, MV_1, MV_2, MV_3, MV_4, MV_5, MV_n}. The user of the tool can use any mask vector from the set to create a successful illusion.

D. The Masking Module

The masking module needs three inputs: (a) the magnitude of the source voice |M_s|, (b) a set of mask vectors MV, and (c) the phase of the source voice, as shown in Figure 4. The phase of the source voice is calculated by using the equation Phase_s = atan(b_s / a_s). The magnitude of the source voice is multiplied by the selected mask vector. Notice that we do not change the phase of the source in the whole morphing process, in order to keep the naturalness of the voice intact. Finally, the masking module performs the IFT (Inverse Fourier Transform).

Fig. 4. The masking module

V. EVALUATION

This section describes an experiment that we conducted to evaluate our hypothesis. We selected two male persons as source and target speakers. Both participants belonged to the same region (United Kingdom) and to roughly the same age group (about 28 years of age). They were asked to record a voice sample by uttering the vowel /a/ with a lengthy utterance of /aaaaa/. The length of the recorded voice was around 3 seconds. The voice produced by the utterance of vowels is of voiced-type (as discussed in Section III-A). A voiced-type voice has a strong fundamental frequency f0; therefore the recorded sample is considered a good candidate for calculating the voice magnitude.

After that, both participants were asked to speak a fairly common sentence echoed in the London Metro Trains, i.e., "Please mind the gap". The recorded voices were then morphed into each other by using the proposed technique (described in Section IV) and the traditional pitch shifting technique (described in Section III-C), which yielded seven conditions as reflected in Table I.

Conditions  Description
Cond.1      A's voice unmorphed and B's voice unmorphed
Cond.2      A's voice morphed to B's voice and B's voice unmorphed
Cond.3      A's voice unmorphed and B's voice morphed to A's voice
Cond.4      A's voice morphed to B's voice and vice versa
Cond.5      A's voice pitch shifted to B's voice and B's voice unmorphed
Cond.6      A's voice unmorphed and B's voice pitch shifted to A's voice
Cond.7      A's voice pitch shifted to B's voice and vice versa

TABLE I
VOICE MORPHING CONDITIONS AND THEIR INTERPRETATIONS

A single sound file (in wav format) was created for each condition, which resulted in seven sound files. A Java application was developed to play these sound files in random order. Listening to voices in a predetermined order can affect the recognition process; therefore we preferred to play the sound files in an arbitrary order. We selected 35 listeners (both male and female) to evaluate the quality of the morphed voice. Roughly half of the participants were native English speakers, whereas the other half were Asian and African. The average age of the participants was around 25 years. The experiment was conducted on the premises of an international university, so all participants were students. We provided headphones to the participants so that they could listen to the sound files clearly and the surrounding noise would not affect their judgment. After the participant had listened to each pair of sound files, we asked a simple question: do the two voices belong to a single person or not? The participants had the option to replay the sound file if they had any doubt about their judgment. We observed that most of the participants provided answers on their first attempt for all seven files; however, some participants replayed the sound files up to three times before giving their final judgment. The Java application stored the participants' judgments in a spreadsheet automatically, as shown in Table II. In Table II, the columns represent the conditions, whereas the rows store the participants' judgments. The value "Y" means "Yes" and the value "N" means "No".

K    C.1  C.2  C.3  C.4  C.5  C.6  C.7
1    N    Y    N    Y    N    N    Y
2    N    N    Y    Y    Y    N    Y
3    N    Y    Y    N    N    N    Y
4    N    Y    Y    N    Y    N    N
5    N    N    Y    N    N    N    N
6    N    N    Y    N    N    N    Y
7    N    N    N    Y    N    N    N
8    N    N    Y    Y    N    N    Y
9    N    N    Y    Y    N    N    Y
10   N    Y    Y    Y    N    N    N
11   N    N    N    Y    N    N    N
12   N    Y    Y    N    N    Y    N
13   Y    N    Y    Y    N    N    N
14   Y    Y    N    Y    N    N    N
15   N    Y    Y    N    N    Y    N
16   N    N    Y    N    Y    N    Y
17   Y    N    N    Y    N    N    Y
18   Y    N    N    Y    N    N    Y
19   Y    Y    N    N    N    N    N
20   Y    Y    N    N    N    N    N
21   N    N    Y    Y    N    N    Y
22   N    Y    N    N    Y    N    N
23   Y    Y    N    Y    N    Y    N
24   N    N    Y    Y    N    N    N
25   Y    Y    N    Y    Y    N    N
26   Y    Y    N    N    Y    N    Y
27   N    N    N    Y    Y    Y    Y
28   N    N    N    Y    N    N    Y
29   N    Y    N    Y    N    Y    Y
30   N    N    Y    Y    N    N    N
31   Y    Y    Y    N    Y    N    Y
32   N    N    Y    Y    Y    Y    N
33   N    N    Y    Y    N    N    Y
34   N    N    Y    Y    N    N    N
35   N    Y    Y    Y    N    Y    N

TABLE II
JUDGMENT RECORDED BY PARTICIPANTS AFTER LISTENING TO PAIRS OF VOICES

A. Results

This section discusses the results and statistically evaluates the hypothesis. Table III (calculated from Table II) shows the percentage values of the participants' judgments. The data in column C.1 indicates that the two sampled voices are significantly different from each other, and that a listener can easily identify the difference. Hence the sampled voices were good candidates for this experiment.

Judgment  C.1         C.2         C.3         C.4         C.5         C.6         C.7
YES       10 (28.6%)  16 (45.7%)  19 (54.3%)  23 (65.7%)  9 (25.7%)   7 (20.0%)   16 (45.7%)
NO        25 (71.4%)  19 (54.3%)  16 (45.7%)  12 (34.3%)  26 (74.3%)  28 (80.0%)  19 (54.3%)

TABLE III
PERCENTAGE OF PARTICIPANTS' JUDGMENT

We created four test cases, based on our collected data, to evaluate our hypothesis. These test cases are given in Table IV and the corresponding hypotheses are given in Table V. We use the Chi-square method to test the significance of these test cases [3].

Cases  Description                              Conditions
Case1  Unmorphed Vs. Pitch-Shifting             C.1 Vs. C.5 and C.6
Case2  Unmorphed Vs. Both-Way Pitch-Shifting    C.1 Vs. C.7
Case3  Unmorphed Vs. Morphed                    C.1 Vs. C.2 and C.3
Case4  Unmorphed Vs. Both-Way Morphed           C.1 Vs. C.4

TABLE IV
TEST CASES AND ASSOCIATED CONDITIONS

Cases  Hypotheses
Case1  One-way Pitch-Shifting fails to create any illusion in listener's mind.
Case2  Both-Way Pitch-Shifting fails to create any illusion in listener's mind.
Case3  One-way Morphing fails to create any illusion in listener's mind.
Case4  Both-Way Morphing fails to create any illusion in listener's mind.

TABLE V
TEST CASES AND ASSOCIATED HYPOTHESES

Table VI shows the values of p calculated by using the Chi-square method. For Case 1, the calculated value of p > 0.05, so we accept the hypothesis and conclude that the pitch shifting technique failed to create a successful illusion for the listener. Similarly, for Case 2 the calculated value of p > 0.05, so we accept the hypothesis and conclude that the both-way pitch shifting technique also failed to create a convincing illusion for the listener. However, the calculated value of p for Case 3 is less than 0.05; therefore the stated hypothesis is rejected and we conclude that the proposed morphing technique does create a successful illusion for the listener. Similarly, the calculated value of p for Case 4 is also less than 0.05. Therefore we reject the hypothesis and conclude that the proposed morphing technique creates a successful illusion. Moreover, the much lower value of p for Case 4 indicates that a rich mask vector (based on amplitude) can create significant confusion. Therefore, the results indicate that the proposed voice morphing technique can create an effective illusion in a listener's mind.

Cases  Chi-Sq  df  p      Conclusion
Case1  0.406   1   0.522  Hypothesis Accepted
Case2  2.202   1   0.138  Hypothesis Accepted
Case3  4.375   1   0.036  Hypothesis Rejected
Case4  9.689   1   0.002  Hypothesis Rejected

TABLE VI
RESULTS BY USING CHI-SQ METHOD

VI. CONCLUSION AND FUTURE WORK

The presented work implements two approaches to morph the voice, i.e., pitch shifting and the proposed morphing technique based on a set of mask vectors. The statistical analysis indicates that the former technique (pitch shifting) failed to produce an illusion of a morphed voice, whereas the latter produced an effective illusion in a listener's mind. We learned through experiments that the pitch shifting technique introduces artificial or robotic effects in a voice. However, the pitch shifting technique is useful for cases where one wants to convert a voice from one gender to another. The results of the proposed voice morphing technique indicate that it creates a successful illusion for a listener. Additionally, the technique produces more natural voices than the pitch shifting technique, as we do not alter the phase of the sound in our technique. The phase keeps the naturalness of the sound intact, and we embed the target voice properties into the source voice by applying the mask vector based on amplitude. The amplitude proves to be a good physical property of a voice that carries the formant information.

We conclude this study with the observation that voice morphing techniques are significantly successful in creating an illusion. This poses a great challenge to social and homeland security, which must ensure that a person cannot cheat another person by using a voice morphing tool. Our future work is to develop a technique that automatically determines whether a voice is morphed or not.

REFERENCES

[1] Mousa Allam. Voice conversion using pitch shifting algorithm by time stretching with psola and re-sampling. Journal of electrical engineering, (4):57–61, 2009.
[2] Jonathan Allen, M. Sharon Hunnicutt, Dennis H. Klatt, Robert C. Armstrong, and David B. Pisoni. From Text to Speech: The MITalk System. Cambridge University Press, New York, NY, USA, 1987.
[3] J. Antoch. A guide to chi-squared testing. Computational Statistics and Data Analysis, 23(4):565–566, 1997.
[4] L. M. Arslan and J. H. L. Hansen. Speech enhancement for crosstalk interference. IEEE Signal Processing Letters, 4(4):92–95, April 1997.
[5] Jacob Benesty, M. Mohan Sondhi, and Yiteng (Arden) Huang. Springer Handbook of Speech Processing. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[6] Paul Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345, 2001.
[7] J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, 45(9):1493–1509, 1966.
[8] Hideki Kawahara, Ryuichi Nisimura, Toshio Irino, Masanori Morise, Toru Takahashi, and H Banno. Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown. Acoustics, Speech, and Signal Processing, IEEE International Conference, 0:3905–3908, 04 2009.
[9] Marianne Latinus and Pascal Belin. Human voice perception. Current Biology, 21(4):R143–R145, 2011.
[10] Richard G. Lyons. Understanding Digital Signal Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1996.
[11] Anderson F. Machado and Marcelo Queiroz. Voice conversion: A critical survey, 2010.
[12] Seyed Hamidreza Mohammadi and Alexander Kain. An overview of voice conversion systems. Speech Communication, 88(Supplement C):65–82, 2017.
[13] T. Nakashika, T. Takiguchi, and Y. Ariki. Voice conversion using rnn pre-trained by recurrent temporal restricted boltzmann machines. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):580–587, March 2015.
[14] Arantza Del Pozo. Voice source and duration modelling for voice conversion and speech repair. Technical report, 2008.
[15] K. S. Rao and B. Yegnanarayana. Voice conversion by prosody and vocal tract modification. In Information Technology, 2006. ICIT '06. 9th International Conference on, pages 111–116, Dec 2006.
[16] D. Rentzos, S. Vaseghi, E. Turajlic, Qin Yan, and Ching-Hsiang Ho. Transformation of speaker characteristics for voice conversion. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), pages 706–711, Nov 2003.
[17] K. Shikano, Kai-Fu Lee, and R. Reddy. Speaker adaptation through vector quantization. In ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 11, pages 2643–2646, Apr 1986.
[18] O. Turk and L. M. Arslan. Donor selection for voice conversion. In 2005 13th European Signal Processing Conference, pages 1–4, Sept 2005.
[19] Oytun Turk and Levent M. Arslan. Subband based voice conversion. In Proc. Int. Conf. Spoken Language Processing, 2002.
[20] H. Valbret, E. Moulines, and J.P. Tubach. Voice transformation using psola technique. Speech Communication, 11(2):175–187, 1992. Eurospeech '91.
[21] Ning Xu and Zhen Yang. A precise estimation of vocal tract parameters for high quality voice morphing. In 2008 9th International Conference on Signal Processing, pages 684–687, Oct 2008.
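The chi-square statistics reported in the Results for Cases 3 and 4 can be recomputed from the YES/NO counts of Table III. The sketch below is a minimal illustration under assumptions that the paper does not state explicitly: a Pearson chi-square on a 2x2 table without continuity correction, and, for Case 3, pooling the counts of C.2 and C.3 into one group. The function name `chi_square_2x2` is hypothetical. Under these assumptions the sketch gives approximately 4.375 for Case 3 and 9.689 for Case 4, in line with Table VI.

```python
import math


def chi_square_2x2(yes_a, no_a, yes_b, no_b):
    """Pearson chi-square for a 2x2 table (no continuity correction), df = 1."""
    observed = [[yes_a, no_a], [yes_b, no_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [yes_a + yes_b, no_a + no_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With one degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# (YES, NO) counts per condition, taken from Table III (35 listeners each).
c1 = (10, 25)
c2 = (16, 19)
c3 = (19, 16)
c4 = (23, 12)

# Case 4: unmorphed (C.1) vs. both-way morphed (C.4).
chi2_case4, p_case4 = chi_square_2x2(c1[0], c1[1], c4[0], c4[1])

# Case 3: unmorphed (C.1) vs. morphed (C.2 and C.3 pooled; pooling is an assumption).
chi2_case3, p_case3 = chi_square_2x2(c1[0], c1[1], c2[0] + c3[0], c2[1] + c3[1])

print(f"Case 3: chi2={chi2_case3:.3f}, p={p_case3:.3f}")
print(f"Case 4: chi2={chi2_case4:.3f}, p={p_case4:.3f}")
```

A p-value below 0.05 leads to rejecting the "fails to create any illusion" hypothesis for that case, which is the decision rule applied in Table VI.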
