Published by Jerin Antony
final year seminar report.

Published by: Jerin Antony on Apr 06, 2011
Advanced voice morphing 
Department of ECE 
Govt. College Of Engg. Kannur 
Voice morphing, which is also referred to as voice transformation and voice conversion, is a technique for modifying a source speaker's speech to sound as if it was spoken by some designated target speaker. There are many applications of voice morphing, including customizing voices for text-to-speech (TTS) systems, transforming voice-overs in adverts and films to sound like that of a well-known celebrity, and enhancing the speech of impaired speakers such as laryngectomees. Two key requirements of many of these applications are that, firstly, they should not rely on large amounts of parallel training data where both speakers recite identical texts and, secondly, the high audio quality of the source should be preserved in the transformed speech.
The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker. Various approaches have been proposed for doing this, such as codebook mapping, formant mapping, and linear transformations. Codebook mapping, however, typically leads to discontinuities in the transformed speech. Although some discontinuities can be resolved by some form of interpolation technique, the conversion approach can still suffer from a lack of robustness as well as degraded quality. Formant mapping, on the other hand, is prone to formant tracking errors. Hence, transformation-based approaches are now the most popular.
In particular, the continuous probabilistic transformation approach introduced by Stylianou provides the baseline for modern systems. In this approach, a Gaussian mixture model (GMM) is used to classify each incoming speech frame, and a set of linear transformations weighted by the continuous GMM probabilities are applied to give a smoothly varying target output. The linear transformations are typically estimated from time-aligned parallel training data using least mean squares.
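As a rough illustration of the GMM-weighted transformation described above, the following Python sketch converts one spectral frame x as y = sum over m of p(m | x) * (W_m x + b_m), where p(m | x) is the posterior probability of mixture component m. The diagonal covariances, the toy two-component model, and all names (posteriors, convert_frame, gmm, transforms) are assumptions made for illustration. The transforms are taken as given rather than estimated from parallel data, and the sketch is not presented as Stylianou's exact formulation.

```python
import math

def gaussian_pdf_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def posteriors(x, weights, means, variances):
    """Posterior probability p(m | x) of each mixture component for frame x."""
    joint = [w * gaussian_pdf_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # assumes x is not so far from every component that this underflows
    return [j / total for j in joint]

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def convert_frame(x, gmm, transforms):
    """Blend the per-component linear transforms using the GMM posteriors."""
    post = posteriors(x, *gmm)
    y = [0.0] * len(transforms[0][1])
    for p, (W, b) in zip(post, transforms):
        Wx = matvec(W, x)
        for k in range(len(y)):
            y[k] += p * (Wx[k] + b[k])
    return y

# Hypothetical toy model: two components in a 2-D feature space.
gmm = (
    [0.5, 0.5],                # mixture weights
    [[0.0, 0.0], [4.0, 4.0]],  # component means
    [[1.0, 1.0], [1.0, 1.0]],  # diagonal variances
)
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]),  # (W_0, b_0): shift by one
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),  # (W_1, b_1): scale by two
]

# A frame midway between the two components gets an equal blend of both transforms.
print(convert_frame([2.0, 2.0], gmm, transforms))  # [3.5, 3.5]
```

Because the weights p(m | x) change continuously as x moves through the feature space, the output moves smoothly from one transform to the next instead of jumping at class boundaries, which is the property that distinguishes this approach from hard codebook mapping.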
More recently, Kain has proposed a variant of this method in which the GMM classification is based on a joint density model. However, like the original Stylianou approach, it still relies on parallel training data. Although the requirement for parallel training data is often acceptable, there are applications which require voice transformation for nonparallel training data. Examples can be found in the entertainment and media industries, where recordings of unknown speakers need to be transformed to sound like well-known personalities. Further uses are envisaged in applications where the provision of parallel data is impossible, such as when the source and target speakers speak different languages.
Although interpolated linear transforms are effective in transforming speaker identity, the direct transformation of successive source speech frames to yield the required target speech will result in a number of artifacts. The reasons for this are as follows. First, the reduced dimensionality of the spectral vector used to represent the spectral envelope and the averaging effect of the linear transformation result in formant broadening and a loss of spectral detail. Second, unnatural phase dispersion in the target speech can lead to audible artifacts, and this effect is aggravated when pitch and duration are modified. Third, unvoiced sounds have very high variance and are typically not transformed. However, in that case, residual voicing from the source is carried over to the target speech, resulting in a disconcerting background whispering effect.
To achieve high-quality voice conversion, the conversion system includes a spectral refinement approach to compensate for the spectral distortion, a phase prediction method for natural phase coupling, and an unvoiced sounds transformation scheme. Each of these techniques is assessed individually, and the overall performance of the complete solution is evaluated using listening tests. Overall, it is found that the enhancements significantly improve the quality of the converted speech.
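The averaging effect named above as the first cause of artifacts can be seen in a toy example. The sketch below averages two spectral envelopes whose formant peaks sit at different frequencies and compares peak height and spread before and after. The Gaussian peak shape, the 500 Hz and 700 Hz centres, the 50 Hz bandwidth, and the function names are all assumptions chosen for illustration. Real systems average in a parametric feature space rather than directly on envelopes like these.

```python
import math

def formant_peak(freqs, center_hz, bandwidth_hz):
    """Toy envelope: a single Gaussian-shaped formant peak (illustrative only)."""
    return [math.exp(-0.5 * ((f - center_hz) / bandwidth_hz) ** 2) for f in freqs]

def spread_hz(freqs, envelope):
    """Energy-weighted standard deviation of the envelope along frequency."""
    total = sum(envelope)
    mean = sum(f * e for f, e in zip(freqs, envelope)) / total
    var = sum(e * (f - mean) ** 2 for f, e in zip(freqs, envelope)) / total
    return math.sqrt(var)

freqs = list(range(0, 1201, 10))           # 0 to 1200 Hz in 10 Hz steps
env_a = formant_peak(freqs, 500.0, 50.0)   # hypothetical formant at 500 Hz
env_b = formant_peak(freqs, 700.0, 50.0)   # hypothetical formant at 700 Hz

# Equal-weight average, standing in for the blending done by a linear transform.
averaged = [0.5 * (a + b) for a, b in zip(env_a, env_b)]

print("peak height:", max(env_a), "->", max(averaged))
print("spread (Hz):", spread_hz(freqs, env_a), "->", spread_hz(freqs, averaged))
```

In this toy case the averaged envelope has roughly half the peak height and more than twice the spread of either input, which is the kind of flattening that a spectral refinement step is meant to counteract.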
Speech morphing can be achieved by transforming the signal's representation from the acoustic waveform obtained by sampling the analog signal, with which many people are familiar, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames', which are short sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band. Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound.
A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply crossfaded, it is highly likely that two separate sounds would be heard. This occurs because the signal will have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout. The pitch information of each sound is compared to provide the best match between the two signals' pitches. To do this match, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, which creates the intermediate sounds in the morph. The final stage is then to convert the frames back into a normal waveform.
However, after the morphing has been performed, the legacy of the earlier analysis becomes apparent. The conversion of the sound to a representation in which the pitch and spectral envelope can be separated loses some information. Therefore, this information has to be re-estimated for the morphed sound. This process obtains an acoustic waveform, which can then be stored or listened to.
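The matching and interpolation steps described above can be sketched in a few lines of Python. The text does not name a specific alignment algorithm, so this sketch assumes dynamic time warping (DTW) on per-frame pitch values as one common way to stretch and compress two contours against each other, followed by linear interpolation controlled by a morph factor alpha. The function names, the toy pitch contours, and the choice of linear interpolation are all illustrative assumptions. The spectral envelope and the final resynthesis are not covered.

```python
def dtw_path(a, b):
    """Align two pitch contours with dynamic time warping.

    Returns a list of (i, j) index pairs saying which frame of a
    is matched to which frame of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

def morph_pitch(a, b, alpha):
    """Interpolate pitch along the aligned path; alpha=0 gives a, alpha=1 gives b."""
    return [(1.0 - alpha) * a[i] + alpha * b[j] for i, j in dtw_path(a, b)]

# Hypothetical per-frame pitch contours in Hz: same shape, different timing.
pitch_a = [100.0, 100.0, 150.0, 200.0, 200.0]
pitch_b = [110.0, 160.0, 160.0, 160.0, 210.0]

print(dtw_path(pitch_a, pitch_b))
print(morph_pitch(pitch_a, pitch_b, 0.5))
# [105.0, 105.0, 155.0, 155.0, 155.0, 205.0, 205.0]
```

Each aligned frame pair yields a single intermediate pitch value, so the morph carries one pitch contour rather than the two competing pitches that a plain crossfade would produce.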

