CONTENTS

INTRODUCTION
PROPERTIES OF SPEECH SIGNALS
ITS MECHANISM AND CLASSIFICATION
WHAT IS VOICE MORPHING
PROTOTYPE WAVEFORM INTERPOLATION
PWI-BASED SPEECH MORPHING
THE BASIC ALGORITHM
COMPUTATION OF THE CHARACTERISTIC WAVEFORM SURFACE
NEW VOCAL TRACT MODEL CALCULATION AND SYNTHESIS
CONCLUSIONS
REFERENCES
INTRODUCTION
Voice Morphing
Speech signals convey a wide range of information, of which the most important is the meaning of the message being uttered. However, secondary information such as speaker identity also plays a major role in oral communication. Voice alteration techniques attempt to transform the speech signal uttered by a given speaker so as to disguise the original voice. Beyond disguise, it is also possible to modify the original voice to sound like another speaker, the target speaker; this is generally known as voice morphing. A considerable amount of research effort has been directed at this problem because of the numerous applications of the technology. Some examples of applications that will benefit from successful voice conversion techniques are:
Customization of text-to-speech systems, e.g., to speak with a desired voice or to read out email in the sender's voice. Voice conversion could also be applied to voice individuality disguise for secure communications, or to voice individuality restoration for interpreting telephony. In the entertainment industry, a voice morphing system may well replace or enhance the skills involved in producing soundtracks for animated characters, dubbing, or voice impersonation.
In Internet chat rooms, communication can be enhanced by technology for disguising the voice of the speaker.
Communication systems can be developed that would allow speakers of different languages to hold conversations. Such systems would first recognize the sentence uttered by each speaker and then translate and synthesize it in the other language. Another application is assisting hearing-impaired persons: people with hearing loss at high frequencies can benefit from a device that appropriately reshapes the spectral envelope of the speech signal.
Most authors agree that the mapping-codebook approach, although it sometimes provides an impressive voice conversion effect, is plagued by poor quality and a lack of robustness. We believe, however, that the topologically oriented parametric model used in this dissertation to describe our speech space may increase the robustness of the transformation and produce speech of improved quality in terms of resembling the target speaker.
the acoustic energy can escape. The articulators are movable structures that shape the vocal tract, determining its resonant properties. This subsystem also provides an obstruction in some cases, or generates noise for certain speech types.
a high enough velocity. This creates a broad-spectrum noise source that excites the vocal tract. An example of a fricative sound is the sh in the word shall. Plosive sounds result from making a complete closure, building up pressure behind it, and releasing it abruptly; the rapid release of this pressure causes a transient excitation. An example of a plosive sound is the b in the word beer. Thus, for voiced sounds the excitation is a periodic train of pulses, whereas for unvoiced sounds it is random noise. The resonant properties of the vocal tract mentioned earlier are its characteristic resonant frequencies. Since these resonances tend to shape the overall speech spectrum, we refer to them as formant frequencies; they usually appear as peaks in the speech spectrum.
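The two excitation modes described above can be illustrated with a small sketch; the sample rate and the 125 Hz pitch are assumed values for the illustration, not figures from the text.

```python
import numpy as np

fs = 16000                       # sample rate in Hz (assumed)
n = int(fs * 0.5)                # half a second of excitation

# Voiced excitation: a periodic train of pulses at an assumed 125 Hz pitch.
period = int(round(fs / 125.0))  # pitch period in samples
voiced = np.zeros(n)
voiced[::period] = 1.0

# Unvoiced excitation: broad-spectrum random noise.
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(n)
```

Passing either excitation through the same vocal-tract filter yields a voiced or an unvoiced version of the same sound.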
The cross-sectional areas of the tubes are chosen so as to approximate the area function of the vocal tract. If a large number of short tubes is used, we can reasonably expect the formant frequencies of the concatenated tubes to be close to those of a tube with a continuously varying area function. The most important advantage of this representation is that the lossless tube model provides a convenient transition between continuous-time and discrete-time models.
the model are chosen such that the resulting output has the desired speech-like properties. The simplest physical configuration with a useful interpretation in terms of the speech production process models the vocal tract as a concatenation of uniform lossless tubes. Figure 2.3 shows a schematic of the vocal tract and its approximation as a concatenation of uniform lossless tubes.
Let us summarize what we have learned so far. Sound is generated by two different types of excitation, and each mode results in a distinctive type of output. Furthermore, we have learned that the vocal tract imposes its resonances upon the excitation so as to produce the different sounds of speech. Based on these facts, and after extensive model analysis, it has been concluded that the lossless tube discrete-time model can be represented by a transfer function

H(z) = G / (1 - sum_{k=1}^{N} a_k z^(-k))

where the gain G and the coefficients a_k depend upon the area function of the vocal tract, z^(-1) is the unit delay in the z-transform domain, and N is the number of lossless tubes of equal length whose concatenation we assume to approximate the vocal tract being modeled. The poles of the system are the values of z for which the denominator vanishes. As the transfer function has z-dependent terms in the denominator only, it is characteristic of an all-pole digital filter. We can thus say that a speech signal is produced when an excitation signal is passed through such an all-pole filter.
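The all-pole model can be sketched directly from the difference equation implied by H(z); the two-pole coefficients below are assumed values standing in for a single formant resonance, not taken from the text.

```python
import numpy as np

def allpole_synthesize(excitation, a, gain=1.0):
    """All-pole filter H(z) = G / (1 - sum_k a_k z^-k), realized by the
    difference equation s[n] = G*e[n] + a_1*s[n-1] + ... + a_N*s[n-N]."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s

# A stable two-pole resonator standing in for one formant (coefficients assumed).
e = np.zeros(64)
e[0] = 1.0                                   # unit-impulse excitation
out = allpole_synthesize(e, a=[1.5, -0.81])  # decaying oscillation (one formant)
```

The impulse response of such a filter is a damped oscillation at the formant frequency, which is what the pole pair contributes to the speech spectrum.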
The Challenge
In order to perform voice morphing successfully, the speech characteristics of the source speaker's voice must change gradually into those of the target speaker's; therefore, the pitch, duration, and spectral parameters must be extracted from both speakers. Then, natural-sounding synthetic intermediates have to be produced.
The sinusoidal model has been widely used for speech representation and modification in recent years.
The model parameters are chosen to minimize the squared error between the original and the reconstructed signal:

eps = sum_n [ s(n) - s~(n) ]^2

where s(n) is the original speech frame and s~(n) its sinusoidal-model reconstruction.
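The squared-error criterion between an original frame s(n) and a model reconstruction s~(n) can be illustrated numerically; the frames below are made-up toy values.

```python
import numpy as np

# Toy frames: an "original" frame s and a model reconstruction s_tilde.
s = np.array([0.0, 1.0, 0.5, -0.5])
s_tilde = np.array([0.0, 0.9, 0.6, -0.4])

# Squared-error criterion: eps = sum_n (s[n] - s_tilde[n])**2
eps = np.sum((s - s_tilde) ** 2)
```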
Time Modification
The PSHM model allows the analysis frames to be regarded as phase-independent units which can be arbitrarily discarded, copied, or modified.
Segmental Cues
Formant locations and bandwidths, spectral tilt, etc., can be modeled by the spectral envelope. In our research, Line Spectral Frequencies (LSFs) are used to represent the spectral envelope.
easy to ensure that the periodic extension procedure (see below) does not introduce artifacts in the characteristic waveform shape [16]. This is because the residual signal contains mainly excitation pulses, with low-power regions in between, and thus allows a smooth reconstruction of the residual signal from the PWI surface with minimal phase discontinuities. In the proposed algorithm, the surfaces of the residual error signals, computed for each voiced phoneme of the two speakers, are interpolated to create an intermediate surface. Together with an intermediate pitch contour and an interpolated vocal-tract filter, a new voiced phoneme is produced.
only. The unvoiced segments were left intact and concatenated with the interpolated voiced segments. Concatenation of voiced and unvoiced segments was performed by overlapping and adding two adjacent frames of about 20 milliseconds each, one from the voiced phoneme and the other from the unvoiced one. The two frames are overlapped after each is multiplied by a half-left or half-right Hanning or linear window, yielding a new frame that changes gradually from the voiced segment to the unvoiced segment, or vice versa. Since this operation may shorten the utterance, time-scale compensation is carried out for each vocal segment by extending the PWI surface as necessary. The unvoiced segments are taken according to the morphing factor (α): from the first speaker where 0 ≤ α < 0.5, and from the second one where 0.5 ≤ α < 1.0.

The basic block diagram of the algorithm is shown in the accompanying figure. In the analysis stage, the voiced segments of both speech signals are marked, and each section in one of the voices is associated with the corresponding section in the other. The segmentation and mapping of the speech segments are done semi-automatically. First, a simple algorithm for voiced/unvoiced segmentation is applied, based on three parameters: the short-time energy, the normalized maximal peak of the autocorrelation function in the lag range of 3-16 milliseconds (the expected range of pitch period durations), and the short-time zero-crossing rate. The output of the automatic voiced/unvoiced segmentation is a series of voiced segments for each of the two voices. However, because the segmentation algorithm is imperfect, the characteristics of the two voices differ, and accurate mapping between the corresponding voiced sections of the two voices is crucial for the success of the algorithm, a manual correction mode has been added to refine the preliminary segmentation.
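The three voiced/unvoiced cues named above can be computed per frame roughly as follows; the frame length, the synthetic test signal, and any decision thresholds are assumptions, since the text gives only the feature names and the 3-16 ms lag range.

```python
import numpy as np

def vuv_features(frame, fs):
    """Per-frame cues for a voiced/unvoiced decision: short-time energy,
    normalized autocorrelation peak over 3-16 ms lags, zero-crossing rate."""
    frame = frame - frame.mean()
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    lo, hi = int(0.003 * fs), int(0.016 * fs)      # candidate pitch-period lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    peak = ac[lo:hi + 1].max() / ac[0] if ac[0] > 0 else 0.0
    return energy, peak, zcr

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
voiced_like = np.sin(2 * np.pi * 125 * t)   # periodic test frame, 8 ms period
energy, peak, zcr = vuv_features(voiced_like, fs)
# A voiced frame shows a high autocorrelation peak and a low zero-crossing rate.
```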
The manual mode allows adjustments to the edges of segments, splitting segments, joining segments, and adding or deleting segments. For the demarcation of phoneme boundaries, the user can be assisted by a graph of the 2-norm of the difference between the MFCCs of adjacent frames. It was found that, by applying manual segmentation as a refinement of the automatic segmentation, accurate mapping can be reached with only small adjustments. A pitch detection algorithm is applied to both speakers' utterances. It is based on a combination of the cepstral method, for coarse pitch-period detection, and the cross-correlation method, for refining the results. Pitch marks are obtained, and after pre-emphasis, linear prediction coefficients are calculated for each voiced phoneme (either on the whole phoneme as one windowed segment, or on consecutive frames).
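A minimal sketch of the coarse cepstral stage follows; the 60-320 Hz search band and the synthetic pulse-train input are assumptions, and the cross-correlation refinement mentioned above is omitted.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=320.0):
    """Coarse pitch estimate: peak of the real cepstrum within the
    quefrency band corresponding to [fmin, fmax]."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    peak_q = qmin + np.argmax(cepstrum[qmin:qmax + 1])  # quefrency of the peak
    return fs / peak_q

fs = 16000
frame = np.zeros(1024)
frame[::128] = 1.0              # pulse train with a 128-sample (125 Hz) period
f0 = cepstral_pitch(frame, fs)  # should land close to 125 Hz
```

The harmonic comb of a periodic signal makes the log spectrum itself periodic, which is why its "spectrum" (the cepstrum) peaks at the pitch period.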
An example of such a set of parameters is the PARCOR coefficients k_i, which are related to the areas A_i of the lossless tube sections modeling the vocal tract [20], as given by the following equation:

k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)
The value of the first area function parameter (A_1) is arbitrarily set to 2. A new set of LPC parameters defining a new vocal tract is computed using an interpolation of the two area vectors (source and target). This choice of area parameters seems more reasonable, since intermediate vocal tract models should reflect intermediate dimensions [25]. Let the source and target vocal tracts be modeled by N lossless tube sections, with area vectors A_s and A_t. The new signal's vocal tract is then represented by

A_new,i = (1 - α) A_s,i + α A_t,i

where α is the morphing factor.
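Using the PARCOR-area relation k_i = (A_{i+1} - A_i)/(A_{i+1} + A_i) with A_1 fixed at 2, the area interpolation step can be sketched as follows; the coefficient values and α = 0.5 are made-up examples, not data from the study.

```python
import numpy as np

def parcor_to_areas(k, A1=2.0):
    """Area function from PARCOR coefficients via
    A_{i+1} = A_i * (1 + k_i) / (1 - k_i), with A_1 fixed."""
    A = [A1]
    for ki in k:
        A.append(A[-1] * (1.0 + ki) / (1.0 - ki))
    return np.array(A)

def areas_to_parcor(A):
    """Inverse mapping: k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)."""
    return np.array([(A[i + 1] - A[i]) / (A[i + 1] + A[i])
                     for i in range(len(A) - 1)])

# Interpolate the two area vectors with morphing factor alpha (example values).
k_src = np.array([0.3, -0.2, 0.1])
k_tgt = np.array([-0.1, 0.4, 0.0])
alpha = 0.5
A_new = (1 - alpha) * parcor_to_areas(k_src) + alpha * parcor_to_areas(k_tgt)
k_new = areas_to_parcor(A_new)   # PARCORs of the intermediate vocal tract
```

Because all areas stay positive, the interpolated PARCOR values remain inside (-1, 1), so the intermediate filter is guaranteed stable.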
After calculating the new areas, the prediction filter is computed and the new vocal phoneme is synthesized according to the following scheme: (1) compute the new PARCOR parameters from the new areas by inverting the area-PARCOR relation; (2) compute the coefficients of the new LPC model from the new PARCOR parameters; (3) filter the new excitation signal through the new vocal-tract filter to obtain the new vocal phoneme.

When temporal voice morphing is applied, informal subjective listening tests performed on different sets of morphing parameters revealed that, in order to obtain a linear perceptual change between the voices of the source and the target, the coefficient α(t) (the relative part of the source contribution u_s(t, ·)) must vary nonlinearly in time (like the one in Figure 8). When the coefficient changed linearly with time, the listeners perceived an abrupt change from one identity to the other. Using the nonlinear option, that is, gradually changing the identity of one speaker to that of the other, a smooth modification of the source properties to those of the target was achieved. In another subjective listening test (see below), performed on morphing from a woman's voice to a man's voice and vice versa, both uttering the same sentence, the morphed sound was perceived as changing smoothly and naturally from one speaker to the other. The quality of the
morphed voices was found to depend upon the specific speakers, the differences between their voices, and the content of the utterance. Further research is required to accurately evaluate the effect of these factors.
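Steps (2) and (3) of the synthesis scheme above can be sketched with the standard step-up recursion from PARCOR to LPC coefficients; the reflection coefficients below are example values, not data from the study.

```python
import numpy as np

def parcor_to_lpc(k):
    """Step-up recursion: PARCOR (reflection) coefficients -> prediction
    polynomial A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    a = np.array([1.0])
    for km in k:
        a = np.concatenate([a, [0.0]])
        a = a + km * a[::-1]
    return a

def synthesize(excitation, a):
    """All-pole synthesis 1/A(z): s[n] = e[n] - sum_j a_j s[n-j]."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * s[n - j]
        s[n] = acc
    return s

a = parcor_to_lpc([0.3, -0.2])   # example PARCOR values, |k| < 1 (stable)
e = np.zeros(32)
e[0] = 1.0                       # unit-impulse excitation
out = synthesize(e, a)           # new vocal phoneme (toy scale)
```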
CONCLUSIONS
In this study, a new speech morphing algorithm is presented. The aim is to produce natural-sounding hybrid voices between two speakers uttering the same content. The algorithm is based on representing the residual error signal as a PWI surface, and the vocal tract as a lossless tube area function. The PWI surface incorporates the characteristics of the excitation signal and enables reproduction of a residual signal with a given pitch contour and time duration that includes the dynamics of both speakers' excitations. PWI surfaces can also be exploited efficiently for speech coding, and therefore allow higher compression of the speech database. The area function was used in an attempt to reflect an intermediate configuration of the vocal tract between the two speakers [25]. The utterances produced by the algorithm were shown to be of high quality and to consist of intermediate features of the two speakers.

There are at least two modes in which the morphing algorithm can be used. In the first mode, the morphing factor is invariant: for example, taking a factor of 0.5 yields a morphed signal whose characteristics lie between the two voices for the whole duration of the articulation. In the second mode, we start from the first (source) speaker (morphing factor α = 0), and the morphing factor is changed gradually along the duration of the sentence, so that its value is 1 at the end of the sentence. The same morphing factor was used for both the excitation and the vocal tract parameters. In this time-varying version of the algorithm, that is, when morphing gradually from one voice to another over time, smooth morphing was achieved, producing a highly natural transition between the source and the target speakers. This was assessed by subjective evaluation tests, as previously described. The algorithm described here is capable of producing longer, smoother, and more natural-sounding utterances than previous studies [1, 3].
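The time-varying mode described above needs a trajectory for the morphing factor; a raised-cosine curve is one possible nonlinear choice (the specific curve here is an assumption, since the study only reports that a nonlinear α(t) sounded perceptually linear).

```python
import numpy as np

T = 100                                         # number of synthesis frames
t = np.linspace(0.0, 1.0, T)
alpha_linear = t                                # perceived as an abrupt identity switch
alpha_smooth = 0.5 * (1.0 - np.cos(np.pi * t))  # eases in and out of the transition
# Both trajectories run from the source (alpha = 0) to the target (alpha = 1).
```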
One of the advantages of the proposed algorithm is that, in addition to the interpolation of the vocal tract features, interpolation is also performed between the two PWI surfaces of the corresponding residual signals, thus capturing the evolution of the excitation signals of both speakers. In this way, a hybrid excitation signal can be produced that contains intermediate characteristics of both excitations, and a more natural morphing between utterances can be achieved, as has been demonstrated. Furthermore, our algorithm performs the interpolation of
the residual signal regardless of the pitch information, since the pitch data is normalized within the PWI surface. Therefore, the morphed pitch contour is extracted independently and can be manipulated separately. In addition, the current approach enables morphing between utterances with different pitches, between male and female voices, or between voices of different and perceptually distant timbres. Kawahara and his colleagues [26, 27] implemented a morphing system based on interpolation between time-frequency representations of the source and the target signals. It appears that the STRAIGHT-based morphing system (see [2, 26]) was able to produce intermediate voices of higher clarity than our algorithm, but the need to assign multiple anchor points for each short segment, using visual inspection and phonological knowledge, is a noticeable disadvantage of that system, which can make it difficult to use for morphing between long utterances. Further research is needed to improve the quality of the morphed signals, which are natural-sounding but somewhat degraded compared to the originals.
REFERENCES
Yizhar Lavner and Gidon Porat, "Voice Morphing Using 3-D Waveform Interpolation Surfaces and Lossless Tube Area Functions."

Hui Ye and Steve Young, "High Quality Voice Morphing," Cambridge University Engineering Department, Trumpington Street, Cambridge, England, CB2 1PZ.

Hui Ye and Steve Young, "Quality-Enhanced Voice Morphing Using Maximum Likelihood Transformations."

C. Orphanidou, "Wavelet-Based Voice Morphing," Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, Oxford OX1 3LB, UK.