
University of Victoria

Faculty of Engineering
Fall 2007 ENGR446 Report

Pitch shifting of voices in real-time

Sajedur Rahman
02-35163
Computer Engineering
srahman@engr.uvic.ca

January 04th, 2008

in partial fulfillment of the requirements of the


B.Eng. Degree
3490 Upper Terrace Road
Victoria, British Columbia
V8R 6E7

Mr. Duncan Hogg


Co-op Coordinator
Faculty of Engineering
University of Victoria
P.O. Box 1700
Victoria, B.C.
V8W 2Y2

January 04th, 2008

Dear Mr. Hogg,

Please accept the accompanying report entitled "Pitch shifting of voices in real-time."

This report is the result of a project I worked on in the fall of 2007. Out of a personal
interest in technologies and software that process vocal data, I have researched available
techniques for modifying the pitch of a voice contained in an audio signal. This work was
also motivated by a previous course project, the documentation for which is included with
this report.

I have analyzed the approaches used in the previous project, as well as other approaches to
vocal pitch shifting, in order to design software that would be able to change the pitch of a
voice in real time. The subject of this report is the analysis that was done for the design of
the software, with reference to the MATLAB prototype used for the analysis.

Through the course of the project, I have learned a lot about different techniques used to
process audio signals. I believe this knowledge would be extremely valuable for any
future work I do on signal processing.

Sincerely,
Sajedur Rahman
Table of Contents

List of Figures
List of Tables
1 Glossary
2 Summary
3 Introduction
   3.1 Objective of the report
   3.2 Background [1]
      3.2.1 Frequency representation
      3.2.2 Window functions
      3.2.3 Nature of vocal sounds
      3.2.4 LPC
      3.2.5 Work done on voice processing
   3.3 Motivation and overview
   3.4 Outline
4 Discussion
   4.1 Description of the requirements
   4.2 Overview of potential solutions
      4.2.1 Factors affecting the choice of solution
      4.2.2 Summary of solutions attempted
   4.3 Attempted approaches
      4.3.1 Shifting in frequency domain
      4.3.2 Frequency scaling in frequency domain and re-sampling
      4.3.3 LPC and scaling the frequency of the excitation
      4.3.4 Time domain stretching and re-sampling
   4.4 Design of the solution
      4.4.1 Prototype
      4.4.2 Algorithm for time stretching [1]
      4.4.3 Algorithm for re-sampling
      4.4.4 Performance of the algorithms
      4.4.5 Optimizations
      4.4.6 Implementation
5 Conclusion
6 Recommendation
7 List of References
Appendix A
Appendix B

List of Figures

Figure 1 - Definitions of FFT and IFFT from MATLAB documentation
Figure 2 - Block Diagram for LPC Processing
Figure 3 - The magnitude spectrum of a sound shifted by a factor of 1.5
Figure 4 - Block diagram of initial design
Figure 5 - Graph of running time of time stretching algorithm
Figure 6 - Diagram for optimized system

List of Tables

Table 1 - Running times of time stretching for different orders of operations

1 Glossary

Analog         A quantity that can take any value within a certain range
Artifact       A distortion in a signal produced as a side effect of a modification applied to it
Digital        A quantity that can only take specific values within a certain range
MARSYAS        Music Analysis, Retrieval and Synthesis for Audio Signals
Nyquist Limit  The condition that only frequencies up to half the sampling rate can be accurately determined
PSOLA          Pitch-Synchronous Overlap and Add
Sinusoid       A quantity (usually a signal level) that varies according to a trigonometric function of another quantity (usually time)
SOLA           Synchronous Overlap and Add

2 Summary
This report describes a project to design a system that modifies the pitch of a vocal
sound supplied to it as input. The system is intended to send out the vocal sound, with its
pitch modified, as output in real time. The system is also required to be implemented
without any additional monetary costs.

The design process of the system involved the analysis of several approaches to satisfying
the requirements and the selection of useful features from each. The techniques chosen for
the final design were a time stretching technique, which changes the length of a sound
without changing its pitch, and a frequency scaling technique, which changes both the
pitch and the length of a sound. Using these two techniques in particular combinations
allows the system to efficiently change the pitch of a sound. A prototype that works on
recorded voice was developed for the analysis of the performance of the system.

Due to constraints of software incompatibility and limited time, the implementation could
not be completed at the time this report was written.

3 Introduction

3.1 Objective of the report

The objective of this report is to describe the analysis performed to design a software tool
that modifies the pitch of a voice contained in an input sound.

3.2 Background [1]

All sound can be represented by a time-varying analog electrical signal. When converted
to a digital format, this time-varying signal can be represented by a series of numbers that
correspond to the level of the signal at specific points in time determined by the sampling
rate.

Once a sound is represented in digital form, mathematical operations can be easily done
on the representation so that if the sound were converted back into analog form and
outputted through a speaker, it would sound different from the original version.

Digital filters and other digital operators that operate on digital signals can bring about
these transformations. The pitch of a sound (vocal or otherwise) is related to the
frequencies of different sound components that are contained in the sound. Therefore, for
modifications on frequencies, it is useful to convert the digital signal into a frequency
representation where the signal is broken down into components with constant
frequencies.

3.2.1 Frequency representation

Almost all periodic sounds can be represented by a set of sinusoids with constant
frequencies and phases per component. Therefore, all sounds can be represented by a
series of numbers that correspond to the amplitudes and phases of the different frequency
components of the sound. The amplitudes and phases can be plotted against the frequency
to produce the graphical representations of magnitude and phase spectrums of the signal
respectively.

The mathematical operation known as Fourier Transform can be used to convert a signal
from a time domain representation to a frequency domain representation. A Discrete
Fourier Transform (DFT) is the corresponding operation for a digital signal. A fast
algorithm known as the Fast Fourier Transform (FFT) is commonly used for real time
processing of digital signals when the frequency components are required for processing.
An Inverse Discrete Fourier Transform (IDFT) is used to convert the signal back to the
time domain representation. The fast algorithm for this is known as the Inverse Fast
Fourier Transform (IFFT).

Figure 1 - Definitions of FFT and IFFT from MATLAB documentation
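The definitions given there are, paraphrasing the MATLAB documentation for a length-N sequence x:

    X(k) = sum over j = 1..N of x(j) * w^((j-1)(k-1))          (FFT)
    x(j) = (1/N) * sum over k = 1..N of X(k) * w^(-(j-1)(k-1))  (IFFT)

where w = e^(-2*pi*i/N).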

Due to the Nyquist limit, only frequencies up to half of the sampling rate can be
accurately represented in digital form. Therefore, any spectrum that is analyzed will be
accurate only up to half of the sampling rate. For this reason, when processing sound, the
sampling rate ideally needs to be at least twice the upper limit of the range of human
hearing.

In these representations:

x(t) stands for a signal in the continuous time domain.
x(n) stands for a signal in the discrete time domain.
X(jω) stands for the Fourier Transform of the signal x(t).
X(k) stands for the Discrete Fourier Transform of the signal x(n).

3.2.2 Window functions

A window function is a function that is nonzero over a finite interval and tapers smoothly
to zero at the ends of that interval.

When a sound is processed in blocks, discontinuities near the ends of the blocks can lead
to incorrect results for the frequency content of the sound. This is because a large number
of high frequencies are required to represent a sharp rise or fall of a waveform. For this
reason, an analysis block of sound is usually weighted by a window function, which
gradually reduces the amplitudes of the sound near the ends of the analysis block.
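As a minimal sketch (assuming block is a column vector of samples), an analysis block
can be weighted with the Hann window computed by hanningz.m in Appendix A before
its spectrum is taken:

% Minimal sketch: weighting an analysis block with a Hann window before
% taking its FFT. The window expression matches hanningz.m in Appendix A.
N = 2048;                          % analysis block length (SOLA default)
w = .5*(1 - cos(2*pi*(0:N-1)'/N)); % Hann window as a column vector
spectrum = fft(block(1:N).*w, N);  % windowed block has gentler edges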

3.2.3 Nature of vocal sounds

The sound for a voice is generated by the voice box at different pitches. When this sound,
known as glottal pulses, passes through the vocal tract, some frequencies are amplified
and some are attenuated. This is similar to a filtering operation. Due to this filtering
effect, the voice has particular frequency spectrums that are easily identified by the
human ear. The peaks in a graphical representation of these spectrums are known as
formants. For the purpose of processing vocal sounds, knowledge of these formants is
very useful, as it allows for modeling and modification of different aspects of the voice.

3.2.4 LPC

LPC (Linear Predictive Coding) is a method of approximating the filter represented by
the vocal tract by analyzing an audio signal containing voice. This filter can be used to
determine an inverse filter which, when applied to a speech sample, separates out an
approximation of the glottal pulses of the voice. This approximation is known as the
excitation signal.

The approximated filter, known as the spectral envelope, is an all-pole filter that stores
information about which frequencies have high amplitudes in the spectrum. These
frequencies are known as the resonance frequencies of the system. When this spectral
envelope is applied to an excitation signal, only the frequencies contained in the original
speech are amplified, resulting in a speech-like sound. If the excitation used for
processing was extracted from the original sound, then the original sound is reproduced
when the spectral envelope is applied to it.

LPC is often used to separate these two components of voice for the purpose of
transmission or modification.
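A minimal sketch of this separation, using MATLAB's built-in lpc function instead of the
calc_lpc helper listed in Appendix B, and assuming x is a windowed column of speech
samples:

% Minimal sketch of LPC separation; x is an assumed block of speech
% samples and p is an assumed prediction order.
p = 20;                  % prediction order
a = lpc(x, p);           % all-pole coefficients of the spectral envelope
e = filter(a, 1, x);     % inverse filtering yields the excitation e(n)
x_rec = filter(1, a, e); % re-applying the envelope reproduces the block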

[Block diagram: the input x(n) is filtered by H1(z), derived from the spectral envelope
obtained by LPC, to produce the excitation e(n).]

Figure 2 - Block Diagram for LPC Processing

3.2.5 Work done on voice processing

There has been extensive research and analytical work done on sound processing.
Numerous software and hardware solutions currently exist that address the problem that I
have tried to solve in this project. A lot of resources are available for research and
analysis of different ways of getting the intended result from a digital signal processing
system.

Examples of commercially available products that modify the pitch of an inputted voice
can be found at the following websites:
1. http://www.screamingbee.com/
2. http://www.tc-helicon.com/

3.3 Motivation and overview

I have undertaken this project to provide myself with a non-commercial solution for
voice modification as well as to satisfy my interest in learning more about voice
modification techniques.

In this project, I have analyzed several techniques of processing a digital sound signal
with respect to pitch modification to design a simple way to modify the pitch of a vocal
sound. To demonstrate the design I have created a prototype that works on sound files
saved in wave file format. I have used source code that came with the book on digital
audio effects [1]. I have developed the prototype on a personal computer running
Windows XP operating system and intend to implement the system on the same machine.

In a previous course, I have done a project on modification of voices using LPC (Linear
Predictive Coding). I have used part of the analysis from that project to determine the
solution to be used for this project. The report for the LPC project is given in Appendix
B.

3.4 Outline

Section 4 of this report discusses the results of the research and analysis done to determine
an appropriate design for software that can modify the pitch of a voice, with potential for
real-time processing.

Section 4 starts with a description of the requirements, followed by an overview of the
possible solutions that were analyzed. This is followed by detailed descriptions of the
analysis done on potential solutions. The section ends with a detailed description of the
analysis done on the chosen solution, with reference to a prototype developed in
MATLAB.

This is followed by two sections on conclusion and recommendation and the report ends
with the Appendices.

4 Discussion

4.1 Description of the requirements

The end result of this project is required to be the design of a system that can modify the
pitch of a voice in real time with no monetary cost associated with the implementation of
the system. The duration and speech content of the output voice must remain the same as
the duration and speech content of the input voice. Only the pitch of the voice is required
to be changed.

The requirement for the solution to be free of monetary costs restricts the usage of any
hardware or any commercially available software as the solution. For this reason, the
most logical solution would be a software-based solution that uses open source software
and programming libraries.

4.2 Overview of potential solutions

As mentioned in Section 3.2.5, there are numerous techniques that can be used to address
the requirements of this project. I have looked at several of the simple ones in order to
determine the correct combination of techniques that would be most effective to satisfy
the requirements.

4.2.1 Factors affecting the choice of solution

The solution that is selected is required to:

• Have a fast running time.
• Produce the least amount of artifacts.
• Use techniques that do not take longer than two weeks to learn, since there is only
  one person working on the project.
• Have no monetary costs for the implementation.

4.2.2 Summary of solutions attempted

The first solution attempted was a shifting of frequency done after conversion of the
audio into the frequency domain. It did not produce the intended effect and further
investigation revealed that direct shifting of frequency does not produce a proper pitch
change for all types of sound.

The second method attempted was a scaling of the frequencies in the spectrum followed
by a conversion back into the time domain while maintaining the original length. This
method changed the pitch of the sound, but introduced significant artifacts in the form of
repetition or loss of speech content.

The third method made use of the LPC work done in the previous project (Appendix B).
The same type of frequency scaling as in the second method was applied to the excitation
part of the signal, which corresponds to the glottal pulses of the voice. This did not
produce the correct result, since the formants of the signal could not easily be changed to
match the pitch change.

The fourth method attempted was a time domain stretching and a re-sampling of the
signal. The time domain stretching would change the length of the audio without
changing its pitch. The re-sampling step would change the length of the audio in the
opposite direction to bring it back to its original length while changing the pitch at the
same time. This produced acceptable results with minor artifacts.

4.3 Attempted approaches

The different techniques that were attempted were analyzed using the MATLAB tool.
This tool was also used for the purpose of building the prototype.

The following subsections describe the different methods that were attempted in order to
satisfy the requirements. Explanations for the limitations and problems of the
unsuccessful attempts are given.

4.3.1 Shifting in frequency domain

The most intuitive way to change the pitch of the sound seemed to be moving the
frequency amplitudes to a new position according to the amount of pitch shifting to be
done.

In this method, the entire signal was converted to the frequency domain using the FFT
algorithm. Analysis of the magnitude spectrum displayed the peaks of the frequencies in
the audio signal. The spectrum was stored in a one-dimensional vector, which was
manipulated by an algorithm that shifted it by the specified pitch shifting amount.

To increase the pitch, the spectrum was shifted to the right. To decrease the pitch the
spectrum was shifted to the left.

4.3.1.1 Problems

This method produced incorrect results. As shown in the relationship below, a shift in
the frequency domain is equivalent to multiplying the time domain signal by a complex
exponential. This is frequency modulation rather than an increase in pitch.

    e^(jω₀t) x(t) ↔ X(j(ω − ω₀)) [2]

This clearly indicates that frequency shifting is not the correct method for pitch shifting.
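This property can be checked numerically with a short MATLAB experiment (a
hypothetical sketch, not part of the prototype): shifting every FFT bin of a tone by k
positions and inverting reproduces the tone multiplied by a complex exponential, i.e.
modulation rather than pitch scaling.

% Hypothetical check: a k-bin shift in the frequency domain equals
% multiplication by exp(j*2*pi*k*n/N) in the time domain.
N = 1024; n = (0:N-1)';
x = cos(2*pi*40*n/N);            % a pure tone
k = 10;                          % number of bins to shift by
y = ifft(circshift(fft(x), k));  % frequency-domain shift
err = max(abs(y - x.*exp(1j*2*pi*k*n/N)))  % ~0 up to rounding error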

4.3.2 Frequency scaling in frequency domain and re-sampling

After further reading on DFT, I determined the correct changes to the frequency domain
that would produce the effect of pitch shifting.

When an audio segment is compressed to a smaller length or expanded to a longer length,
the pitch of the sound is changed. This is equivalent to playing an audio tape at a faster or
slower rate than it was recorded at.

Changing the length of a sound is a time scaling operation in digital signal processing
terms. The corresponding DFT can be obtained by scaling the frequency representation in
the opposite direction.

    x(at) ↔ (1/|a|) X(jω/a) [2]

This showed that if the spectrum of the sound was scaled up for high pitch and scaled
down for low pitch, then the corresponding pitch change would occur in the time domain
when the spectrum is converted back into the time domain.

MATLAB was used to create functions that convert an input sound into its frequency
domain representation and then scale it according to the shifting factor.

The magnitude spectrums for a sound and its transformed version shifted with a shift
factor of 1.5 are shown in Figure 3.

Figure 3 - The magnitude spectrum of a sound shifted by a factor of 1.5

4.3.2.1 Problems

When the converted frequency spectrum was changed back into the time domain
representation, and forced to maintain the original length, the results produced two types
of artifacts:
• Repetitions for the high-pitched conversion
• Overlapped multiples for the low-pitched conversion.

These artifacts showed up because the frequency scaled spectrum was expected to have a
shorter or longer time domain representation. Forcing the length of the IFFT to a fixed
value made the algorithm create multiple copies of the sound to fill the space allocated to
it. There was no easy way to force the time domain representation to resize to the original
length using the IFFT algorithm.

4.3.3 LPC and scaling the frequency of the excitation

Using the work done in the LPC project (Appendix B), the excitation and spectral
envelope of the input sound were approximated. Since the excitation is the source of the
sound in this system, modification of the excitation’s pitch was expected to produce a
similar pitch modification in the reconstructed sound.

The excitation was frequency scaled the same way as it was done on the input sound in
the previous method. The output was constructed from the original spectral envelope and
the modified excitation.

4.3.3.1 Problems

This method produced artifacts similar to those of the previous method, plus a significant
number of additional artifacts, since the spectral envelope was not modified.

Further research revealed that modification of the spectral envelope was also necessary,
since the resonance frequencies are scaled as well by the pitch shifting. Modifying the
spectral envelope would have required a considerable amount of work and was not
attempted due to the significant amount of time that would have been spent learning the
process.

4.3.4 Time domain stretching and re-sampling

The method that proved to be quite successful is the stretching or compression of the
sound sample in the time domain to counteract the change of length that occurs due to
re-sampling of the sound.

Time stretching is done by taking segments of the sound and repeating or discarding
similar segments for the periodic parts of the sound in order to change the length of the
sound without any pitch modification. Two methods for time stretching were looked at
for this project.

Since time stretching changes only the length of the sound and not its pitch, combining it
with the FFT and IFFT algorithms, which change both the length and the pitch of the
sound, allowed the design of an algorithm that changes the pitch of a sound without
changing its duration. The change in length during re-sampling (IFFT with a different
sampling rate) was compensated by the change in length during time stretching.
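A minimal sketch of the combined operation using the prototype functions from
Appendix A (the choice of which call runs first is refined in Section 4.4.5):

% Minimal sketch for a shift factor r: stretching to r times the length
% and then resampling by r restores the original length while scaling
% the pitch by r. x is an assumed input signal sampled at FS.
r = 1.5;                                      % pitch shift factor
y = SOLA_Time_Stretch(x, FS, r);              % length r*n, pitch unchanged
y = properpitchShiftVariableLength(y, FS, r); % length back to ~n, pitch * r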

4.3.4.1 Methods for stretching in time domain

The two common techniques for time stretching are:

• Synchronous Overlap and Add (SOLA)
• Pitch-Synchronous Overlap and Add (PSOLA)

Both of these techniques work in the time domain of the signal. The algorithms for both
methods break the sound into segments that are overlapped according to the similarities
between adjacent segments and added or discarded according to the amount of stretching
that is to be done.

In SOLA, the similarities between segments are calculated by a cross-correlation
function. The point of greatest similarity is used to cross fade the two segments and put
them in the buffer of the output sound. This method may produce artifacts if the transient
parts of the speech (usually the consonants) are too rapid and the block used for analysis
contains several of these transient parts. In a situation like this, the time stretched version
of the speech may have some echoes of consonants that are originally supposed to be
pronounced only once.

PSOLA is a variation of SOLA in which the local pitch of the sound is determined and
used to avoid pitch discontinuities when the segments are added. Determining the local
pitch is not a trivial problem to solve and requires considerable learning time and
computational complexity. Since this method keeps track of the local pitch, the artifacts
associated with rapid transient portions of speech are less likely to appear in the time
stretched sound.

4.3.4.2 Chosen method for time stretching

The method that was chosen for time stretching was SOLA due to its simplicity and the
small amount of work necessary for its implementation.

4.4 Design of the solution

The initial design of the solution had the input sound being processed by the time
stretching algorithm and then by the frequency scaling algorithm. The block diagram for
the system is shown in Figure 4.

[Block diagram: the input and shift factor feed the Time Stretch stage, whose output
feeds the Frequency Scale stage to produce the output.]

Figure 4 - Block diagram of initial design

4.4.1 Prototype

The prototype was created in MATLAB using modified sample code taken from DAFX –
Digital Audio Effects [1], and source code written by me. The initial design had:

• A main function that reads the input sound file and writes the processed sound to an
  output file (PitchShiftProgram.m)
• A function that determines the order of operations (FunctionDirectPitchShift.m)
• A function that performs time stretching on the input based on the shift factor
  (SOLA_Time_Stretch.m)
• A function that re-samples the sound by frequency scaling the spectrum
  (properpitchShiftVariableLength.m)

MATLAB functions do not work on input produced in real-time, but the overall
functionality of the algorithms of the prototype could be implemented using
programming languages that work on input in real-time. The purpose of the prototype
was to analyze the algorithms to find the optimal design that produced the output in the
least amount of processing time.

The source code of the prototype and other approaches are given in Appendix A.

4.4.2 Algorithm for time stretching [1]

The algorithm for time stretching is the SOLA algorithm described in DAFX – Digital
Audio Effects. The parameters of the algorithm allow for time stretching between factors
of 0.25 and 2.00 without any significant distortions.

The algorithm can be summarized as follows:

1. Segmentation of the input into blocks of length N with a time shift of Sa samples.
2. Repositioning of the segments with an adjusted time shift, Ss = shift factor * Sa.
3. Computation of the cross-correlation between adjacent blocks.
4. Determination of the index, k, of the maximum value of the cross-correlation.
   This indicates the point of maximum similarity.
5. Use of the index k to determine the fading functions that fade out one segment and
   fade in the other segment.
6. Overlapping and adding of the faded segments.

In the MATLAB documentation [3], the cross-correlation of two sequences x and y is
defined as an estimate of

    R_xy(m) = E{ x(n+m) · y*(n) }

It gives a measure of the average or expected value of the product of corresponding
samples for various overlaps between the two segments. The overlap index, m, indicates
the point at which the overlap takes place.

The running time of this algorithm is O(n) where n is the length of the input.
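Steps 3-5 reduce to the following simplified sketch, assuming grain and overlap_tail are
the two adjacent segments and L is the overlap interval (the full loop, with its
variable-length fades, is in SOLA_Time_Stretch.m, Appendix A):

% Sketch of steps 3-5: locate the most similar overlap point, then
% cross-fade. grain, overlap_tail and L are assumed to be given.
c = xcorr(grain(1:L), overlap_tail(1:L)); % similarity at every lag
[xmax, k] = max(c);                       % k marks maximum similarity
fadeout = linspace(1, 0, L)';             % simple linear fades (assumed)
fadein  = linspace(0, 1, L)';
merged  = overlap_tail(1:L).*fadeout + grain(1:L).*fadein;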

4.4.3 Algorithm for re-sampling

The frequency scaling of the spectrum is done with the following algorithm (a condensed
sketch follows below):
1. The FFT of the input is calculated to give the input spectrum.
2. For every frequency position in the modified spectrum, the corresponding
   frequency position in the input spectrum is determined.
3. The value at that frequency position in the input spectrum is copied to the output
   spectrum.
4. All unnecessary frequencies are discarded.

This algorithm runs in O(n) time, where n is the length of the input.
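Condensed to its core, and leaving out the aliasing cleanup that the full
properpitchShiftVariableLength function in Appendix A performs, the remapping for an
upward shift by a factor r > 1 looks like this:

% Condensed sketch of the bin remapping for an upward shift (r > 1);
% the full function in Appendix A also handles r < 1 and aliasing.
X = fft(x, N);                % input spectrum (x and N assumed given)
Y = zeros(N, 1);              % frequency scaled output spectrum
for m = 1:N
    src = max(floor(m/r), 1); % corresponding input bin
    Y(m) = X(src)/r;          % copy the scaled value into the output
end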

4.4.4 Performance of the algorithms

Since this is a linear time invariant system, the order of operations does not have an effect
on the result. However, the running time varied with the order of time stretching and
frequency scaling done on the input. The running time of the initial prototype also varied
significantly with the shift factor.

This is due to the dependence of the length of the intermediate output on the shift factor.

4.4.4.1 Dependence of intermediate output on shift factor

The intermediate output is the output of one of the two algorithms which is used as the
input to the other algorithm.

If the time stretching is done first, the length of the intermediate output is given by:

L = shift factor * n, where n is the length of input

If frequency scaling is done first, the length of the intermediate output is given by:

L = n / shift factor

Therefore, the length of the input sent into the second algorithm varied with the shift
factor and with whichever algorithm was run first. As shown by the above equations, the
length of the output of the frequency scaling algorithm is inversely proportional to the
shift factor, while the length of the output of the time stretching algorithm is directly
proportional to it.
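As a worked example, with an assumed input of 100,000 samples:

% Worked example of the intermediate lengths (assumed n = 100000):
n = 100000; r = 1.5;
L_stretch_first = r*n  % = 150000 samples fed into the resampler
L_scale_first   = n/r  % ~= 66667 samples fed into the time stretcher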

4.4.4.2 Running time comparison

Due to the output of one algorithm being used as the input to the other algorithm, the
increase in length makes one order of operation less efficient than the other order
depending on the shift factor.

The running times of the time stretching algorithm were compared for the two orders of
operation. Three values of the running time were measured for each order of operation
over a range of values of the shift factor. An average was computed for each set of three
values, and a graph was plotted showing how the running times varied with the shift
factor.

             Time Stretch, Resampling           Resampling, Time Stretch
shiftOffset  A       B       C       Average    A       B       C       Average
0.25         1.734   1.484   1.469   1.562      26.391  26.468  26.469  26.443
0.50         3.312   3.297   3.390   3.333      13.109  13.141  13.157  13.136
0.75         5.078   5.063   5.063   5.068      8.953   8.922   8.890   8.922
1.00         6.703   6.657   6.750   6.703      6.750   6.656   6.812   6.739
1.25         8.422   8.359   8.610   8.464      5.469   5.375   5.422   5.422
1.50         9.985   9.922   9.844   9.917      4.500   4.453   4.516   4.490
1.75         11.359  11.515  11.594  11.489     3.891   3.797   3.906   3.865
2.00         13.015  12.844  12.875  12.911     3.406   3.500   3.391   3.432

Table 1 - Running times (in seconds) of time stretching for different orders of operations

[Figure: running time versus shift factor (0 to 2.5) for the two orders of operation,
"Time Stretch, Resampling" and "Resampling, Time Stretch"; the two curves cross near a
shift factor of 1.]

Figure 5 - Graph of running time of time stretching algorithm

Since the running time of each algorithm is directly proportional to input length, the
graphs indicate the expected relationship between L and the shift factor.

4.4.5 Optimizations

Using the results of the analysis done on the running times, the function that determines
the order of operations was optimized to select the order based on the value of the shift
factor. This optimization minimizes the running time of the time stretching algorithm.
The running time of the frequency scaling algorithm was found to be quite small
compared to that of the time stretching algorithm, so it was assumed that the time
stretching algorithm dominates the running time of the overall system.
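The selection logic, as implemented in FunctionDirectPitchShift.m (Appendix A), reduces
to always handing the time stretching algorithm the shorter of the two possible inputs:

% Order selection as implemented in FunctionDirectPitchShift.m:
if shiftOffset <= 1
    y = SOLA_Time_Stretch(x, FS, shiftOffset);              % stretch first
    y = properpitchShiftVariableLength(y, FS, shiftOffset); % then scale
else
    y = properpitchShiftVariableLength(x, FS, shiftOffset); % scale first
    y = SOLA_Time_Stretch(y, FS, shiftOffset);              % then stretch
end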

The diagram of the optimized system is given in Figure 6.

[Flowchart: if the shift factor is <= 1, the input goes through Time Stretch and then
Frequency Scale; otherwise it goes through Frequency Scale and then Time Stretch. Both
paths produce the output.]

Figure 6 - Diagram for optimized system

4.4.6 Implementation

The implementation of this system is required to be done using a high-performance
programming language. A good choice would be C++. A C++ based open source software
framework called MARSYAS [4] (Music Analysis, Retrieval and Synthesis for Audio
Signals) was chosen for the implementation of the system for real-time pitch shifting of a
live input.

Following are the reasons for choosing MARSYAS:

• It is fast, since it is based on C++.
• It is open source and therefore satisfies the no-monetary-cost requirement.
• It is designed specifically to work with digital audio signals.
• It has data structures that effectively process data in real time.
• It is compatible with MATLAB, which enables reuse of code from the prototype.

I had trouble running MARSYAS on the machine that I chose for the implementation.
Due to these incompatibility issues, the implementation could not be completed in time
for this report. Further optimization can be done on the design with analysis of the
implementation environment. This will be possible once MARSYAS is updated to work
with the recent changes to the Windows operating system and the C++ development
framework.

5 Conclusion
Implementation of this simple system will enable the pitch shifting of a voiced input. The
operations of the algorithms have been kept as simple as possible to enable a C++
implementation to efficiently process inputs in real-time.

6 Recommendation
Further analysis on the time stretching algorithm can potentially improve its performance
if the parameters of the algorithm are allowed to change according to the input.

The use of LPC analysis can be extended further to allow the use of the PSOLA algorithm
instead of SOLA, if the local pitches of the signal can be determined from the excitation.

Once the system is implemented, the length of input processed at a given time can be
analyzed and adjusted to further optimize the running times of the algorithms.

7 List of References
Cited References:

[1] Udo Zölzer, DAFX – Digital Audio Effects, England: John Wiley & Sons Ltd., 2002

[2] A. Antoniou, Digital Signal Processing, New York: The McGraw-Hill Companies
Inc., 2006

[3] MATLAB Help Documentation, version 6.5.1, The MathWorks

[4] G. Tzanetakis, MARSYAS documentation, University of Victoria

Appendix A

% PitchShiftProgram.m
% This program reads a single track audio file and shifts its pitch by
% the value specified by shiftOffset.
% The length of the audio file is preserved.

clear;

%----- USER DATA -----
[DAFx_in,FS] = wavread('Toms_diner.wav'); % Input sound
shiftOffset = 0.25; % Less than 1 represents low pitch. More than 1 represents high pitch

%----- Shifting the input -----
DAFx_out = FunctionDirectPitchShift(DAFx_in, FS, shiftOffset);
%plotSpectrums(fft(DAFx_in.*hanningz(length(DAFx_in)), length(DAFx_in)), ...
%    fft(DAFx_out.*hanningz(length(DAFx_out)), length(DAFx_out)))

%----- Output -----
%soundsc(DAFx_out, FS)
%DAFx_out_norm = .99*DAFx_out/max(abs(DAFx_out)); % scale for wav output
%wavwrite(DAFx_out_norm, FS, 'DirectPitchShifted_With_Time_Stretch')
%4.5 (1.5) 8.92 (0.75) 13.45 (0.5)
%wavwrite(DAFx_out_norm, FS, 'DirectTimeStretched_With_Pitch_Shift')
%9.8 (1.5) 8.85 (0.75) 3.46 (0.5)

function DAFx_out = FunctionDirectPitchShift(DAFx_in, FS, shiftOffset);

%----- USER DATA -----
time_stretch_factor = shiftOffset;

%----- Shifting the input -----
DAFx_out = DAFx_in;

% The order of operation is chosen to be the one that minimizes the
% computation time of the time stretching algorithm.
if (shiftOffset <= 1)
    % Time stretch first and then scale frequency, since the output of
    % frequency scaling is longer than the original sound length
    DAFx_out = SOLA_Time_Stretch(DAFx_in, FS, time_stretch_factor);
    DAFx_out = properpitchShiftVariableLength(DAFx_out, FS, shiftOffset);
else
    % Scale frequency first and then time stretch, since the output of
    % frequency scaling is shorter than the original sound length
    DAFx_out = properpitchShiftVariableLength(DAFx_in, FS, shiftOffset);
    DAFx_out = SOLA_Time_Stretch(DAFx_out, FS, time_stretch_factor);
end

% SOLA_Time_Stretch.m
% Modified from TimeScaleSOLA.m
% Time Scaling with Synchronized Overlap and Add
%
% Parameters:
%   analysis hop size    Sa = 256 (default parameter)
%   block length         N = 2048 (default parameter)
%   time scaling factor  0.25 <= alpha <= 2
%   overlap interval     L = 256*alpha/2

function Overlap = SOLA_Time_Stretch(DAFx_in, FS, alpha)

Original_length = length(DAFx_in);
Modified_length = ceil(Original_length * alpha);
DAFx_in = DAFx_in';
Sa = 256;
N = 2048;
if Sa > N
    disp('Sa must be less than N !!!')
end

if (N > length(DAFx_in))
    Overlap = DAFx_in;
    return;
end

M = ceil(length(DAFx_in)/Sa);

% Segmentation into blocks of length N every Sa samples
% leads to M segments

Ss = round(Sa*alpha);
L = 256*alpha/2;

if Ss >= N
    disp('alpha is not correct, Ss is >= N')
elseif Ss > N-L
    disp('alpha is not correct, Ss is > N-L')
end

DAFx_in(M*Sa+N) = 0;
Overlap = DAFx_in(1:N);

% **** Main TimeScaleSOLA loop ****
tic
for ni=1:M-1
    grain = DAFx_in(ni*Sa+1:N+ni*Sa);
    XCORRsegment = xcorr(grain(1:L), Overlap(1,ni*Ss:ni*Ss+(L-1)));
    [xmax(1,ni), index(1,ni)] = max(XCORRsegment);
    fadeout = 1:(-1/(length(Overlap)-(ni*Ss-(L-1)+index(1,ni)-1))):0;
    fadein = 0:(1/(length(Overlap)-(ni*Ss-(L-1)+index(1,ni)-1))):1;
    Tail = Overlap(1,(ni*Ss-(L-1))+index(1,ni)-1:length(Overlap)).*fadeout;
    Begin = grain(1:length(fadein)).*fadein;
    Add = Tail+Begin;
    Overlap = [Overlap(1,1:ni*Ss-L+index(1,ni)-1) Add grain(length(fadein)+1:N)];
end;
toc
% **** end TimeScaleSOLA loop ****

Overlap = Overlap';
Overlap = Overlap(1:Modified_length);

function shiftedSamples = properpitchShiftVariableLength(samples, FS, shiftRatio)

N = length(samples);
Nfft = N;
W = hanningz(N);

shiftOffset = shiftRatio;
shiftedN = round(Nfft/shiftOffset); % rounded so the ifft length is an integer

samplesW = samples.*W;

spectrum = fft(samplesW, Nfft);

shiftedSpectrum = zeros(Nfft,1);

if (shiftOffset > 1)
    for n = Nfft:-1:1
        shifted_n = floor(n/shiftOffset);
        if (shifted_n < 1)
            shifted_n = 1;
        end
        shiftedSpectrum(n) = (1/shiftOffset)*spectrum(shifted_n);
    end
elseif (shiftOffset < 1)
    for n = Nfft:-1:1
        shifted_n = ceil(n/shiftOffset);
        if (shifted_n > (Nfft/2))
            shiftedSpectrum(n) = 0;
        else
            shiftedSpectrum(n) = (1/shiftOffset)*spectrum(shifted_n);
        end
    end
else
    shiftedSpectrum = spectrum;
end

% Adjust the original spectrum for aliasing
for i = 1:(Nfft/2)
    spectrum(Nfft-i+1) = 0; % zero the mirrored half to remove aliasing
end

% The length change in the ifft performs the re-sampling
shiftedSamples = real(ifft(spectrum, shiftedN));
Appendix B

Voice Modification/Morphing using LPC
M. Sajedur Rahman
Undergraduate Electrical and Computer Engineering
University of Victoria

1 Abstract

This report provides an overall description of a project to build software that will use Linear Predictive Coding (LPC) to modify/morph voices. The software can potentially be used for modification of audio files and ideally for modifying voices in real time. The report refers to several papers on this topic and summarizes the different aspects of LPC and speech analysis and synthesis that have been explored in those papers. Additionally, an ideal target system of the project is described along with a timeline and a realistic scope of the project. Finally, the attempt at creating a prototype is described along with the shortcomings and successes of the attempt.

2 Introduction/Motivation

Voice morphing is the changing of the speech output of a certain voice in some way to make it sound like the same speech output of another voice. This effect can be used in online roleplaying games, where the player's voice output is morphed to match the character's voice output to account for any expected differences in the two voices. This effect can also be useful in the situation where an animation is to be created with more characters than available distinct voice actors/sources.

The objective of this project is to create voice-morphing software to achieve as many morphing effects relevant to voices as possible. Ideally, these effects can be done both in real-time and as batch processing.

Following are some examples of commercially available software that deals with voice morphing:

MorphVOX - http://www.screamingbee.com/product/products.aspx
VoiceModeler - http://www.tc-helicon.com/VoiceModeler

3 Background/Related Work

3.1 Related papers

The following papers were identified to be related to the project topic. The papers are summarized below:

J. Makhoul. Linear Prediction: A tutorial review. Proceedings of the IEEE 64 [2]

This paper discusses linear prediction as it is used in the analysis of discrete signals. The paper starts with comments on several applications of linear prediction. This is followed by a description of time domain analysis of linear prediction. The models used are described, and the methods for determining the characteristics of the models are explained with regards to several types of inputs under appropriate assumptions. This is followed by a similar analysis of the spectral domain. Analysis of the error is done throughout, and methods of minimizing the error are provided. The paper is concluded with a discussion on data compression that can be done using linear prediction on discrete signals.

The last part of the paper may not be relevant to this project.

P. Lansky and K. Steiglitz. Synthesis of timbral families by warped linear prediction. Computer Music Journal [5]

This paper is about the synthesis of a musical piece using Linear Prediction analysis and synthesis using a warped filter. It addresses a problem that is faced while digitally synthesizing music: how to generate sounds that are recognizable as belonging to different families of musical instruments while allowing instruments in the same family to be distinguishable from each other as well. The method described uses variable recursive digital filters for synthesis and linear prediction for analysis of the source sound.
J. A. Moorer. The use of linear prediction of speech in computer music applications. J. Audio Engineering Society [6]

This paper talks about the use of linear prediction algorithms for speech analysis and synthesis in computer music applications. Starting with an introduction to linear predictive coding, an overview of the analysis and synthesis process is presented. Each step in the process is expanded with focus on the decisions and precautions that are needed to preserve the quality of the output sound. A bit of cross synthesis is explained with regards to computer music applications. Interpolation and Amplitude Control techniques are discussed as supporting information for quality of synthesis.

P. Cook. Toward the Perfect Audio Morph? Singing Voice Synthesis and Processing. Int. Workshop on Digital Audio Effects [7]

This paper reviews the popular methods used for synthesizing voice, with focus on the singing voice. It discusses the strengths and weaknesses of each method. The models discussed are spectral subband vocoders, linear predictive coding, frequency modulation, formant wave functions, formant filter models, sinusoidal models and acoustic tube models. A discussion of the categories of these methods follows, along with comments on the general advantages and disadvantages of each category. Some information on the way voice is perceived by humans is provided briefly. The paper concludes with some discussion on categorical perception of sound and sound morphing.

3.2 Additional papers

The additional papers mentioned below may contain useful information with regards to the project:

B. S. Atal and Suzanne L. Hanauer, Speech Analysis and Synthesis by Linear Prediction of the Speech Wave, The Journal of the Acoustical Society of America. [3]

This paper describes a linear prediction method of analyzing and synthesizing speech. It may contain information that has not been covered in the other papers.

D. G. Childers and H. T. Hu, Speech synthesis by glottal excited linear prediction, The Journal of the Acoustical Society of America, Volume 96. [4]

This paper discusses the modeling of the glottal excitation using linear prediction. It talks about how it is used further to synthesize speech.

4 Ideal target system

With infinite time and resources, the ideal target system can be built. The ideal target system would be an application with a graphical user interface. The high level functionalities of the application should be:

• Real-time morphing of voices.
• Live output and input capabilities.
• Allowing the output to be used directly in other audio applications that accept audio input.
• Ability to produce any type of natural voice with regards to different genders and ages.
• Ability to produce any type of 'unnatural' voice, such as perceptive voices of mythical creatures.
• Ability to do a transition from one output voice to another output voice dynamically.
• Ability to output the audio to a file.
• Ability to accept an audio file as an input.

The GUI should let the user:

• Customize the output voice by modifying different parameters.
• Save a list of presets for different voices and select them.
• Use audio files as input and output audio for the software.

5 Timeline

The project time period is July 15th, 2007 – August 12th, 2007. The ideal system cannot be implemented within the project timeline. The scope of the project will have to be limited. The scope can be analyzed in three scenarios.

5.1 Best-case scenario

A C/C++ implementation of the application is created with some sort of user interface to allow some morphing, such as male to female voice transformation. The application should be able to accept an audio file and output an audio file.

5.2 Likely-case scenario

A working prototype is created in MATLAB and a partial implementation is created in C/C++ with the same functionality. A GUI does not exist.
The functionalities implemented may be some basic morphing, such as pitch shifting.

5.3 Worst-case scenario

A prototype is created in MATLAB that shows what was attempted and which aspects failed. A plan of the proposed design of the techniques to be used is created. The prototype will have room for expansion and, given more time, can be improved to attempt the techniques mentioned in the design.

6 Data collection/Available Software

The data to be used for testing the application will be sound files containing short speech recordings. Some of the sample files from the course webpage can be used if they contain speech.

The sample code on LPC from the textbook will be used as reference for the prototype code.

The information from the following website may be helpful as well:
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

7 Overall Design of the system

The basic parts of the system can be classified into four stages:
1. Analysis of input
2. Separation of spectral envelope and excitation signal
3. Transformation of envelope and/or excitation signal
4. Synthesis of output

In the ideal system, these stages will be run for blocks of the input sound at a time to produce output sounds in real time. They may also be run in parallel to process multiple input sounds or to process the same input sound in multiple ways.

A diagram of these stages can be found on page 300 of [1]. The stages are described in more detail in the following subsections.

7.1 Analysis

The analysis of the input signal consists of determining some of the basic properties of the input signal, such as length, sampling rate, length of overlap (if two input sounds are used), etc.

Depending on the complexity of the transformation being performed, further information may be required to be noted at this stage.

7.2 Envelope-Excitation separation

This is where the principles of LPC are applied. The spectral envelope of the input signal is approximated by an all-pole filter, and the derived filter H1(z) is used to filter the input signal to produce the excitation signal. The exact filter used varies with different implementations of LPC.

The input signal is windowed and an approximation of the spectral envelope is determined up to a certain specified order. Increasing the order increases the accuracy of the approximation at the cost of computing time. The input should be processed in blocks for efficiency.

Often this stage can be interleaved with the transformation stage for more efficient use of resources.

[Block diagram: the input x(n) is filtered by H1(z), derived from the spectral envelope obtained by LPC, to produce the excitation e(n).]

Figure 1 - Separation of envelope and excitation

7.3 Transformation

The transformation stage involves modification of what was obtained from the previous stage. The change is done to either the spectral envelope or the excitation or both, depending on the transformation in question.

In the ideal system, the GUI will allow the user to choose the transformation to be done on the input sound.

Some examples of transformations are:
1. Pitch shifting of the voice – Changes are done to the spectral shape and the excitation to make it seem like the output voice has a higher or a lower pitch.
2. Formant modification – The spectral shape is modified corresponding to different formants.
3. Cross synthesis – The excitation of one sound source is used with the spectral envelope of another sound source. If two voices are used as input sounds, this transformation will make it seem like one person is saying what the other person originally said.
4. Synthesized excitation – The original excitation of the input is replaced by another excitation synthesized by the system.

7.4 Synthesis

The synthesis stage combines the excitation and the envelope from the previous stage into the output signal. The filter H2(z) is determined from the envelope similar to how it is done in the separation stage. Again, the exact filter used varies with different implementations of LPC.

If the excitation and envelope of the input signal determined in the separation stage are used as input in this stage, then the exact original input signal would be recovered, unless there had been some significant loss of precision in the previous stages.

[Block diagram: the transformed excitation e(n) is filtered by H2(z), derived from the transformed envelope, to produce the output y(n).]

Figure 2 – Combination of envelope and excitation

This stage can also be interleaved with the transformation stage so that each of the transformed blocks can be added to the output as the consecutive block is being processed in the transformation stage.

This can be done if the transformation operation is causal and does not require the entire signal to be processed before it can be applied. It is assumed that this system will not be processing those kinds of signals, as the final product is intended to be run in real time.

8 Implementation

Due to the short timeline of the project, the scope of the project was reduced to the worst-case scenario. A prototype was created in MATLAB using example code from the textbook [1]. The code used for the prototype is given in Appendix A.

8.1 Code summary

The original example code used for the prototype can be found in pages 308, 317-319 of [1]. They are saved in the following three M-files:
1. LPCCrossSynthesis.m
2. calc_lpc.m
3. hanningz.m

The following M-files were created as part of the prototype:
1. LPCCrossSynthesisVoices.m – This code accepts two input voices and uses the excitation of the first voice with the spectral envelope of the second voice to create the output voice.
2. LPCHighPitch.m – This code accepts an input voice and shifts the spectrum of its excitation by a specified frequency (in Hz) before re-synthesis. It leaves the spectral envelope unchanged.
   a. This code makes use of the helper function "simplepitchShift.m", which does the actual pitch shifting of the excitation spectrum.
3. LPCSineExcitation.m – This code accepts an input voice and replaces its excitation with a constant frequency sine wave.
4. LPCHarmonicExcitation.m – This code accepts an input voice and replaces its excitation with a harmonic series of specified fundamental frequency and specified number of harmonics.

8.2 Test data

The data used to experiment with the code was taken mostly from the textbook's [1] website. Some additional sound files were also used.

8.3 Effects of transformations
All the transformations produced artifacts, since the implementations are not refined enough. However, the intended transformations can still be heard in the output sounds.

The following summarizes the different effects obtained from each of the transformations.

8.3.1 Cross synthesis with two voices

This is implemented in LPCCrossSynthesisVoices.m.

The purpose of this was to attempt a simple voice swap. This was run with two input sounds, each of whose excitation was used with the other's envelope. For one combination, the output was clear with the initial orders chosen for the LPC in the example code. However, for another combination the output contained too many artifacts to discern the original words of the speech. Increasing the orders by approximately 5 times produced an improved output.

This can potentially be improved with pitch and vowel detection algorithms and restricting the output excitation to vowels only.

8.3.2 Pitch shifting of excitation

This is implemented in LPCHighPitch.m along with the helper function simplepitchShift.m.

This was done as an attempt at increasing the pitch of the voice. This was run with various amounts of pitch shift. The shift in pitch could generally be noticed; however, an actual shift of a voice's pitch requires further modifications, such as some formant shift, which was not implemented in this code. Therefore, the artifacts could not be avoided.

Implementation of the formant change can potentially improve the output further.

8.3.3 Sine excitation

This is implemented in LPCSineExcitation.m.

This was done to observe the effect of a particular frequency in the output. This was run with fairly accurate results. There was only one main frequency in the output, with artifacts that could be heard occasionally. Since the envelope covered more than one frequency, having only one in the excitation did not produce much content in the output.

Along with a pitch detection algorithm, this can potentially be modified to use the most common frequency in the input as the only excitation in the output.

8.3.4 Harmonic excitation

This is implemented in LPCHarmonicExcitation.m.

The purpose of this was to add more content to the output, since the sine excitation alone produced very little content. This contained, as a test excitation, the sum of a harmonic series of sine waves in the range of 100-1000 harmonics. The outputs were always a "robotic" voice with the words of the speech quite clear. This indicates that the inflexion content of the speech was mostly contained in the excitation.

9 Conclusion

Given more time, an implementation of a greater scope can be made. The experiments run with the code for this project can serve as good starting points for further work. Due to the compatibility between MATLAB and MARSYAS, if future implementations are done in MARSYAS, some of the MATLAB code mentioned above can be reused.

10 Bibliography

[1] Udo Zölzer, DAFX - Digital Audio Effects, John Wiley & Sons, 2002

[2] J. Makhoul. "Linear Prediction: A tutorial review", Proceedings of the IEEE 64(4):561-580, 1975

[3] B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", The Journal of the Acoustical Society of America, 50(2), 1971

[4] D. G. Childers and H. T. Hu, "Speech synthesis by glottal excited linear prediction", The Journal of the Acoustical Society of America, 96(4), 1994

[5] P. Lansky and K. Steiglitz. "Synthesis of timbral families by warped linear prediction", Computer Music Journal, 5(3):45-47, 1981
[6] J. A. Moorer. "The use of linear prediction of speech in computer music applications", Journal of the Audio Engineering Society, 27(3):134-140, 1979

[7] P. Cook. "Toward the Perfect Audio Morph? Singing Voice Synthesis and Processing", International Workshop on Digital Audio Effects (DAFX), 1998
Appendix (B) A – Code

Example Code from Textbook [1]

% M-file 9.7
% LPCCrossSynthesis.m
%
% cross-synthesis with LPC

clear;

%----- USER DATA -----


[DAFx_in1,FS] = wavread('moore_guitar.wav'); % sound 1: excitation
DAFx_in2 = wavread('Toms_diner.wav'); % sound 2: spectral env.
long = 400; % block length for calculation of coefficients
hopsize = 160; % hop size (is 160)
order = 20 % order of the LPC
order1 = 6 % order for the excitation

%----- initializations -----


ly = min(length(DAFx_in1), length(DAFx_in2));
DAFx_in1 = [zeros(order, 1); DAFx_in1; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in1));
DAFx_in2 = [zeros(order, 1); DAFx_in2; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in2));
DAFx_out = zeros(ly,1); % result sound
exc = zeros(ly,1); % excitation sound
w = hanningz(long); % window
N_frames = floor((ly-order-long)/hopsize); % number of frames

%----- Cross-synthesis -----


tic
for j=1:N_frames
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
[A1, g1] = calc_lpc(DAFx_in1(k+1:k+long).*w, order1);
% IMPORTANT function "lpc" does not give correct results for MATLAB 6 !!!

gain(j) = g; %
ae = - A(2:order+1); % LPC coeff. of excitation
for n=1:hopsize
excitation1 = (A1/g1) * DAFx_in1(k+n:-1:k+n-order1);
exc(k+n) = excitation1;
DAFx_out(k+n) = ae*DAFx_out(k+n-1:-1:k+n-order)+g*excitation1;
end
end
toc
%----- output -----
DAFx_out = DAFx_out(order+1:length(DAFx_out)) / max(abs(DAFx_out));
soundsc(DAFx_out, FS)
DAFx_out_norm = .99* DAFx_out/max(abs(DAFx_out)); % scale for wav output
wavwrite(DAFx_out_norm, FS, 'CrossLPC')

% M-file 9.3
% calc_lpc.m

function [a,g]=calc_lpc(x,p)
% calculate LPC coeffs via autocorrelation method
% Similar to MATLAB function "lpc"
% IMPORTANT: function "lpc" does not work correctly with MATLAB 6!
% x: input signal

% p: prediction order
% a: LPC coefficients
% g: gain factor
% (c) 2002 Florian Keiler

R=xcorr(x,p); % autocorrelation sequence R(k) with k=-p,..,p


R(1:p)=[]; % delete entries for k=-p,..,-1
if norm(R)~=0
a=levinson(R,p); % Levinson-Durbin recursion
% a=[1, -a_1, -a_2,..., -a_p]
else
a=[1, zeros(1,p)];
end
R=R(:)'; a=a(:)'; % row vectors
g=sqrt(sum(a.*R)); % gain factor

function w = hanningz(n)
%HANNINGZ(N) returns the N-point Hanning window in a column vector.

w = .5*(1 - cos(2*pi*(0:n-1)'/(n)));
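As a quick sanity check on calc_lpc (assuming the Signal Processing Toolbox is available), it should agree with MATLAB's built-in lpc on versions newer than MATLAB 6, since both compute the prediction-error filter [1, -a_1, ..., -a_p] by the autocorrelation method and a Levinson-Durbin recursion:

% Optional check: calc_lpc vs. the built-in lpc (same coefficient convention).
x = randn(400, 1);            % arbitrary test frame
[a, g] = calc_lpc(x, 20);
a_builtin = lpc(x, 20);       % built-in autocorrelation-method LPC
max(abs(a - a_builtin))       % should be near zero on modern MATLAB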

Modified Code

% ------Original------
% M-file 9.7
% LPCCrossSynthesis.m
% cross-synthesis with LPC
% --------------------
% ------Modified------
% LPCCrossSynthesisVoices.m
% This code accepts two input voices and uses the excitation of the first voice
% with the spectral envelope of the second voice to create the output voice.
%

clear;

%----- USER DATA -----


%[DAFx_in1,FS] = wavread('Toms_diner.wav'); % sound 1: excitation
%[DAFx_in1,FS] = wavread('Toms_diner_6.wav'); % sound 1: excitation
[DAFx_in1,FS] = wavread('I.wav'); % sound 1: excitation
%[DAFx_in1,FS] = wavread('Choice_Judgement.wav'); % sound 1: excitation
%[DAFx_in1,FS] = wavread('Abyss.wav'); % sound 1: excitation
%DAFx_in2 = wavread('Abyss.wav'); % sound 2: spectral env.
DAFx_in2 = wavread('Choice_Judgement.wav'); % sound 2: spectral env.
%DAFx_in2 = wavread('Toms_diner_6.wav'); % sound 2: spectral env.
%DAFx_in2 = wavread('Toms_diner.wav'); % sound 2: spectral env.
long = 400; % block length for calculation of coefficients
hopsize = 160; % hop size (is 160)
order = 100 % order of the LPC
order1 = 50 % order for the excitation

%----- initializations -----


ly = min(length(DAFx_in1), length(DAFx_in2));
DAFx_in1 = [zeros(order, 1); DAFx_in1; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in1));
DAFx_in2 = [zeros(order, 1); DAFx_in2; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in2));
DAFx_out = zeros(ly,1); % result sound
exc = zeros(ly,1); % excitation sound

w = hanningz(long); % window
N_frames = floor((ly-order-long)/hopsize); % number of frames

%----- Cross-synthesis -----


tic
for j=1:N_frames
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
[A1, g1] = calc_lpc(DAFx_in1(k+1:k+long).*w, order1);
% IMPORTANT function "lpc" does not give correct results for MATLAB 6 !!!

gain(j) = g; %
ae = - A(2:order+1); % LPC coeff. of excitation
for n=1:hopsize
excitation1 = (A1/g1) * DAFx_in1(k+n:-1:k+n-order1);
exc(k+n) = excitation1;
DAFx_out(k+n) = ae*DAFx_out(k+n-1:-1:k+n-order)+g*excitation1;
end
end
toc
%----- output -----
DAFx_out = DAFx_out(order+1:length(DAFx_out)) / max(abs(DAFx_out));
soundsc(DAFx_out, FS)
DAFx_out_norm = .99* DAFx_out/max(abs(DAFx_out)); % scale for wav output
wavwrite(DAFx_out_norm, FS, 'VoicesCrossLPC')

% ------Original------
% M-file 9.7
% LPCCrossSynthesis.m
% cross-synthesis with LPC
% --------------------
% ------Modified------
% LPCHighPitch.m
% This code accepts an input voice and shifts the spectrum of its excitation
% by a specified frequency offset (in Hz) before re-synthesis.
% It leaves the spectral envelope unchanged.

clear;

%----- USER DATA -----


DAFx_in1 = wavread('moore_guitar.wav'); % sound 1 (A1/g1 below are computed from it but unused in this variant)
[DAFx_in2,FS] = wavread('Toms_diner.wav'); % sound 2: spectral env.
%[DAFx_in2,FS] = wavread('moore_guitar.wav'); % sound 2: spectral env.
long = 400; % block length for calculation of coefficients
hopsize = 160; % hop size (is 160)
order = 20 % order of the LPC
order1 = 20 % order for the excitation
shiftOffset = 1000;

%----- initializations -----


ly = min(length(DAFx_in1), length(DAFx_in2));
DAFx_in1 = [zeros(order, 1); DAFx_in1; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in1));
DAFx_in2 = [zeros(order, 1); DAFx_in2; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in2));
DAFx_out = zeros(ly,1); % result sound
exc = zeros(ly,1); % excitation sound
w = hanningz(long); % window
N_frames = floor((ly-order-long)/hopsize); % number of frames
exc_shifted = zeros(ly,1);

%shiftedDAFx_in2 = simplepitchShift(DAFx_in2, FS, shiftOffset);

%---- Shifting the excitation ----


for j=1:N_frames
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
gain(j) = g; %

for n=1:hopsize
excitationSamples = DAFx_in2(k+n:-1:k+n-order);
excitation1 = (A/g) * excitationSamples;
exc(k+n) = excitation1;
end
end

exc_shifted = simplepitchShift(exc, FS, shiftOffset);

%----- Cross-synthesis -----


tic
for j=1:N_frames
j % echo the frame index as a progress indicator
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
[A1, g1] = calc_lpc(DAFx_in1(k+1:k+long).*w, order1);
% IMPORTANT function "lpc" does not give correct results for MATLAB 6 !!!

gain(j) = g; %
ae = - A(2:order+1); % LPC coeff. of excitation

for n=1:hopsize
%excitation1 = (A1/g1) * DAFx_in1(k+n:-1:k+n-order1);
%shiftedExcitationSamples = shiftedDAFx_in2(k+n:-1:k+n-order);
%excitation1 = (A/g) * shiftedExcitationSamples;
%excitation1 = (A/g) * DAFx_in2(k+n:-1:k+n-order);
%exc(k+n) = excitation1;
DAFx_out(k+n) = ae*DAFx_out(k+n-1:-1:k+n-order)+g*exc_shifted(k+n);
end
end
toc
%----- output -----
DAFx_out = DAFx_out(order+1:length(DAFx_out)) / max(abs(DAFx_out));
soundsc(DAFx_out, FS)
DAFx_out_norm = .99* DAFx_out/max(abs(DAFx_out)); % scale for wav output
%wavwrite(DAFx_out_norm, FS, 'CrossLPC')
%wavwrite(DAFx_out_norm, FS, 'SameAsInput')
wavwrite(DAFx_out_norm, FS, 'PitchShifted')

function shiftedSamples = simplepitchShift(samples, FS, shiftOffsetFrequency)

%shiftOffset = -7000; %offset in number of frequency bins


%[DAFx_in, FS] = wavread('Toms_diner.wav');
%samples = DAFx_in(1:20000); % samples to be shifted

N = length(samples);
Nfft = N;
W = hanningz(N);

shiftOffset = floor((shiftOffsetFrequency * Nfft)/FS);

samplesW = samples.*W;

spectrum = fft(samplesW, Nfft);


shiftedSpectrum = zeros(Nfft,1);
if (shiftOffset > 0)
for n = Nfft:-1:(shiftOffset+1)
shiftedSpectrum(n) = spectrum(n-shiftOffset);
end
elseif (shiftOffset < 0)
for n = 1:1:(Nfft+shiftOffset)
shiftedSpectrum(n) = spectrum(n-shiftOffset);
end
else
shiftedSpectrum = spectrum;
end

% Mirror the lower half of the spectrum into the upper half so that the
% shifted spectrum stays roughly symmetric; real(ifft(...)) below discards
% any residual imaginary part (exact realness would require conjugate symmetry)


for i = 1:(Nfft/2)
shiftedSpectrum(Nfft-i+1) = shiftedSpectrum(i);
end

figure(1)
subplot(2,1,1);
plot(abs(spectrum), '-red');
title('Original excitation');
subplot(2,1,2);
plot(abs(shiftedSpectrum), '-blue');
title('Shifted excitation');

shiftedSamples = real(ifft(shiftedSpectrum));
shiftedSamples = shiftedSamples(1:N);
%soundsc(shiftedSamples, FS);

% ------Original------
% M-file 9.7
% LPCCrossSynthesis.m
% cross-synthesis with LPC
% --------------------
% ------Modified------
% LPCSineExcitation.m
% This code accepts an input voice and replaces its excitation with
% a constant frequency sine wave of frequency sineF

clear;

%----- USER DATA -----


DAFx_in1 = wavread('moore_guitar.wav'); % sound 1 (unused: the excitation is replaced by the sine wave below)
%[DAFx_in2,FS] = wavread('Toms_diner.wav'); % sound 2: spectral env.
[DAFx_in2,FS] = wavread('moore_guitar.wav'); % sound 2: spectral env.
long = 400; % block length for calculation of coefficients
hopsize = 160; % hop size (is 160)
order = 20 % order of the LPC
order1 = 20 % order for the excitation
shiftOffset = 0;

%----- initializations -----
ly = min(length(DAFx_in1), length(DAFx_in2));
DAFx_in1 = [zeros(order, 1); DAFx_in1; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in1));
DAFx_in2 = [zeros(order, 1); DAFx_in2; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in2));
DAFx_out = zeros(ly,1); % result sound
exc = zeros(ly,1); % excitation sound
w = hanningz(long); % window
N_frames = floor((ly-order-long)/hopsize); % number of frames
sineF = 100;
t = zeros(ly,1);

%shiftedDAFx_in2 = simplepitchShift(DAFx_in2, FS, shiftOffset);

% Sine wave excitation


sampleT = 1/FS;
t = (0:1:(ly-1))'*sampleT;
exc = sin(2*pi*sineF*t);

%exc_shifted = simplepitchShift(exc, FS, shiftOffset);

%----- Cross-synthesis -----


tic
for j=1:N_frames
j % echo the frame index as a progress indicator
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
[A1, g1] = calc_lpc(DAFx_in1(k+1:k+long).*w, order1);
% IMPORTANT function "lpc" does not give correct results for MATLAB 6 !!!

gain(j) = g; %
ae = - A(2:order+1); % LPC coeff. of excitation

for n=1:hopsize
%excitation1 = (A1/g1) * DAFx_in1(k+n:-1:k+n-order1);
%shiftedExcitationSamples = shiftedDAFx_in2(k+n:-1:k+n-order);
%excitation1 = (A/g) * shiftedExcitationSamples;
%excitation1 = (A/g) * DAFx_in2(k+n:-1:k+n-order);
%exc(k+n) = excitation1;
DAFx_out(k+n) = ae*DAFx_out(k+n-1:-1:k+n-order)+g*exc(k+n);
end
end
toc
%----- output -----
DAFx_out = DAFx_out(order+1:length(DAFx_out)) / max(abs(DAFx_out));
soundsc(DAFx_out, FS)
DAFx_out_norm = .99* DAFx_out/max(abs(DAFx_out)); % scale for wav output
%wavwrite(DAFx_out_norm, FS, 'CrossLPC')
%wavwrite(DAFx_out_norm, FS, 'SameAsInput')
%wavwrite(DAFx_out_norm, FS, 'PitchShifted')
wavwrite(DAFx_out_norm, FS, 'SineExcitation')

% ------Original------
% M-file 9.7
% LPCCrossSynthesis.m
% cross-synthesis with LPC
% --------------------
% ------Modified------
% LPCHarmonicExcitation.m

% This code accepts an input voice and replaces its excitation with a harmonic series of
% specified fundamental frequency (sineFBase)
% and specified number of harmonics (harmonics).

clear;

%----- USER DATA -----


DAFx_in1 = wavread('Toms_diner_6.wav'); % sound 1 (unused: the excitation is replaced by the harmonic series below)
%[DAFx_in2,FS] = wavread('Toms_diner.wav'); % sound 2: spectral env.
[DAFx_in2,FS] = wavread('Choice_Judgement.wav'); % sound 2: spectral env.
%[DAFx_in2,FS] = wavread('moore_guitar.wav'); % sound 2: spectral env.
long = 400; % block length for calculation of coefficients
hopsize = 160; % hop size (is 160)
order = 50 % order of the LPC
order1 = 50 % order for the excitation
shiftOffset = 0;

%----- initializations -----


ly = min(length(DAFx_in1), length(DAFx_in2));
DAFx_in1 = [zeros(order, 1); DAFx_in1; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in1));
DAFx_in2 = [zeros(order, 1); DAFx_in2; zeros(order-mod(ly,hopsize),1)] / max(abs(DAFx_in2));
DAFx_out = zeros(ly,1); % result sound
exc = zeros(ly,1); % excitation sound
w = hanningz(long); % window
N_frames = floor((ly-order-long)/hopsize); % number of frames
sineFBase = 150;
harmonics = 100;
t = zeros(ly,1);

%shiftedDAFx_in2 = simplepitchShift(DAFx_in2, FS, shiftOffset);

% Harmonic series excitation


sampleT = 1/FS;
t = (0:1:(ly-1))'*sampleT;

for h=1:harmonics
exc = exc + (1/h)*sin(2*pi*sineFBase*h*t);
end

%exc = simplepitchShift(exc, FS, shiftOffset);

%----- Cross-synthesis -----


tic
for j=1:N_frames
j % echo the frame index as a progress indicator
k = order + hopsize*(j-1); % offset of the buffer
[A, g] = calc_lpc(DAFx_in2(k+1:k+long).*w, order);
[A1, g1] = calc_lpc(DAFx_in1(k+1:k+long).*w, order1);
% IMPORTANT function "lpc" does not give correct results for MATLAB 6 !!!

gain(j) = g; %
ae = - A(2:order+1); % LPC coeff. of excitation

for n=1:hopsize
%excitation1 = (A1/g1) * DAFx_in1(k+n:-1:k+n-order1);
%exc(k+n) = excitation1;
DAFx_out(k+n) = ae*DAFx_out(k+n-1:-1:k+n-order)+g*exc(k+n);
end
end

toc
%----- output -----
DAFx_out = DAFx_out(order+1:length(DAFx_out)) / max(abs(DAFx_out));
soundsc(DAFx_out, FS)
DAFx_out_norm = .99* DAFx_out/max(abs(DAFx_out)); % scale for wav output
%wavwrite(DAFx_out_norm, FS, 'CrossLPC')
%wavwrite(DAFx_out_norm, FS, 'SameAsInput')
%wavwrite(DAFx_out_norm, FS, 'PitchShifted')
wavwrite(DAFx_out_norm, FS, 'HarmonicExcitation')
