
REAL-TIME VOICE CHANGER IN MAX/MSP

Shengwen Yang

Department of Electrical and Computer Engineering, University of Rochester


ABSTRACT

This paper describes applications of pitch shifting to real-time
audio signals, as developed for a voice changer. The project explores
pitch shifting, time stretching and phase modulation of live signals.
Several new techniques, such as pitch shifting, the vocoder and the
Doppler effect, have been adopted and are described.

Index Terms—Voice changing, pitch shifting, vocoder, Doppler effect,
delay window, fundamental frequency, formant frequency, real-time,
phasor, Max/MSP, C++

1. INTRODUCTION

A voice changer modifies input voices in real time during a
conversation, and it works on every platform. This type of
entertainment device has become very popular among young people
recently. To modify and disguise a human voice captured by a
microphone under any circumstances, and to add another dimension of
creativity, voice changers work behind the scenes, intercepting audio
from the microphone before it reaches other applications, so there is
no need to change any configuration or settings in those programs. It
has become easy to simply run the program and start creating voice
distortions in minutes.

2. PITCH SHIFTING

Pitch shifters are included in most audio processors today. The
simplest form of pitch shifting couples pitch and duration: raising
the pitch shortens the duration, and lowering the pitch lengthens it.
At the far extremes of pitch shifting, the resultant sound bears
little resemblance to the original. Usually there are two synthesis
methods. One is based on a bank of sine wave oscillators, and the
other is based on an inverse FFT. These techniques can also be used to
transpose an audio sample while holding speed or duration constant.
This may be accomplished by time stretching and then resampling back
to the original length. Alternatively, the frequencies of the
sinusoids in a sinusoidal model may be altered directly, and the
signal reconstructed at the appropriate time scale.

2.1. Frequency Shifting
A device that raises or lowers the frequency of an input signal is
called a frequency shifter; it uses a complex amplitude modulation
technique. Unlike a pitch shifter, a frequency shifter does not
preserve the harmonic relationships between the various tones and
harmonics in the input signal, because it moves every component by the
same number of hertz. For this reason, sounds processed by a frequency
shifter start to sound very unnatural with only a small amount of
shift. However, it is feasible to implement a frequency shifter with
all-analog circuitry, which means that frequency shifters were
available as far back as 1950. Recently, there has been a resurgence
of interest in frequency shifters as effects, and models intended for
use with modular synthesizers are currently being produced.

2.2. Pitch Shifting
Pitch scaling, or pitch shifting, is the counterpart of time
stretching: it changes the pitch of an audio signal without affecting
its speed. Similar methods can change speed, pitch, or both at the
same time, or in a time-varying way. These processes are used to match
the pitches and tempos of two pre-recorded clips for mixing when the
clips cannot be reperformed or resampled. For instance, a drum track
containing no pitched instruments could be moderately resampled for
tempo without adverse effects, but a pitched track could not. They are
also used to generate an effect, such as increasing the range of an
instrument (like pitch shifting a guitar down an octave).
The pitch shifting technique discussed in this paper is a sound
effects unit that raises or lowers the pitch of an audio signal by a
preset interval. For example, a pitch shifter set to increase the
pitch by a fourth will raise each note three diatonic steps above the
note actually played. Simple pitch shifters raise or lower the pitch
by one or two octaves, while more sophisticated devices offer a range
of interval alterations.

3. VOCODER

A vocoder is a category of voice codec that analyzes and synthesizes
the human voice signal for audio data compression, multiplexing,
voice encryption, voice transformation, etc.
The human voice consists of sounds created by the opening and closing
of the glottis by the vocal cords, which produces a periodic waveform
with many harmonics. This basic
sound is then filtered by the nose and throat (a complicated resonant
piping system) to produce differences in harmonic content (formants)
in a controlled way, generating the wide variety of sounds used in
speech and conversation. The vocoder examines speech by measuring how
its spectral characteristics change over time. This results in a
series of signals representing these modified frequencies at any
particular time as the user speaks. In simple terms, the signal is
split into a large number of frequency bands (the larger this number,
the more accurate the analysis), and the level of signal present in
each frequency band gives an instantaneous representation of the
spectral energy content. The vocoder therefore dramatically reduces
the amount of information needed to store speech, from a complete
recording to a series of numbers. Since the vocoder process sends only
the parameters of the vocal model over the communication link, instead
of a point-by-point recreation of the waveform, the bandwidth required
to transmit speech can be reduced significantly.

4. PHASOR

In physics and engineering, a phasor (a portmanteau of phase vector)
is a complex number representing a sinusoidal function whose amplitude
(A), angular frequency (ω), and initial phase (θ) are time-invariant.
It is related to a more general concept called analytic
representation, which decomposes a sinusoid into the product of a
complex constant and a factor that encapsulates the frequency and time
dependence. In this project, however, we use the phasor, including a
low-frequency oscillator, as a sound processor that filters a signal
by creating a series of peaks and troughs in the frequency spectrum.
The position of the peaks and troughs of the waveform being affected
is typically modulated so that they vary over time, creating a
sweeping effect.

Fig 1. Phasors at both audio and sub-audio frequencies

In this project, the phasor used in the Max patch is a non-bandlimited
sawtooth-waveform signal generator which can be used as an audio
signal or a sample-accurate timing/control signal. A 'smoother'
sounding sawtooth oscillator also exists, but it is not as suitable
for timing/control. When specifying the behavior of a phasor by
interval rather than frequency, we make use of the tempo-relative Max
time format syntax.

5. MAX/MSP

Max/MSP is a visual programming language written in C++ and produced
by Cycling '74 that helps us build complex, interactive programs
without any prior experience writing code. MSP is a DSP plug-in for
Max that allows real-time audio synthesis. Max/MSP is especially
useful for building audio, MIDI, video, and graphics applications
where user interaction is needed. Max/MSP is split into several parts:
"Max" handles discrete operations and MIDI and is the easiest place to
start getting familiar with the tool, while "MSP" deals with signal
processing and audio.

Fig 2. A well-organized Max patch
For these reasons, we use Max as the implementation platform: it has a
larger community, more interesting existing projects, more libraries,
and built-in documentation. Max is easier to work with because of its
key bindings, and it is under more active development than Pure Data.
Its GUI objects are also more attractive and offer more options; for
example, the patch cords in Max are not just straight lines but smooth
curves that can be routed around.

6. IMPLEMENTATION

6.1. Voice Changing

Voice changers alter the tone or pitch of the user's voice, add
distortion to it, or do a combination of these, and they vary greatly
in price and sophistication. In this paper, we discuss feasible
programs which are capable of changing the pitch and timbre of the
user's voice, applying special effects, and performing graphic
equalization almost in real time by stretching the fundamental
frequency and the formant frequency. Fundamental frequency and formant
frequency are two essential features of the human voice, and their
distribution determines how different people sound.

People            Fundamental frequency (Hz)   Formant frequency
Men Voice         [50, 180]                    Low
Women Voice       [160, 380]                   Middle
Children Voice    [400, 1000]                  High
Fig 3. Distribution of formant frequencies
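
The ranges in Fig 3 can be written down as a small lookup, which also
makes visible that the men's and women's ranges overlap and that no
range covers 380 Hz to 400 Hz. The Python sketch below is illustrative
only: the names F0_RANGES_HZ, classify_f0 and semitone_shift are
hypothetical, hertz is assumed as the unit, and the semitone formula is
the standard equal-temperament relation rather than something taken
from the Max patches.

```python
import math

# Fundamental-frequency ranges from Fig 3 (unit assumed to be Hz).
F0_RANGES_HZ = {
    "men": (50.0, 180.0),
    "women": (160.0, 380.0),
    "children": (400.0, 1000.0),
}


def classify_f0(f0_hz):
    """Return every group whose range contains f0_hz (ranges may overlap)."""
    return [name for name, (lo, hi) in F0_RANGES_HZ.items() if lo <= f0_hz <= hi]


def semitone_shift(source_f0_hz, target_f0_hz):
    """Shift, in equal-tempered semitones, that maps the source F0 onto the target."""
    return 12.0 * math.log2(target_f0_hz / source_f0_hz)


if __name__ == "__main__":
    print(classify_f0(170.0))            # inside both the men's and women's ranges
    print(classify_f0(390.0))            # inside none of the ranges
    print(semitone_shift(110.0, 220.0))  # doubling the frequency is one octave
```

A conversion effect such as the "Man, Woman and Child" patch would need
a shift of this kind between a measured source pitch and a chosen
target pitch; how the target is picked is left open here.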

6.2. Implement
In this project, we create several Max patches to implement voice
changers with diverse functions: for example, a basic real-time voice
changer with a “World of Warcraft” effect, an advanced one with a
“Man, Woman and Child voice converting” effect, and an ultimate
version, “Robot sound with customized tones”, based on a vocoder.

Fig 4. Max patch of a Real-time voice changer

With Max/MSP, we create the voice changer mentioned above by taking
normal voices and shifting their pitches. In the first patch, which
implements the Warcraft effect, tapin and tapout objects are used; the
principle on which this shifter operates is the Doppler effect. We
also create a phasor that varies the delay between 0 and 1, and scale
it by 100. When the phasor frequency is positive, the delay applied to
the sound increases over time, and this is what decreases the
frequency. If we set the phasor to 2, the output voice is lower; if 4,
the output is much lower.

Fig 5. Max patch of a Real-time voice changer

The simple relationship between the input frequency and the output
frequency is as follows:

$f_{out} = f_{in} \cdot \left(1 - p_{frequency} \cdot \frac{dw}{1000}\right)$

where $f_{in}$ is the input frequency and $f_{out}$ is the output
frequency, $p_{frequency}$ is the frequency of the phasor, and $dw$ is
the delay window in milliseconds. For example, a phasor frequency of 4
and a delay window of 100 give a factor of 1 - 0.4 = 0.6: if the input
frequency is 100 Hz, the output frequency is 60 Hz, and if the input
is 1000 Hz, the output is 600 Hz. Pitches are therefore shifted down.
When the phasor frequency is negative, the opposite happens and
pitches shift up.

Fig 6. Max patch of a voice changer

However, the phasor introduces annoying clicks when we shift pitches
down. The phasor ramps from 0 to 1 and then jumps straight back down
to 0, and this jump is heard as a click. To deal with this, we window
the sound, multiplying it by 0 as it approaches each jump. By applying
a cosine ramp to the phasor, after subtracting an offset and scaling
it, we bring the signal to 0 at those clicks. In addition, to make it
sound better, we add a second phasor that works together with the
original one but is shifted in phase: we duplicate the original phasor
and add 0.5 to the copy, which gives a second phasor 180 degrees out
of phase with the first. Finally, we place two scopes at the bottom of
the patch to inspect the corresponding waveforms.

7. CONCLUSION

A voice changer is more than an entertainment device for young people
or a novelty seen on TV; it is also a research topic with a large body
of academic principles and knowledge behind it, ranging from pitch
shifting and the vocoder to the Doppler effect. The plugins in a voice
changer contain some novel and musically useful applications of the
phase vocoder algorithm. However, there are clear directions for
further development and research. In pitch shifting, more could be
done with other types of pitch sieves. Also, a compositional algorithm
could intelligently control the sifting process. In the future, a
cross-synthesis technique will be developed, able to analyze
multiple real-time signals with one sound’s harmonics as a pitch sieve
for the other. As this work concentrated on developing new and
entertaining applications of pitch shifting and the phase vocoder,
improvements such as phase locking and multi-resolution peak detection
would clearly improve the overall quality of the processed sounds.
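
As a supplementary illustration of the delay-based shifter described
in Section 6.2, the following Python sketch reproduces the idea
offline: a sawtooth ramp sweeps a delay time across the delay window,
a second ramp half a cycle later feeds a second tap, and each tap is
faded to zero where its ramp jumps. It is a minimal sketch under stated
assumptions and not a transcription of the Max patch: the function
names are hypothetical, the raised-cosine window, the wrap of the
second ramp back into the range 0 to 1, and the linear interpolation
are all choices made here, and the delay window is taken to be in
milliseconds.

```python
import math


def output_frequency(f_in, p_frequency, dw_ms):
    """Section 6.2 relationship: f_out = f_in * (1 - p_frequency * dw / 1000)."""
    return f_in * (1.0 - p_frequency * dw_ms / 1000.0)


def read_linear(x, pos):
    """Read x at fractional index pos (linear interpolation, silence before 0)."""
    if pos < 0.0:
        return 0.0
    i = int(pos)
    frac = pos - i
    nxt = x[i + 1] if i + 1 < len(x) else x[i]
    return (1.0 - frac) * x[i] + frac * nxt


def pitch_shift(x, sample_rate, p_frequency, dw_ms):
    """Two-tap variable-delay pitch shifter (assumed structure, see lead-in)."""
    dw_samples = dw_ms / 1000.0 * sample_rate
    y = []
    for n in range(len(x)):
        # Sawtooth ramp in [0, 1); it falls instead of rising if p_frequency < 0.
        ramp = (p_frequency * n / sample_rate) % 1.0
        out = 0.0
        for offset in (0.0, 0.5):  # second ramp is 180 degrees out of phase
            r = (ramp + offset) % 1.0
            # Raised-cosine window: zero where the ramp jumps, one at mid-ramp.
            window = 0.5 - 0.5 * math.cos(2.0 * math.pi * r)
            out += window * read_linear(x, n - r * dw_samples)
        y.append(out)
    return y


def tone_level(y, sample_rate, freq):
    """Amplitude of the component of y at freq (single-bin DFT)."""
    re = sum(v * math.cos(2.0 * math.pi * freq * k / sample_rate)
             for k, v in enumerate(y))
    im = sum(v * math.sin(2.0 * math.pi * freq * k / sample_rate)
             for k, v in enumerate(y))
    return 2.0 * math.hypot(re, im) / len(y)


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / sr) for n in range(sr)]
    shifted = pitch_shift(tone, sr, 4.0, 100.0)[1000:]  # skip the start-up
    print(output_frequency(1000.0, 4.0, 100.0))  # predicted output frequency
    print(tone_level(shifted, sr, 600.0), tone_level(shifted, sr, 1000.0))
```

With these window and offset choices the two windows always sum to
one, so each tap fades out while the other takes over. Feeding the
sketch a 1000 Hz tone with a phasor frequency of 4 and a delay window
of 100 moves most of the energy to the predicted output frequency.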

