
A Mechanical Voice System: Construction of Vocal Cords and its Pitch Control

Toshio Higashimoto and Hideyuki Sawada

Dept. of Intelligent Mechanical Systems Engineering, Faculty of Engineering, Kagawa University 2217-20, Hayashi-cho, Takamatsu-city, Kagawa, 761-0396, JAPAN
Abstract: A mechanical model of the human vocal system is being developed using mechatronics technology under feedback control. While various methods of vocal sound production have been actively studied, a mechanical construction of the vocal system is considered advantageous for realizing natural vocalization through its fluid dynamics. Several motors are used to manipulate the mechanical voice system. By mimicking human vocalization, the system learns the relations between the motor control parameters and the produced vocal sounds using auditory feedback with neural networks. This paper introduces the construction of the vocal cords and their adaptive control for pitch learning. The mechanical system generates vowel and consonant sounds of different pitches by dynamically controlling the vocal cords, vocal tract and nasal cavity.

1. Introduction
Only humans are able to use words as the primary medium of verbal communication, although almost all animals have voices. Different vocal sounds are generated by the complex movements of the vocal organs under feedback control using the auditory system. Vocal sounds and human vocalization mechanisms have interested many researchers for a long time [1][2], and computerized voice production and recognition have become essential technologies in the recent development of flexible human-machine interfaces. Various techniques for sound production have been reported. Algorithmic syntheses have taken the place of analogue circuit syntheses and have become widely used [2]-[5]. Sound sampling methods and physical-model-based syntheses are typical techniques, which are expected to provide different types of realistic vocal sounds [6]. In addition to these algorithmic synthesis techniques, a mechanical approach using a phonetic or vocal model imitating the human vocalization mechanism would be a valuable and notable objective. Several mechanical constructions of a human vocal system to realize human-like speech have been reported [1][7][8]. In most of this research, however, the mechanical reproductions of the human vocal system were mainly designed by referring to X-ray images and FEM analysis, and control methods for natural vocalization have not been considered so far. In fact, since the behavior of the vocal organs has not been sufficiently investigated, owing to nonlinear factors of fluid dynamics that are yet to be overcome, control of the mechanical system has often been difficult to establish.

We are developing a mechanical voice generation system together with its control mechanism for voice production imitating human vocalization. The fundamental frequency and the spectrum envelope determine the principal characteristics of a sound. The former is a characteristic of the source sound generated by a vibrating object, and the latter is shaped by resonance effects. In vocalization, the vibration of the vocal cords generates a source sound, and the sound wave is then led to the vocal tract, which works as a filter to determine the spectrum envelope.

We have previously constructed a motor-controlled mechanical model with a vocal cord and a vocal tract [9]-[11]. By introducing an auditory feedback mechanism with an adaptive control algorithm for pitch and phoneme, the system is able to autonomously acquire the control method of the mechanical system to produce stable vocal sounds imitating human vocalization [12]. That system used an artificial vocal cord of the kind used by people who have had their vocal cords removed because of a glottal disease. The vibration of a 5 mm wide rubber strip stretched over a plastic body produced the vocal sound source. The tension of the rubber was manipulated by applying a tensile force with a motor, so that the fundamental frequency of the generated vocal sound could be changed easily. In this work we paid attention to making the quality of the sound generated by the voice system close to that of a human, and worked to develop human-like vocal cords. This paper describes the construction of the vocal cords and their control for changing the fundamental frequency, together with an adaptive learning mechanism.

2. Human Vocal System and Voice Generation

Human vocal sounds are generated by the coordinated operation of vocal organs such as the lungs, trachea, vocal cords, vocal tract, tongue and muscles. In human verbal communication, the sound is perceived as words, which consist of vowels and consonants. The lungs function as an air tank, and the airflow through the trachea causes the vocal cords to vibrate, producing the source sound of the voice. The glottal wave is led to the vocal tract, which works as a sound filter to form the spectrum envelope of the voice. The fundamental frequency and the volume of the sound source are varied by changing physical parameters such as the stiffness of the vocal cords and the amount of airflow from the lungs, and these parameters are uniquely controlled when we sing a song. In contrast, the spectrum envelope, which is necessary for the pronunciation of words consisting of vowels and consonants, is formed by the inner shape of the vocal tract and the mouth, which are governed by the complex movements of the jaw, tongue and muscles. Vowel sounds are radiated from a relatively stable configuration of the vocal tract, while consonants are generally produced by short-time dynamic motions of the vocal apparatus. The dampness and viscosity of the organs greatly influence the timbre of the generated sounds, as we may experience when we have a sore throat. Appropriate configurations of the vocal tract for the production of phonemes are acquired as infants grow, through repeated trial and error in hearing and vocalizing vocal sounds.
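The division of roles described above (the source sets the fundamental frequency, the tract shapes the spectrum envelope) can be illustrated with a minimal numerical sketch. Everything in it is an assumption made for illustration and not part of the mechanical system: the source is an idealized impulse train, the tract is reduced to a single two-pole resonator, and the sampling rate, the 100 Hz pitch and the 500 Hz resonance are hypothetical values.

```python
import math

def pulse_train(f0_hz, fs_hz, n_samples):
    """Idealized source sound: one unit impulse per pitch period (assumption)."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, fs_hz, pole_radius=0.97):
    """Two-pole resonator standing in for a single vocal-tract resonance."""
    theta = 2.0 * math.pi * center_hz / fs_hz
    a1 = 2.0 * pole_radius * math.cos(theta)
    a2 = -pole_radius ** 2
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out

FS = 8000   # sampling rate in Hz (hypothetical)
F0 = 100    # source pitch in Hz (hypothetical)
source = pulse_train(F0, FS, 1600)
voiced = resonator(source, center_hz=500, fs_hz=FS)
period = FS // F0  # samples per pitch period
```

Once the start-up transient has decayed, the filtered signal repeats with the period of the source, so moving `center_hz` changes the timbre but not the pitch.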

3. Mechanical Model for Vocalization

3-1. Configuration of Mechanical Voice System

As shown in Figure 1, the mechanical voice system mainly consists of an air compressor, artificial vocal cords, a resonance tube, a nasal cavity, and a microphone connected to a sound analyzer, which correspond to the lungs, vocal cords, vocal tract, nasal cavity and auditory system of a human, respectively. The air in the compressor is compressed to 8000 hPa, while the pressure of the air from the lungs is about 200 hPa above atmospheric pressure. A pressure reduction valve is fitted at the outlet of the air compressor so that the pressure is reduced to be nearly equal to the air pressure in the trachea. The valve also reduces the pressure fluctuation in the compressor during compression and decompression. The decompressed air is led to the vocal cords via an airflow control valve, which controls the voice volume. The resonance tube is attached to the vocal cords to modify the resonance characteristics. The sound analyzer plays the role of the auditory system. It performs pitch extraction and analysis of the resonance characteristics of the generated sound in real time, both of which are necessary for the auditory feedback control. The system controller manages the whole system by listening to the produced sounds and generating motor control commands, based on the auditory feedback control mechanism.
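The text does not state how the sound analyzer extracts the pitch. As a rough illustration only, the sketch below uses autocorrelation, one common approach, which is an assumption here and not necessarily the method used in the system. The function name, the search range of 80 to 400 Hz and the synthetic test tone are all hypothetical.

```python
import math

def estimate_pitch(samples, fs_hz, fmin_hz=80.0, fmax_hz=400.0):
    """Return the frequency whose lag maximizes the autocorrelation."""
    min_lag = int(fs_hz / fmax_hz)
    max_lag = int(fs_hz / fmin_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return fs_hz / best_lag

FS = 8000  # sampling rate in Hz (hypothetical)
# Synthetic 200 Hz tone with a weaker second harmonic, 50 ms long.
tone = [math.sin(2 * math.pi * 200 * n / FS)
        + 0.5 * math.sin(2 * math.pi * 400 * n / FS)
        for n in range(400)]
pitch = estimate_pitch(tone, FS)
```

The resolution of this estimate is limited to integer lags, so a real analyzer would likely interpolate around the peak or use a longer frame.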

Figure 1: System Configuration (block labels: Lung and Trachea; Vocal Cords; Vocal Tract & Nasal Cavity; Learning Part; Air Compressor; Air Flow; Motor1; Motor2; Resonance Tube; Nasal Cavity; 5 Motors; Microphone; Auditory Feedback; Sound Analyzer; Pitch-Motor Controller; Phoneme-Motor Controller; System Controller)

3-2. Construction of Resonance Tube and Nasal Cavity

The human vocal tract is a non-uniform tube about 170 mm long in adult males. Its cross-sectional area varies from 0 to 20 cm² under the control for vocalization. A nasal tract with a total volume of 60 cm³ is coupled to the vocal tract. Nasal sounds such as /m/ and /n/ are normally excited by the vocal cords and resonated in the nasal cavity. They are generated by lowering the soft palate and closing the oral passage (for example, at the lips), so that air is not radiated from the mouth and the sound resonates in the nasal cavity. The closed vocal tract works as a lateral branch resonator and also affects the resonance characteristics of nasal sounds. The /m/ and /n/ sounds can be distinguished from each other by the difference in the articulatory positions of the tongue and mouth. In the mechanical system, a resonance tube serving as a vocal tract is attached at the sound outlet of the artificial vocal cords. It works as a resonator of the source sound generated by the vocal cords. It is made of silicone rubber with a length of 180 mm and a diameter of 36 mm, which corresponds to a cross-sectional area of 10.2 cm², as shown in Figures 2 and 3. The silicone rubber is molded with the softness of human skin, which contributes to the quality of the resonance characteristics. In addition, a nasal cavity made of plaster is connected to the intake part of the resonance tube to vocalize nasal sounds such as /m/ and /n/.
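The stated cross-sectional area can be checked from the 36 mm diameter, and the 180 mm length gives a rough idea of where the resonances of the undeformed tube would lie. The resonance estimate below rests on assumptions that are not in the text: a uniform tube closed at the vocal-cord end and open at the outlet (a quarter-wave resonator), and a sound speed of 350 m/s. The function names are hypothetical.

```python
import math

def cross_section_cm2(diameter_mm):
    """Area of a circular cross-section, converted from mm to cm."""
    radius_cm = diameter_mm / 20.0
    return math.pi * radius_cm ** 2

def quarter_wave_resonances(length_m, count=3, sound_speed=350.0):
    """Resonances of an ideal uniform tube, closed at one end, open at the other."""
    return [(2 * k - 1) * sound_speed / (4.0 * length_m)
            for k in range(1, count + 1)]

area = cross_section_cm2(36.0)                # about 10.2 cm^2, as stated
resonances = quarter_wave_resonances(0.180)   # about 486, 1458 and 2431 Hz
```

These figures apply only to the undeformed tube; once the motors change the cross-sectional area along the tube, the resonances shift away from this uniform-tube estimate.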

By applying displacement forces with stainless steel bars from the outside, the cross-sectional area of the tube is manipulated so that the resonance characteristics change according to the deformation of the inner areas of the resonator. DC motors are placed at 5 positions xj (j=1-5) from the intake side of the tube to the outlet side as shown in Figure 2, and the displacement forces Pj(xj) are applied according to the control commands from the phoneme-motor controller. A nasal cavity is coupled with the resonance tube so that human-like nasal sounds can be vocalized by controlling the mechanical parts. A sliding valve playing the role of the soft palate is installed at the connection between the resonance tube and the nasal cavity to select between nasal and normal sounds. For the generation of the nasal sounds /n/ and /m/, the sliding valve is opened to lead the air into the nasal cavity as shown in Figure 4(a). The /n/ consonant is generated by closing the middle position of the vocal tract and then releasing the air to speak a vowel sound. For the /m/ consonant, the outlet part is first closed to stop the air and is then opened to vocalize a vowel. The difference between the /n/ and /m/ consonants is basically the position at which the vocal tract is narrowed. In generating the plosive sounds /p/ and /t/, the mechanical system closes the sliding valve so that no air is released into the nasal cavity. By closing one point of the vocal tract, the air provided from the lung is stopped and compressed in the tract as shown in Figure 4(b). The released air then generates plosive consonant sounds such as /p/ and /t/.

Figure 2: Construction of Vocal Tract and Nasal Cavity

Figure 3: Structural View of Mechanical System

Figure 4: Motor Control for Nasal and Plosive Sound generation ((a) Airflow Control for Nasal Sound /n/, sliding valve open; (b) Airflow Control for Plosive Sound /p/, sliding valve closed)

4. Vocal Cords for Pitch Control

4-1. Construction of Artificial Vocal Cords

The characteristic of a glottal wave, which determines the pitch and the volume of the human voice, is governed by the complex behavior of the vocal cords. It is due to the oscillatory mechanism of human organs consisting of the mucous membrane and muscles, excited by the airflow from the lung. Although several research reports on computer simulations of these movements are available [13], we have focused on generating the wave using a mechanical model [9]. In this study, we constructed new vocal cords with two vibrating cords molded from silicone rubber with the softness of the human mucous membrane. Figure 5 shows a picture of them. The vibratory actions of the two cords are excited by the airflow led through the tube, and generate a source sound to be resonated in the vocal tract. Here, assume the simplified dynamics of the vibration of a strip of rubber of length L. The fundamental frequency f is given by the equation

\[ f = \frac{1}{2L}\sqrt{\frac{S}{D}}, \]

where D is the density of the material and S is the tension applied to the rubber. This equation implies that the fundamental frequency varies with L, S and D. The tension of the rubber can be manipulated by applying a tensile force to the two cords. Figure 6 shows schematically how the tensile force is applied to the vocal cords to generate sounds of different frequencies. Pulling the cords increases the tension, so that the frequency of the generated sound becomes higher. For voiceless sounds, simply pushing the cords leaves the gap between the two cords open, and the vibration stops. This structure makes pitch control easy, since the fundamental frequency can be changed just by applying a force that pushes or pulls the cords.

We constructed three vocal cords of different softness: hard, soft and medium. The medium one has a two-layered construction, with hard silicone inside and a soft coating outside. Figure 7 shows examples of the sound waves and spectra generated by the three vocal cords. The waveform of the hard cords is approximated by periodic pulses, and a couple of resonance peaks are found in the spectrum. The two-layered cords generate an isolated triangular waveform, which is close to the actual human one, and its power in the spectrum gradually decreases as the frequency rises. In this study, the two-layered vocal cords are employed in the mechanical voice system. Figure 8 shows the vocal cords integrated in the control mechanism.
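As a worked illustration of the relation f = (1/2L)·sqrt(S/D), the sketch below evaluates it for a few values and shows how strongly the tension must change to move the pitch. The numbers are hypothetical and are not measurements from the system. For the units to be consistent, D is treated here as mass per unit length, which is an assumption about the simplified model.

```python
import math

def string_frequency(length_m, tension_n, linear_density_kg_per_m):
    """Fundamental frequency of an ideal stretched strip: f = sqrt(S/D) / (2L)."""
    return math.sqrt(tension_n / linear_density_kg_per_m) / (2.0 * length_m)

def tension_ratio(f_low_hz, f_high_hz):
    """Factor by which S must grow to raise f at fixed L and D (f scales with sqrt(S))."""
    return (f_high_hz / f_low_hz) ** 2

# Hypothetical values, chosen only to land in a voice-like frequency range.
base = string_frequency(0.015, 0.02, 0.002)    # about 105 Hz
raised = string_frequency(0.015, 0.08, 0.002)  # four times the tension
ratio = tension_ratio(110.0, 250.0)            # about 5.17
```

Under this ideal model, quadrupling the tension doubles the frequency, and covering a range from 110 Hz to 250 Hz would need roughly a fivefold change in tension. The real cords are driven by airflow and will deviate from this idealization.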


Figure 5: Vocal cords

Figure 6: Manipulations for Pitch Changes (panels: Low pitch, High pitch 1, High pitch 2; labels: Airflow, Two cords, Gap between Cords)

Figure 7: Waveforms and Spectra of three vocal cords ((a) Hard vocal cords, (b) Soft vocal cords, (c) Two-layered vocal cords; waveform axes: Amplitude versus Time period [msec]; spectrum axes: [dB] versus [Hz])

Figure 8: Vocal cords and Control Mechanism (labels: Vocal cords, Tensile force, DC motor for force control)

Figure 9: Relation between Tensile force and Fundamental Frequency (axes: Fundamental Frequency [Hz] versus Tension of Rubber [0.01gf]; a Length of vocal cords [mm] scale also appears)

4-2. Pitch Control of Vocal Cords

Figure 9 shows experimental results of pitch changes using the two-layered vocal cords. The fundamental frequency varies from 110 Hz to 250 Hz as the force applied to the rubber is changed. The relationship between the produced frequency and the applied force is not stable but tends to change with the repetition of experiments due to fluid dynamics. The vocal cords, however, reproduce the vibratory actions of actual human vocal cords, and are also considered suitable for our system because of their simple structure. Their frequency characteristics are easily controlled by the tension of the rubber and the amount of airflow. For the fundamental frequency and volume adjustments in the voice system, two motors are used: one manipulates the screw of the airflow control valve, and the other applies a tensile force to the vocal cords to adjust the tension.
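Because the force-frequency relation drifts between runs, a fixed force setting will not hold a pitch by itself; the measured pitch has to be fed back to correct the motor command. The sketch below illustrates that idea with a simple proportional correction. It is not the authors' algorithm. The simulated cords, the drift factor, the gain and the 200 Hz target are all hypothetical, and the square-root dependence on force is borrowed from the ideal string relation as an assumption.

```python
import math

def simulated_cords(force, drift=0.93, k=110.0):
    """Hypothetical plant: pitch grows with sqrt(force); `drift` models run-to-run change."""
    return k * math.sqrt(force) * drift

def nominal_force(target_hz, k=110.0):
    """Force that would give the target pitch if there were no drift."""
    return (target_hz / k) ** 2

def tune(target_hz, steps=20, gain=0.01):
    """Correct the force step by step from the pitch error that is heard."""
    force = nominal_force(target_hz)
    history = []
    for _ in range(steps):
        measured = simulated_cords(force)
        history.append(measured)
        force += gain * (target_hz - measured)  # proportional correction
    return force, history

force, history = tune(200.0)
```

In this toy setting the first sound comes out at 186 Hz because of the drift, and the repeated corrections bring the pitch to within a fraction of a hertz of the target. The gain has to stay small enough that each correction does not overshoot.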
4-3. Adaptive Pitch Control

Neither adjusting nor maintaining the pitch of the output sound is an easy task, because the vibration mechanism is dynamic and easily disturbed by fluctuations of the tensile force and the airflow. Stable output has to be obtained no matter what kind of disturbance is applied to the system. Introducing an adaptive control mechanism would be a good solution for obtaining such robustness [10],[11]. This section introduces an adaptive tuning algorithm for producing voices of different pitches with the mechanical voice system. The algorithm consists of two phases. First, in the learning phase, the system acquires two maps that describe the relations between the motor positions and the features of the produced sounds. One is a motor-pitch map, which associates motor positions with fundamental frequencies. It is acquired by comparing the pitches of the output sounds with the desired pitches for the vocalization. The other is a motor-phoneme map, which associates the 5 motor positions with phonetic features of the produced sounds. Then, in the performance phase, the system is able to utter words by referring to the obtained maps, while the pitches and phonemes of the produced voices are adaptively maintained by hearing its own outputs. The result of the pitch learning based on the auditory feedback is shown in Figure 10, in which the system acquired the sound pitches from C to G. The system was able to acquire vocal sounds with the desired pitches.

Figure 10: Experimental Result of Pitch Tuning (Tuning Freq. and Target Freq., Frequency [Hz] versus Tuning Step Number)

5. Conclusions
In this paper, the construction of a mechanical voice system that vocalizes like a human was described. By introducing dynamic control of the vocal tract and vocal cords, the mechanical system vocalizes with changing pitches. So far, the mechanical vocal system is able to produce vowel sounds and some consonant sounds. Production of all consonant sounds and tonguing skills will be our future work toward natural voice generation that can be flexibly used in human-machine verbal communication. We are now working to develop a training device with which auditory-impaired persons can interactively train their vocalization. The mechanical system reproduces vocalization skills just by listening to actual voices. Such persons will be able to learn how to move their vocal organs by watching the motions of the mechanical system. A mechanical construction of the human vocal system is expected not only to have advantages over algorithmic synthesis methods in producing natural vocalization, but also to provide simulations of the human acquisition of speaking and singing skills. Further analyses of human learning mechanisms will contribute to the realization of a speaking robot that learns and sings like a human. The proposed approach to understanding human behavior will also open a new research area in the development of human-machine interfaces.

This work was partly supported by the Sound Technology Promotion Foundation, and also by the Grants-in-Aid for Scientific Research, the Japan Society for the Promotion of Science (No. 15700164).

References
[1] Y. Hayashi, "Koe To Kotoba No Kagaku", Houmei-do, 1979 (in Japanese)
[2] J.L. Flanagan, "Speech Analysis Synthesis and Perception", Springer-Verlag, 1972
[3] X. Rodet and G. Bennett, "Synthesis of the Singing Voice", Current Directions in Computer Music Research, MIT Press, 1989
[4] K. Hirose, "Current Trends and Future Prospects of Speech Synthesis", Journal of the Acoustical Society of Japan, pp.39-45, 1992 (in Japanese)
[5] Ph. Depalle, G. Garcia and X. Rodet, "A Virtual Castrato", Proc. ICMC, pp.357-360, 1994
[6] J.O. Smith III, "Viewpoints on the History of Digital Synthesis", Proc. ICMC, pp.1-10, 1991
[7] N. Umeda and R. Teranishi, "Phonemic Feature and Vocal Feature - Synthesis of Speech Sounds Using an Acoustic Model of Vocal Tract", Journal of the Acoustical Society of Japan, Vol.22, No.4, pp.195-203, 1966
[8] K. Nishikawa, K. Asama, K. Hayashi, H. Takanobu and A. Takanishi, "Development of a Talking Robot", Proc. IEEE/RSJ Int'l Conference on Intelligent Robots and Systems, 2000
[9] H. Sawada and S. Hashimoto, "Adaptive Control of a Vocal Chord and Vocal Tract for Computerized Mechanical Singing Instruments", Proc. ICMC, pp.444-447, 1996
[10] H. Sawada and S. Hashimoto, "Mechanical Construction of a Human Vocal System for Singing Voice Production", Advanced Robotics, International Journal of Robotics Society of Japan, Vol.13, No.7, pp.647-661, 2000
[11] H. Sawada and S. Hashimoto, "Mechanical Model of Human Vocal System and Its Control with Auditory Feedback", JSME International Journal, Series C, Vol.43, No.3, pp.645-652, 2000
[12] T. Higashimoto and H. Sawada, "Vocalization Control of a Mechanical Vocal System under the Auditory Feedback", Journal of Robotics and Mechatronics, Vol.14, No.5, pp.453-461, 2002
[13] K. Ishizaka and J. Flanagan, "Synthesis of Voiced Sounds from a Two-Mass Model of the Vocal Chords", Bell Syst. Tech. J., 50, 1223-1268, 1972