Chilin Shih
UIUC EALC/Linguistics
TTS
Text-to-Speech (TTS): Given text as input,
produces speech as output
Text → TTS → Speech
Von Kempelen’s Speaking Machine
The Voder
The first attempt at an electrical speech synthesizer
Developed by Homer Dudley in the 1930s
The Voder
Demonstrated at the 1939 World’s Fairs in New York and San Francisco
TTS Training Package
Festival
The Festival Speech Synthesis System
www.cstr.ed.ac.uk/projects/festival/
Author: Alan W. Black
Built Festival at the University of Edinburgh
Currently at CMU
Bell Labs TTS
The work reported here represents more than 10 years
of multi-lingual TTS research at Bell Labs.
Letter-to-Sound Rules: rules used where dictionary derivation fails
Phonemes (30~40)
Prosody Models: intonation and duration
Speech Synthesis: diphone and polyphone concatenation
Speech output
From Text to Speech
Hello, world.
Phone: h&l”O | w”Rld ||
Intonation: LH HL
Duration: h[7]&[10]l[9]”O[16]
SPEECH
Challenges
4 br furnished
711 W. Elm, U.
Parking included, ethernet conn. avail.
4-BDRM HOUSE
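The ad text above shows why text normalization is hard: digits, abbreviations, and punctuation all have to be expanded into speakable words before any phones can be produced. The following is a minimal sketch of that step, not the method used by any of the systems named in these slides. The abbreviation table and the function name `normalize` are hypothetical, and reading numbers digit by digit is only one possible choice.

```python
import re

# Hypothetical abbreviation table for housing ads (an assumption, not from the slides).
ABBREVIATIONS = {
    "br": "bedroom",
    "bdrm": "bedroom",
    "conn.": "connection",
    "avail.": "available",
    "w.": "west",
}

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Expand digits and known abbreviations into lowercase words."""
    words = []
    # Split on whitespace and hyphens so that "4-BDRM" becomes "4" and "BDRM".
    for token in re.split(r"[\s-]+", text.strip()):
        core = token.rstrip(",")          # drop a trailing comma
        key = core.lower()
        if key in ABBREVIATIONS:
            words.append(ABBREVIATIONS[key])
        elif core.isdigit():
            # Read numbers digit by digit (an assumption; "711" has other readings).
            words.append(" ".join(DIGITS[int(d)] for d in core))
        else:
            # Unknown tokens such as the ambiguous "U." are passed through unchanged.
            words.append(key)
    return " ".join(words)


print(normalize("4 br furnished"))
print(normalize("4-BDRM HOUSE"))
```

A lookup table like this cannot resolve context-dependent cases: "U." in the address line could stand for several words, which is exactly the kind of ambiguity that makes this stage a challenge.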
[Figure: plot over time (s), 0.284 to 1.62, labeled with the phones n I n f > r s e v & n]
Intonation Modeling
[Figure: waveform (amplitude -0.3433 to 0.312) and pitch track (75 to 350 Hz) over time (s), 0.15 to 2.93; schematic pitch contour in Hz over time for the phones h e l o]
Synthesis Methods
Rule-based approach:
• Flexible control of voice parameters such as formants
• Small footprint
• Poor voice quality
Concatenative approach:
• Requires more memory and computation
• Higher voice quality
• Hard to create new voices
Speech Production Model
Source-Filter Model
Voiced source: Pitch Period → Impulse Train → Glottal Model G(z)
Unvoiced source: Random Noise Generator
Filter: Vocal Tract Parameters
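As a rough illustration of the source-filter idea, the sketch below builds the two excitation sources (an impulse train at a given pitch period for voiced sounds, random noise for unvoiced sounds) and passes the chosen one through a filter. It makes several assumptions: the glottal model G(z) is omitted, a single one-pole filter stands in for the vocal tract, and all function names and parameter values are hypothetical.

```python
import random


def impulse_train(n_samples, pitch_period):
    """Voiced excitation: a unit impulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def noise(n_samples, seed=0):
    """Unvoiced excitation: uniform random noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def one_pole_filter(x, a):
    """Stand-in for the vocal tract filter: y[n] = x[n] + a * y[n-1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def synthesize(n_samples, voiced, pitch_period=80, a=0.9):
    """Pick the excitation source, then shape it with the filter."""
    if voiced:
        source = impulse_train(n_samples, pitch_period)
    else:
        source = noise(n_samples)
    return one_pole_filter(source, a)


voiced_frame = synthesize(400, voiced=True)      # buzz-like, periodic
unvoiced_frame = synthesize(400, voiced=False)   # hiss-like, aperiodic
```

A real synthesizer would replace the one-pole filter with a set of resonators driven by time-varying vocal tract parameters.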
Klatt Synthesizer (KLATT80)
[Figure: block diagram of the Klatt synthesizer. Voicing Source: Impulse Generator, RGP, RGZ, RGS, AV, AVS. Noise Source: Noise Generator, LPF, AH, AF. Cascade Vocal Tract Transfer Function: RNP, RNZ, R1 to R5. Parallel Vocal Tract Transfer Function: A1 to A6 with R1 to R6, AN with RNP, AB. Other blocks: SW, Diff, Radiation Characteristic.]
Digital Resonator: y(nT) = A x(nT) + B y(nT-T) + C y(nT-2T)
C = -exp(-2*pi*BW*T)
B = 2*exp(-pi*BW*T)*cos(2*pi*F*T)
A = 1 - C - B
where BW is the formant bandwidth, F is the formant frequency, and T is the sampling period.
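The digital resonator equations can be checked numerically. The sketch below computes A, B, and C from a formant frequency and bandwidth and applies the difference equation sample by sample. The function names and the example values (F = 500 Hz, BW = 60 Hz, 10 kHz sampling rate) are illustrative assumptions, not settings taken from the slides.

```python
import math


def resonator_coefficients(f, bw, fs):
    """Return (A, B, C) for formant frequency f (Hz), bandwidth bw (Hz),
    and sampling rate fs (Hz), with T = 1 / fs."""
    t = 1.0 / fs
    c = -math.exp(-2 * math.pi * bw * t)
    b = 2 * math.exp(-math.pi * bw * t) * math.cos(2 * math.pi * f * t)
    a = 1 - c - b
    return a, b, c


def resonate(x, f, bw, fs):
    """Apply y(nT) = A x(nT) + B y(nT-T) + C y(nT-2T) to the sequence x."""
    a, b, c = resonator_coefficients(f, bw, fs)
    y1 = y2 = 0.0                      # y(nT-T) and y(nT-2T)
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Example: a constant input of 1.0 settles to 1.0, because A = 1 - C - B
# gives the resonator unit gain at zero frequency.
step_response = resonate([1.0] * 2000, f=500.0, bw=60.0, fs=10000.0)
print(step_response[-1])
```

Choosing A = 1 - C - B normalizes the gain at 0 Hz to one, so several resonators can be cascaded without an extra gain correction for each stage.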
Rule-based synthesizer
1. The phonetic-to-acoustic transformation for each
phoneme in the phonetic string is not independent.
• Coarticulation: The production of one phone is
highly influenced by its neighbors.
• Hence, simply storing one example phone for
each phoneme will not produce good quality
speech synthesis.
2. Relies on good rules to describe coarticulation.
• The rules become complicated in order to achieve a certain degree of naturalness.
3. Close relationship between the model and the phonation mechanism.
Rule-based synthesizer (cont.)
Example word: dynamic
• Annotated by hand
• “Best” instance of diphone used
• Static
Units: *d dI In na am mi ik k*
Signal Postprocessing

New Approach: Large Acoustic Inventory
Linguistic Preprocessing
• Annotated by hand
• Managed programmatically
• Arbitrarily sized units
• Dynamic
Units: *d dIn nam mik*
Signal Postprocessing
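The unit strings for "dynamic" can be reproduced with a few lines of code. The sketch below splits a phone string into diphones, with "*" marking silence at the edges, and also checks that a sequence of larger units overlapping by one phone covers the same string. It assumes one character per phone; the function names are hypothetical.

```python
def to_diphones(phones, boundary="*"):
    """Split a phone sequence into diphones, padding both ends with silence."""
    padded = [boundary] + list(phones) + [boundary]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]


def join_units(units):
    """Rebuild the phone string from units that overlap by one phone."""
    result = units[0]
    for unit in units[1:]:
        if result[-1] != unit[0]:
            raise ValueError(f"units do not overlap: {result!r} + {unit!r}")
        result += unit[1:]
    return result


diphones = to_diphones("dInamik")
print(diphones)                                     # diphone-sized units
print(join_units(diphones))                         # back to the padded string
print(join_units(["*d", "dIn", "nam", "mik*"]))     # arbitrarily sized units
```

Both unit sequences rebuild the same padded phone string; the larger units simply need fewer joins, which is one motivation for a large inventory with arbitrarily sized units.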
Modification of Pitch & Duration
Pitch Synchronous Overlap Add (PSOLA)
Analysis-by-synthesis: Original, Copy
Pitch change: Lower, Low, Copy, High, Higher
Duration change: Slower, Slow, Copy, Fast, Faster
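To make the pitch and duration changes concrete, here is a heavily simplified time-domain overlap-add sketch in the spirit of PSOLA. It assumes a constant, known pitch period with pitch marks at multiples of that period, which real speech does not provide, and it does not renormalize the gain when grains are packed more or less densely. The function name and parameters are hypothetical.

```python
import math


def psola(signal, period, pitch_factor=1.0, duration_factor=1.0):
    """Overlap-add two-period grains at new pitch marks.

    pitch_factor > 1 raises the pitch; duration_factor > 1 slows the signal.
    Assumes pitch marks at multiples of `period` samples (a simplification).
    """
    # Hann window spanning two pitch periods; at the original spacing,
    # neighbouring windows sum to exactly one.
    window = [0.5 - 0.5 * math.cos(math.pi * i / period) for i in range(2 * period)]
    n_marks = len(signal) // period
    out_len = int(round(len(signal) * duration_factor))
    new_period = max(1, int(round(period / pitch_factor)))
    out = [0.0] * out_len

    t = new_period                       # first synthesis pitch mark
    while t < out_len:
        # Analysis pitch mark closest to the corresponding input time.
        k = int(round(t / duration_factor / period))
        k = min(max(k, 1), n_marks - 1)
        start = (k - 1) * period         # grain covers marks k-1 to k+1
        for i in range(2 * period):
            src, dst = start + i, t - period + i
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                out[dst] += window[i] * signal[src]
        t += new_period
    return out


# Example: a sine wave with a 50-sample period.
tone = [math.sin(2 * math.pi * n / 50) for n in range(500)]
copy = psola(tone, 50)                           # "Copy"
higher = psola(tone, 50, pitch_factor=2.0)       # "Higher"
slower = psola(tone, 50, duration_factor=2.0)    # "Slower"
```

With both factors at 1.0 the grains are put back where they came from, so the output matches the input away from the edges. Changing the spacing of the synthesis marks changes the pitch, and repeating or skipping grains changes the duration.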
Synthesis Methods Comparisons
                Rule-based       Concatenation                   Concatenation
                (Formant)        (LPC)                           (PSOLA)
Analysis        Semi-automatic   Automatic                       Automatic
Voice quality   Poor             Medium                          High
Concatenation   Simple           Simple (smoothing parameters)   Poor (pitch, phase mismatches)
[Figure: four f0 contours; f0 axis from 50.0 to 200.0, time (s) axis from 0.0 to 0.4]
Typically, people classify surface variations
into categories, or templates.
Pre-planning.
Communication constraints:
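[Equations: “Effort” G, an integral over time of squared terms in the pitch p; “Error” r_i, an integral over the time span of target i of squared differences between the pitch p and the target y and between their averages.]
Here y = y_i(t) is the ith pitch target, and a bar denotes an average over a target.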
Intonation Model