
Text-to-Speech Synthesis

Chilin Shih
UIUC EALC/Linguistics
TTS
Text-to-Speech (TTS): Given text as input,
produces speech as output

Text → TTS → Speech
Von Kempelen’s Speaking Machine
The Voder
The first attempt at an electrical speech synthesizer
Developed by Homer Dudley in the 1930s
The Voder
Demonstrated at the 1939 World's Fair in New York and San Francisco
The Voder
TTS Training Package
Festival
The Festival Speech Synthesis System
www.cstr.ed.ac.uk/projects/festival/
Author: Alan W. Black
Built Festival at the University of Edinburgh
Currently at CMU
Bell Labs TTS
The work reported here represents more than 10 years of multi-lingual TTS research at Bell Labs.
• Text analysis: Richard Sproat
• Prosody: Chilin Shih, Greg Kochanski, Julia Hirschberg, Jan van Santen
• Signal processing: Joe Olive, Minkyu Lee, Luis Oliveira
• Search algorithm and graphic interface: Dan Lopresti
• Product development: Mike Tanenblatt
• And numerous others on language development
Current TTS systems
Google: Built by DeepMind, using WaveNet to train TTS with Google's TTS datasets
Apple: Siri was built by SRI International and Nuance Communications
Amazon: Amazon Alexa and Echo were built in house
Many other companies are also building their own speech recognition and text-to-speech synthesis systems
Many companies are actively building language technologies for many languages
Linguistlist.org has built many best-practice examples of using language technologies to help linguists do fieldwork and preserve endangered languages
Text-to-Speech blocks
Text (input)
• Text Normalization: ASCII text, e.g. "Dr. Smith lives at 111 Smith Dr."
• Syntactic/Semantic Parser: "lives" (verb) vs. "lives" (noun)
• Dictionary: pronunciation dictionary, morphemic decomposition, rhyming
• Letter-To-Sound Rules: used where dictionary derivation fails; phonemes (30~40)
• Prosody Models: intonation and duration
• Speech Synthesis: diphone and polyphone concatenation
Speech output
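A minimal Python sketch of how these blocks chain together; the helper names, the toy lexicon, and the rule stubs are illustrative assumptions, not the Festival or Bell Labs implementation.

LEXICON = {"hello": 'h&l"O', "world": 'w"Rld'}      # toy pronunciation dictionary

def normalize(text):
    """Text normalization: expand abbreviations, strip punctuation."""
    abbrev = {"dr.": "doctor"}                      # context decides Doctor vs. Drive
    return [abbrev.get(w.lower(), w.lower().strip(".,")) for w in text.split()]

def letter_to_sound(word):
    """Stub letter-to-sound rules, used when dictionary lookup fails."""
    return word                                     # real systems apply rule cascades

def to_phones(word):
    return LEXICON.get(word, letter_to_sound(word))

def add_prosody(phone_strings):
    """Attach toy duration (frames) and tone labels; real models predict these."""
    return [(p, 10, "H") for p in phone_strings]

def front_end(text):
    words = normalize(text)
    return add_prosody([to_phones(w) for w in words])   # handed to the synthesis block

print(front_end("Hello, world."))
# [('h&l"O', 10, 'H'), ('w"Rld', 10, 'H')]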
From Text to Speech
Hello, world.

Phone: h&l”O | w”Rld ||
Intonation: LH HL
Duration: h[7]&[10]l[9]”O[16]

SPEECH
Challenges

A TTS system needs to predict everything that is not written.
Written text is not a very good representation of speech!
Text Analysis
Convert written text into phone strings
text → tekst
speech → spEC
Number and abbreviation expansion
3.6 → three point six
Ave → Avenue
With instruction to pause
Go 3.6 miles and turn left onto University Ave.
→ Go three point six miles || and turn left | onto University Avenue.
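A small sketch of this expansion step for the example sentence; the abbreviation table, the digit-by-digit decimal reading, and the regular expression are simplifying assumptions, and real normalizers handle far more cases (including the ambiguities on the next slide).

import re

ABBREV = {"Ave": "Avenue", "St": "Street"}          # context-dependent in practice (St./Saint, Dr./Doctor)
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def expand_decimal(tok):
    """'3.6' -> 'three point six' (digits read out one by one)."""
    whole, frac = tok.split(".")
    return " ".join([DIGITS[d] for d in whole] + ["point"] + [DIGITS[d] for d in frac])

def normalize(sentence):
    out = []
    for tok in sentence.split():
        bare = tok.rstrip(".,")
        if re.fullmatch(r"\d+\.\d+", bare):
            out.append(expand_decimal(bare))
        elif bare in ABBREV:
            out.append(ABBREV[bare])
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("Go 3.6 miles and turn left onto University Ave."))
# Go three point six miles and turn left onto University Avenue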
Ambiguity
Number expansion
1234, 1.2, $3.40, 8/9, 1/2
• See me on 1/2 at 10 am
WWII, Henry VIII
Abbreviation
1 ft, 2 ft, St. John St., 5%
Acronym
AA, AAA, UIUC, SUNY
Homograph
Playing bass / fishing for bass
Spontaneous Abbreviation
From Daily Illini:
Interpret with possible local place names
Match abbreviation with likely rental/housing vocabulary

4 br furnished
711 W. Elm, U.
Parking included, ethernet conn. avail.

4-BDRM HOUSE

505 E Clark, C. - EFF


Morphology & Phonology
English plural s
[s] after voiceless consonants
[z] after voiced consonants
[&z] after [s] or [z]
Russian % (percent) has scores of perfectly regular possible expansions, based on
Number (1% != 2% != 5%)
Case (5% != 5% discount != with 5% discount)
Gender
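The plural-s rule above is easy to state as code; a minimal sketch, where the phone symbols and the voiceless set are illustrative assumptions over a toy inventory.

VOICELESS = {"p", "t", "k", "f", "T"}       # toy set of voiceless consonants

def plural(phones):
    """Append the regular plural suffix to a word given as a list of phones."""
    last = phones[-1]
    if last in {"s", "z"}:                  # (fuller rules also cover sh, ch, zh, j)
        return phones + ["&", "z"]          # [&z] after [s] or [z]
    if last in VOICELESS:
        return phones + ["s"]               # [s] after voiceless consonants
    return phones + ["z"]                   # [z] after voiced sounds

print(plural(["k", "@", "t"]))              # cat  -> ... [s]
print(plural(["d", ">", "g"]))              # dog  -> ... [z]
print(plural(["b", "^", "s"]))              # bus  -> ... [&z]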
Name Pronunciation
Dictionary is not enough
Guessing the origin
In Spanish, [j] is [h], [ll] is [y]
• Jose, La Jolla
In India, [dh] is [d], [gh] is [g]
• Sondhi, Raghaven
How do you say Rajan, Beijing?
Variations
Bernstein, Marcia
Prosody
Prosody is also not written and needs to be predicted.
The prosody component typically includes duration prediction and pitch prediction (tone and intonation) with loudness modification. Recent systems may also include voice quality and emotion modification.
Duration Prediction
Each sound has a duration value which varies with phone identity, context, position, stress, …
[Figure: waveform with phone labels n I n f > r s e v & n, time 0.284–1.62 s]
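A toy sketch of table-driven duration prediction; the base durations and scaling factors below are made-up illustrations, whereas real models (e.g. sums-of-products models) are fit to labeled speech.

# Assumed base durations (ms) per phone; real values come from data.
BASE_MS = {"n": 60, "I": 80, "f": 90, ">": 110, "r": 70, "s": 100, "e": 90, "v": 60, "&": 50}

def predict_duration(phone, stressed=False, phrase_final=False):
    d = BASE_MS.get(phone, 80)
    if stressed:
        d *= 1.3            # stressed segments lengthen
    if phrase_final:
        d *= 1.5            # phrase-final lengthening
    return d

print(predict_duration("I", stressed=True))        # 104.0
print(predict_duration("&", phrase_final=True))    # 75.0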
Intonation Modeling
[Figure: waveform and f0 contour (75–350 Hz), time 0.15–2.93 s]
Generating an appropriate intonation contour from text.


Speech Synthesis for TTS

Inputs: phone sequence, duration, pitch
Speech Synthesis Block → acoustic speech
[Figure: pitch contour (Hz vs. time) over the phones h e l o]
Synthesis Methods
Rule-based approach:
• Flexible control of voice parameters such as formants
• Small footprint
• Poor voice quality
Concatenative approach:
• Requires larger memory and more computation
• Higher voice quality
• Hard to create new voices
Speech Production Model
[Block diagram: an impulse train generator (spaced by the pitch period) drives a glottal model G(z); a random noise generator supplies the unvoiced source. Both feed the vocal tract model V(z), controlled by vocal tract parameters, followed by the radiation model R(z).]
Source-Filter Model
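A numerical sketch of the source-filter idea: a pulse train or noise source, crude glottal shaping, one vocal tract resonance, and radiation approximated by a first difference. The sample rate, f0, formant values, and filter coefficients are illustrative assumptions.

import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 120                      # assumed sample rate and pitch (Hz)
n = fs // 2                              # half a second of samples

# Sources: impulse train (voiced) and white noise (unvoiced).
voiced = np.zeros(n)
voiced[::fs // f0] = 1.0
noise = 0.1 * np.random.randn(n)

# Glottal model G(z): crude low-pass smoothing of the pulse train.
glottal = lfilter([1.0], [1.0, -0.95], voiced)

# Vocal tract model V(z): a single resonance (formant) at F with bandwidth BW.
F, BW = 500.0, 80.0
C = -np.exp(-2 * np.pi * BW / fs)
B = 2 * np.exp(-np.pi * BW / fs) * np.cos(2 * np.pi * F / fs)
A = 1 - B - C
tract = lfilter([A], [1.0, -B, -C], glottal + noise)

# Radiation model R(z): approximated by a first difference.
speech = np.diff(tract)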
Klatt Synthesizer (KLATT80)
[Block diagram: a voicing source (impulse generator with glottal resonators RGP, RGZ, RGS; amplitudes AV, AVS) and a noise source (noise generator, LPF; amplitudes AH, AF) drive a cascade vocal tract transfer function (resonators R1–R5, nasal pole/zero RNP, RNZ) and a parallel vocal tract transfer function (resonators R1–R6 with amplitude controls A1–A6, AN, AB), followed by the radiation characteristic.]
Digital Resonator: y(nT) = A*x(nT) + B*y(nT-T) + C*y(nT-2T)
  C = -exp(-2*pi*BW*T)
  B = 2*exp(-pi*BW*T)*cos(2*pi*F*T)
  A = 1 - C - B
where BW is the formant bandwidth and F is the formant frequency.
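The resonator equation above translates directly into code, and cascading a few of them gives a crude cascade vocal-tract branch. The formant frequencies, bandwidths, and the input pulse train below are illustrative assumptions, not KLATT80 parameter values.

import numpy as np

def resonator(x, F, BW, fs):
    """Second-order digital resonator: y(n) = A*x(n) + B*y(n-1) + C*y(n-2)."""
    T = 1.0 / fs
    C = -np.exp(-2 * np.pi * BW * T)
    B = 2 * np.exp(-np.pi * BW * T) * np.cos(2 * np.pi * F * T)
    A = 1 - C - B
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = A * x[n] + B * (y[n - 1] if n >= 1 else 0) + C * (y[n - 2] if n >= 2 else 0)
    return y

fs = 10000
source = np.zeros(fs // 10)
source[::fs // 100] = 1.0                                  # 100 Hz pulse train
out = source
for F, BW in [(500, 60), (1500, 90), (2500, 150)]:         # assumed vowel-like formants
    out = resonator(out, F, BW, fs)                        # cascade branch: R1, R2, R3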
Rule-based synthesizer
1. The phonetic-to-acoustic transformation for each phoneme in the phonetic string is not independent.
• Coarticulation: the production of one phone is highly influenced by its neighbors.
• Hence, simply storing one example phone for each phoneme will not produce good quality speech synthesis.
2. Relies on good rules to describe coarticulation.
• Rules become complicated in order to achieve a certain degree of naturalness.
3. Close relationship between the model and the phonation mechanism.
Rule-based synthesizer (cont.)
4. Our understanding of the rules is not sophisticated enough for high quality synthetic speech.
5. Large number of control parameters required.
6. Poor speech quality (buzzy), but small footprint.
7. Very flexible for voice conversion.
8. DECtalk (KlattSyn), MITalk, INFOVOX, etc.
Concatenative synthesizer
1. Concatenative synthesis
• Glues pre-recorded speech segments together
• Coarticulation information is stored in these units
• Possible units include syllables, triphones, or diphones
2. Need to modify prosody (pitch and duration) according to the target prosody
• Requires a large database for higher voice quality
3. Most commercial/research TTS systems are concatenative systems
Diphone-based Approach
Input "dynamic" → Linguistic Preprocessing → phone string dInamik
Target-to-Unit Mapping against a Diphone Table (*d, am, dI, ik, In, k*, mi, na, ...)
Selected diphone sequence: *d dI In na am mi ik k* → Signal Postprocessing
• Annotated by hand
• "Best" instance of each diphone used
• Static
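Cutting the phone string into diphones is mechanical; a minimal sketch for the example above, with '*' marking silence at the word edges as on the slide.

def to_diphones(phones):
    """Pair each phone with its neighbor; '*' marks silence at the edges."""
    padded = ["*"] + phones + ["*"]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

print(to_diphones(list("dInamik")))
# ['*d', 'dI', 'In', 'na', 'am', 'mi', 'ik', 'k*']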
New Approach
Input "dynamic" → Linguistic Preprocessing → phone string dInamik
Optimal Unit Selection from a Large Acoustic Inventory (... *d ..., ... dIn ..., ... nam ..., ... mik* ...)
Selected unit sequence: *d dIn nam mik* → Signal Postprocessing
• Annotated by hand
• Managed programmatically
• Arbitrarily sized units
• Dynamic
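A generic dynamic-programming sketch of optimal unit selection, not the specific Bell Labs search: each target has several stored candidates, and the chosen sequence minimizes the sum of target costs and join costs. The cost functions and the toy candidates are assumptions.

def select_units(targets, candidates, target_cost, join_cost):
    """candidates[i] lists the stored units that could realize targets[i]."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j of target i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            prev = min(range(len(candidates[i - 1])),
                       key=lambda k: best[i - 1][k][0] + join_cost(candidates[i - 1][k], c))
            row.append((best[i - 1][prev][0] + join_cost(candidates[i - 1][prev], c) + tc, prev))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])   # cheapest final candidate
    path = [j]
    for i in range(len(targets) - 1, 0, -1):                      # trace back
        j = best[i][j][1]
        path.append(j)
    return [candidates[i][j] for i, j in enumerate(reversed(path))]

# Toy usage: units are (name, pitch); costs penalize pitch mismatches and jumps.
cands = [[("dIn", 110), ("dIn", 150)], [("nam", 120)], [("mik*", 118), ("mik*", 90)]]
print(select_units([115, 120, 118], cands,
                   target_cost=lambda t, c: abs(t - c[1]),
                   join_cost=lambda a, b: abs(a[1] - b[1])))
# [('dIn', 110), ('nam', 120), ('mik*', 118)]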
Modification of Pitch & Duration
Pitch Synchronous Overlap Add (PSOLA)

Requires highly accurate pitch markers: needs a separate EGG signal recording or manual pitch marking.
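A heavily simplified TD-PSOLA sketch for pitch modification, assuming the pitch marks are already known (e.g. from the EGG recording mentioned above); the window choice and mark handling are simplifications, not a production algorithm.

import numpy as np

def psola_pitch(signal, marks, factor):
    """Raise (factor > 1) or lower (factor < 1) f0 at roughly constant duration.
    `marks` are sample indices of glottal pulses, assumed accurate."""
    period = int(np.median(np.diff(marks)))
    out = np.zeros(len(signal))
    t = int(marks[0])                                   # synthesis time
    while t < len(signal) - period:
        m = min(marks, key=lambda mk: abs(mk - t))      # nearest analysis mark
        lo, hi = max(m - period, 0), min(m + period, len(signal))
        frame = signal[lo:hi] * np.hanning(hi - lo)     # two-period Hann window
        start = t - (m - lo)
        if 0 <= start and start + (hi - lo) <= len(out):
            out[start:start + (hi - lo)] += frame       # overlap-add at the new mark
        t += max(int(period / factor), 1)               # closer marks -> higher pitch
    return out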
Demonstration – Prosody Modification

Analysis-by-synthesis
Original | Copy
Pitch change: Lower | Low | Copy | High | Higher
Duration change: Slower | Slow | Copy | Fast | Faster
Synthesis Methods Comparisons

               Rule-based       Concatenation
               Formant          LPC                   PSOLA
Analysis       Semi-automatic   Automatic             Automatic
Voice quality  Poor             Medium                High
Concatenation  Simple           Simple (smoothing     Poor (pitch, phase
                                parameters)           mismatches)
Computation    Low              Medium                Low
DB Size        Small            Medium                Large


Challenges
1. Human-like sound quality.
 • Improved analysis/synthesis techniques (waveform, PSOLA, RELP, LPC, formants).
 • Inventory selection, concatenation smoothing.
 • Rule-derived parameters.
2. Natural prosody.
 • Modeling of segment duration, intonation, and loudness.
 • Syntactic analysis for proper phrasing.
 • Semantic analysis for stress and focus.
3. Text processing.
 • Expansion of numerals, symbols, abbreviations.
 • Homograph disambiguation.
TTS Research Development
Two examples:
Prosody modeling
Voice conversion
Goal:
Naturalness
Expressiveness
Flexibility
Research direction:
Explain speech from first principles
Find the right parameters
Prosody Modeling

Chilin Shih (UIUC)
Greg Kochanski (Oxford, Google)
Chinese Lexical Tones
Tonal Variations

[Figure: f0 (Hz) contours over time (s) for the four Mandarin tones.
Tone 1: High Level    Tone 2: Rising
Tone 3: Low Falling-(Rising)    Tone 4: High Falling]
Typically, people classify surface variations into categories, or templates.
Different templates are chosen for different occasions.
A neural network model of Mandarin tones may use more than 100 templates.
It misses the connection among templates.
And it still has problems accounting for the data.
The Challenge: Tonal Distortion
People Talk Nearly As Fast As Possible
Basic Assumptions Used in Modeling

Pre-planning.
Balance articulatory effort and communication needs (Lindblom, Ohala).
A dynamical model for the muscles that control f0 (Hill).
Physiological constraints:
Communication constraints:
• When things are important:
• When things are not important:
We further propose:
• Speakers shift weights dynamically as they speak.
• This is the prosodic strength.
Modeling Math
p(t) = \arg\min_{p(t)} (G + R)

"Effort":  G = \int dt \, ( \dot{p}^2 + \alpha^2 \ddot{p}^2 + \beta^2 p^2 )

p(t) is the muscle tension (~frequency) at time t.

"Error":  R = \sum_{i \in \mathrm{targets}} s_i^2 r_i
Each target encodes some linguistic information; r_i is the error of the i-th target, and s_i is its importance.

r_i = \int_{\mathrm{target}_i} dt \, [ ((p - \bar{p}) - (y - \bar{y}))^2 + \lambda (\bar{p} - \bar{y})^2 ]

y = y_i(t) is the i-th pitch target, and a bar denotes an average over a target.
Intonation Model

Prosody modeling is based on Stem-ML (Soft Template Mark-up Language).
A set of mathematically defined tags with value attributes (tag value).
For example: a Tone tag with a prosodic strength value.
Allows user-defined accent shapes, phrase curves, and other speaker-specific parameters.
Generation: tag value → F0
Learning: tag, F0 → tag value
The Rest of the Model.
A model is a sequence of templates (i.e. points representing tone/accent shapes).
Each template has a strength.
For tone languages, there is one template per tone.
Templates are stretched to fit duration.
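Stretching a template to fit a duration can be as simple as interpolation; the template points below are illustrative, not the fitted Mandarin tone templates.

import numpy as np

falling_template = np.array([1.0, 0.9, 0.4, 0.1])        # assumed high-falling shape

def stretch(template, n_frames):
    """Resample a few template points onto n_frames evenly spaced frames."""
    x_old = np.linspace(0.0, 1.0, len(template))
    x_new = np.linspace(0.0, 1.0, n_frames)
    return np.interp(x_new, x_old, template)

print(stretch(falling_template, 9))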
Representing F0 As Tone Strength
Model Fits Over a Range of Speeds.
Fun Applications

Imitation of prosodic styles
Martin Luther King Jr.
TTS simulation
Learning music embellishment
Embellishment from Dinah Shore's Bicycle Built for Two
TTS simulation
Moving embellishment
TTS concert: voice conversion, amplitude modification, embellishment, stylistic rules
Applications of Prosody Modeling and Voice Conversion
• Books on tape
• Entertainment industry
  Reviving old actors
  Voices for animation
  Commercials
• Game developers & toy manufacturers
• Virtual world developers
• Assistive technologies: generating voices for people with difficulty speaking
The Future of TTS
Better DSP
Retain voice quality while changing intonation, duration, amplitude, and source.
Prosody modeling
Modeling natural duration and intonation.
Common theme
Survey language phenomena—study what the speaker wants to communicate.
Find the best parameter to control variations.
Working in all domains: human and machine.