
Text-to-Speech Synthesis

Chilin Shih
UIUC EALC/Linguistics
TTS
Text-to-Speech (TTS): Given text as input,
produces speech as output

Text → TTS → Speech
Von Kempelen’s Speaking Machine
The Voder
The first attempt at an electrical speech synthesizer
Developed by Homer Dudley in the 1930s
The Voder
Demonstrated at the 1939 World's Fair in New York and San Francisco
The Voder
TTS Training Package
Festival
The Festival Speech Synthesis System
www.cstr.ed.ac.uk/projects/festival/
Author: Alan W. Black
Built Festival at the University of Edinburgh
Currently at CMU
Bell Labs TTS
The work reported here represents more than 10 years of multi-lingual TTS research at Bell Labs.
• Text analysis: Richard Sproat
• Prosody: Chilin Shih, Greg Kochanski, Julia Hirschberg, Jan van Santen
• Signal processing: Joe Olive, Minkyu Lee, Luis Oliveira
• Search algorithm and graphic interface: Dan Lopresti
• Product development: Mike Tanenblatt
• And numerous others on language development
Current TTS systems
Google: Built by DeepMind, using WaveNet to train TTS with Google's TTS datasets
Apple: Siri was built by SRI International and Nuance Communications
Amazon: Amazon Alexa and Echo were built in house
Many other companies are also building their own speech recognition and text-to-speech synthesis systems
Many companies are actively building language technologies for many languages
Linguistlist.org has built many best-practice examples of using language technologies to help linguists do fieldwork and preserve endangered languages
Text-to-Speech blocks
Text (input)
• Text Normalization: ASCII text, e.g. "Dr. Smith lives at 111 Smith Dr."
• Syntactic/Semantic Parser: "lives" (verb) vs. "lives" (noun)
• Dictionary: pronunciation dictionary, morphemic decomposition, rhyming
• Letter-To-Sound Rules: used where dictionary derivation fails; phonemes (30~40)
• Prosody Models: intonation and duration
• Speech Synthesis: diphone and polyphone concatenation
Speech output
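A minimal Python sketch of how these blocks chain together; the helper names, the toy lexicon, and the rule stubs are illustrative assumptions, not the Festival or Bell Labs implementation.

LEXICON = {"hello": 'h&l"O', "world": 'w"Rld'}      # toy pronunciation dictionary

def normalize(text):
    """Text normalization: expand abbreviations, strip punctuation."""
    abbrev = {"dr.": "doctor"}                      # context decides Doctor vs. Drive
    return [abbrev.get(w.lower(), w.lower().strip(".,")) for w in text.split()]

def letter_to_sound(word):
    """Stub letter-to-sound rules, used when dictionary lookup fails."""
    return word                                     # real systems apply rule cascades

def to_phones(word):
    return LEXICON.get(word, letter_to_sound(word))

def add_prosody(phone_strings):
    """Attach toy duration (frames) and tone labels; real models predict these."""
    return [(p, 10, "H") for p in phone_strings]

def front_end(text):
    words = normalize(text)
    return add_prosody([to_phones(w) for w in words])   # handed to the synthesis block

print(front_end("Hello, world."))
# [('h&l"O', 10, 'H'), ('w"Rld', 10, 'H')]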
From Text to Speech
Hello, world.

Phone: h&l”O | w”Rld ||
Intonation: LH HL
Duration: h[7]&[10]l[9]”O[16]

SPEECH
Challenges

A TTS system needs to predict everything that is not written.
Written text is not a very good representation of speech!
Text Analysis
Convert written text into phone strings
text → tekst
speech → spEC
Number and abbreviation expansion
3.6 → three point six
Ave → Avenue
With instruction to pause
Go 3.6 miles and turn left onto University Ave.
→ Go three point six miles || and turn left | onto University Avenue.
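A small sketch of this expansion step for the example sentence; the abbreviation table, the digit-by-digit decimal reading, and the regular expression are simplifying assumptions, and real normalizers handle far more cases (including the ambiguities on the next slide).

import re

ABBREV = {"Ave": "Avenue", "St": "Street"}          # context-dependent in practice (St./Saint, Dr./Doctor)
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def expand_decimal(tok):
    """'3.6' -> 'three point six' (digits read out one by one)."""
    whole, frac = tok.split(".")
    return " ".join([DIGITS[d] for d in whole] + ["point"] + [DIGITS[d] for d in frac])

def normalize(sentence):
    out = []
    for tok in sentence.split():
        bare = tok.rstrip(".,")
        if re.fullmatch(r"\d+\.\d+", bare):
            out.append(expand_decimal(bare))
        elif bare in ABBREV:
            out.append(ABBREV[bare])
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("Go 3.6 miles and turn left onto University Ave."))
# Go three point six miles and turn left onto University Avenue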
Ambiguity
Number expansion
1234, 1.2, $3.40, 8/9, 1/2
• See me on 1/2 at 10 am
WWII, Henry VIII
Abbreviation
1 ft, 2 ft, St. John St., 5%
Acronym
AA, AAA, UIUC, SUNY
Homograph
Playing bass / fishing for bass
Spontaneous Abbreviation
From Daily Illini:
Interpret with possible local place names
Match abbreviation with likely rental/housing vocabulary

4 br furnished
711 W. Elm, U.
Parking included, ethernet conn. avail.

4-BDRM HOUSE

505 E Clark, C. - EFF


Morphology & Phonology
English plural s
[s] after voiceless consonants
[z] after voiced consonants
[&z] after [s] or [z]
Russian % (percent) has scores of perfectly regular possible expansions, based on
Number (1% != 2% != 5%)
Case (5% != 5% discount != with 5% discount)
Gender
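The plural-s rule above is easy to state as code; a minimal sketch, where the phone symbols and the voiceless set are illustrative assumptions over a toy inventory.

VOICELESS = {"p", "t", "k", "f", "T"}       # toy set of voiceless consonants

def plural(phones):
    """Append the regular plural suffix to a word given as a list of phones."""
    last = phones[-1]
    if last in {"s", "z"}:                  # (fuller rules also cover sh, ch, zh, j)
        return phones + ["&", "z"]          # [&z] after [s] or [z]
    if last in VOICELESS:
        return phones + ["s"]               # [s] after voiceless consonants
    return phones + ["z"]                   # [z] after voiced sounds

print(plural(["k", "@", "t"]))              # cat  -> ... [s]
print(plural(["d", ">", "g"]))              # dog  -> ... [z]
print(plural(["b", "^", "s"]))              # bus  -> ... [&z]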
Name Pronunciation
Dictionary is not enough
Guessing the origin
In Spanish, [j] is [h], [ll] is [y]
• Jose, La Jolla
In India, [dh] is [d], [gh] is [g]
• Sondhi, Raghaven
How do you say Rajan, Beijing?
Variations
Bernstein, Marcia
Prosody
Prosody is also not written and needs to be predicted.
The prosody component typically includes duration prediction and pitch prediction (tone and intonation) with loudness modification. Recent systems may also include voice quality and emotion modification.
Duration Prediction
Each sound has a duration value which varies with phone identity, context, position, stress, …
[Figure: waveform with phone labels n I n f > r s e v & n, time 0.284–1.62 s]
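A toy sketch of table-driven duration prediction; the base durations and scaling factors below are made-up illustrations, whereas real models (e.g. sums-of-products models) are fit to labeled speech.

# Assumed base durations (ms) per phone; real values come from data.
BASE_MS = {"n": 60, "I": 80, "f": 90, ">": 110, "r": 70, "s": 100, "e": 90, "v": 60, "&": 50}

def predict_duration(phone, stressed=False, phrase_final=False):
    d = BASE_MS.get(phone, 80)
    if stressed:
        d *= 1.3            # stressed segments lengthen
    if phrase_final:
        d *= 1.5            # phrase-final lengthening
    return d

print(predict_duration("I", stressed=True))        # 104.0
print(predict_duration("&", phrase_final=True))    # 75.0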
Intonation Modeling
[Figure: waveform and f0 contour (75–350 Hz), time 0.15–2.93 s]
Generating an appropriate intonation contour from text.


Speech Synthesis for TTS

Inputs: phone sequence, duration, pitch
Speech Synthesis Block → acoustic speech
[Figure: pitch contour (Hz vs. time) over the phones h e l o]
Synthesis Methods
Rule-based approach:
• Flexible control of voice parameters such as formants
• Small footprint
• Poor voice quality
Concatenative approach:
• Requires larger memory and more computation
• Higher voice quality
• Hard to create new voices
Speech Production Model
[Block diagram: an impulse train generator (spaced by the pitch period) drives a glottal model G(z); a random noise generator supplies the unvoiced source. Both feed the vocal tract model V(z), controlled by vocal tract parameters, followed by the radiation model R(z).]
Source-Filter Model
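A numerical sketch of the source-filter idea: a pulse train or noise source, crude glottal shaping, one vocal tract resonance, and radiation approximated by a first difference. The sample rate, f0, formant values, and filter coefficients are illustrative assumptions.

import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 120                      # assumed sample rate and pitch (Hz)
n = fs // 2                              # half a second of samples

# Sources: impulse train (voiced) and white noise (unvoiced).
voiced = np.zeros(n)
voiced[::fs // f0] = 1.0
noise = 0.1 * np.random.randn(n)

# Glottal model G(z): crude low-pass smoothing of the pulse train.
glottal = lfilter([1.0], [1.0, -0.95], voiced)

# Vocal tract model V(z): a single resonance (formant) at F with bandwidth BW.
F, BW = 500.0, 80.0
C = -np.exp(-2 * np.pi * BW / fs)
B = 2 * np.exp(-np.pi * BW / fs) * np.cos(2 * np.pi * F / fs)
A = 1 - B - C
tract = lfilter([A], [1.0, -B, -C], glottal + noise)

# Radiation model R(z): approximated by a first difference.
speech = np.diff(tract)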
Klatt Synthesizer (KLATT80)
[Block diagram: a voicing source (impulse generator with glottal resonators RGP, RGZ, RGS; amplitudes AV, AVS) and a noise source (noise generator, LPF; amplitudes AH, AF) drive a cascade vocal tract transfer function (resonators R1–R5, nasal pole/zero RNP, RNZ) and a parallel vocal tract transfer function (resonators R1–R6 with amplitude controls A1–A6, AN, AB), followed by the radiation characteristic.]
Digital Resonator: y(nT) = A*x(nT) + B*y(nT-T) + C*y(nT-2T)
  C = -exp(-2*pi*BW*T)
  B = 2*exp(-pi*BW*T)*cos(2*pi*F*T)
  A = 1 - C - B
where BW is the formant bandwidth and F is the formant frequency.
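The resonator equation above translates directly into code, and cascading a few of them gives a crude cascade vocal-tract branch. The formant frequencies, bandwidths, and the input pulse train below are illustrative assumptions, not KLATT80 parameter values.

import numpy as np

def resonator(x, F, BW, fs):
    """Second-order digital resonator: y(n) = A*x(n) + B*y(n-1) + C*y(n-2)."""
    T = 1.0 / fs
    C = -np.exp(-2 * np.pi * BW * T)
    B = 2 * np.exp(-np.pi * BW * T) * np.cos(2 * np.pi * F * T)
    A = 1 - C - B
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = A * x[n] + B * (y[n - 1] if n >= 1 else 0) + C * (y[n - 2] if n >= 2 else 0)
    return y

fs = 10000
source = np.zeros(fs // 10)
source[::fs // 100] = 1.0                                  # 100 Hz pulse train
out = source
for F, BW in [(500, 60), (1500, 90), (2500, 150)]:         # assumed vowel-like formants
    out = resonator(out, F, BW, fs)                        # cascade branch: R1, R2, R3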
Rule-based synthesizer
1. The phonetic-to-acoustic transformation for each phoneme in the phonetic string is not independent.
• Coarticulation: the production of one phone is highly influenced by its neighbors.
• Hence, simply storing one example phone for each phoneme will not produce good quality speech synthesis.
2. Relies on good rules to describe coarticulation.
• Rules become complicated in order to achieve a certain degree of naturalness.
3. Close relationship between the model and the phonation mechanism.
Rule-based synthesizer (cont.)
4. Our understanding of the rules is not sophisticated enough for high quality synthetic speech.
5. Large number of control parameters required.
6. Poor speech quality (buzzy), but small footprint.
7. Very flexible for voice conversion.
8. DECtalk (KlattSyn), MITalk, INFOVOX, etc.
Concatenative synthesizer
1. Concatenative synthesis
• Glues pre-recorded speech segments together
• Coarticulation information is stored in these units
• Possible units include syllables, triphones, or diphones
2. Need to modify prosody (pitch and duration) according to the target prosody
• Requires a large database for higher voice quality
3. Most commercial/research TTS systems are concatenative systems
Diphone-based Approach
Input "dynamic" → Linguistic Preprocessing → phone string dInamik
Target-to-Unit Mapping against a Diphone Table (*d, am, dI, ik, In, k*, mi, na, ...)
Selected diphone sequence: *d dI In na am mi ik k* → Signal Postprocessing
• Annotated by hand
• "Best" instance of each diphone used
• Static
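Cutting the phone string into diphones is mechanical; a minimal sketch for the example above, with '*' marking silence at the word edges as on the slide.

def to_diphones(phones):
    """Pair each phone with its neighbor; '*' marks silence at the edges."""
    padded = ["*"] + phones + ["*"]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

print(to_diphones(list("dInamik")))
# ['*d', 'dI', 'In', 'na', 'am', 'mi', 'ik', 'k*']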
New Approach
Input "dynamic" → Linguistic Preprocessing → phone string dInamik
Optimal Unit Selection from a Large Acoustic Inventory (... *d ..., ... dIn ..., ... nam ..., ... mik* ...)
Selected unit sequence: *d dIn nam mik* → Signal Postprocessing
• Annotated by hand
• Managed programmatically
• Arbitrarily sized units
• Dynamic
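A generic dynamic-programming sketch of optimal unit selection, not the specific Bell Labs search: each target has several stored candidates, and the chosen sequence minimizes the sum of target costs and join costs. The cost functions and the toy candidates are assumptions.

def select_units(targets, candidates, target_cost, join_cost):
    """candidates[i] lists the stored units that could realize targets[i]."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j of target i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            prev = min(range(len(candidates[i - 1])),
                       key=lambda k: best[i - 1][k][0] + join_cost(candidates[i - 1][k], c))
            row.append((best[i - 1][prev][0] + join_cost(candidates[i - 1][prev], c) + tc, prev))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])   # cheapest final candidate
    path = [j]
    for i in range(len(targets) - 1, 0, -1):                      # trace back
        j = best[i][j][1]
        path.append(j)
    return [candidates[i][j] for i, j in enumerate(reversed(path))]

# Toy usage: units are (name, pitch); costs penalize pitch mismatches and jumps.
cands = [[("dIn", 110), ("dIn", 150)], [("nam", 120)], [("mik*", 118), ("mik*", 90)]]
print(select_units([115, 120, 118], cands,
                   target_cost=lambda t, c: abs(t - c[1]),
                   join_cost=lambda a, b: abs(a[1] - b[1])))
# [('dIn', 110), ('nam', 120), ('mik*', 118)]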
Modification of Pitch & Duration
Pitch Synchronous Overlap Add (PSOLA)

Requires highly accurate pitch markers: needs a separate EGG signal recording or manual pitch marking.
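A heavily simplified TD-PSOLA sketch for pitch modification, assuming the pitch marks are already known (e.g. from the EGG recording mentioned above); the window choice and mark handling are simplifications, not a production algorithm.

import numpy as np

def psola_pitch(signal, marks, factor):
    """Raise (factor > 1) or lower (factor < 1) f0 at roughly constant duration.
    `marks` are sample indices of glottal pulses, assumed accurate."""
    period = int(np.median(np.diff(marks)))
    out = np.zeros(len(signal))
    t = int(marks[0])                                   # synthesis time
    while t < len(signal) - period:
        m = min(marks, key=lambda mk: abs(mk - t))      # nearest analysis mark
        lo, hi = max(m - period, 0), min(m + period, len(signal))
        frame = signal[lo:hi] * np.hanning(hi - lo)     # two-period Hann window
        start = t - (m - lo)
        if 0 <= start and start + (hi - lo) <= len(out):
            out[start:start + (hi - lo)] += frame       # overlap-add at the new mark
        t += max(int(period / factor), 1)               # closer marks -> higher pitch
    return out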
Demonstration – Prosody Modification

Analysis-by-synthesis
Original | Copy
Pitch change: Lower | Low | Copy | High | Higher
Duration change: Slower | Slow | Copy | Fast | Faster
Synthesis Methods Comparisons

               Rule-based       Concatenation
               Formant          LPC                   PSOLA
Analysis       Semi-automatic   Automatic             Automatic
Voice quality  Poor             Medium                High
Concatenation  Simple           Simple (smoothing     Poor (pitch, phase
                                parameters)           mismatches)
Computation    Low              Medium                Low
DB Size        Small            Medium                Large


Challenges
1. Human-like sound quality.
 • Improved analysis/synthesis techniques (waveform, PSOLA, RELP, LPC, formants).
 • Inventory selection, concatenation smoothing.
 • Rule-derived parameters.
2. Natural prosody.
 • Modeling of segment duration, intonation, and loudness.
 • Syntactic analysis for proper phrasing.
 • Semantic analysis for stress and focus.
3. Text processing.
 • Expansion of numerals, symbols, abbreviations.
 • Homograph disambiguation.
TTS Research Development
Two examples:
Prosody modeling
Voice conversion
Goal:
Naturalness
Expressiveness
Flexibility
Research direction:
Explain speech from first principles
Find the right parameters
Prosody Modeling

Chilin Shih (UIUC)
Greg Kochanski (Oxford, Google)
Chinese Lexical Tones
Tonal Variations

[Figure: f0 (Hz) contours over time (s) for the four Mandarin tones.
Tone 1: High Level    Tone 2: Rising
Tone 3: Low Falling-(Rising)    Tone 4: High Falling]
Typically, people classify surface variations into categories, or templates.
Different templates are chosen for different occasions.
A neural network model of Mandarin tones may use more than 100 templates.
It misses the connection among templates.
And it still has problems accounting for the data.
The Challenge: Tonal Distortion
People Talk Nearly As Fast As Possible
Basic Assumptions Used in Modeling

Pre-planning.
Balance articulatory effort and communication needs (Lindblom, Ohala).
A dynamical model for the muscles that control f0 (Hill).
Physiological constraints:
Communication constraints:
• When things are important:
• When things are not important:
We further propose:
• Speakers shift weights dynamically as they speak.
• This is the prosodic strength.
Modeling Math
p(t) = \arg\min_{p(t)} (G + R)

"Effort":  G = \int dt \, ( \dot{p}^2 + \alpha^2 \ddot{p}^2 + \beta^2 p^2 )

p(t) is the muscle tension (~frequency) at time t.

"Error":  R = \sum_{i \in \mathrm{targets}} s_i^2 r_i
Each target encodes some linguistic information; r_i is the error of the i-th target, and s_i is its importance.

r_i = \int_{\mathrm{target}_i} dt \, [ ((p - \bar{p}) - (y - \bar{y}))^2 + \lambda (\bar{p} - \bar{y})^2 ]

y = y_i(t) is the i-th pitch target, and a bar denotes an average over a target.
Intonation Model

Prosody modeling is based on Stem-ML (Soft Template Mark-up Language).
A set of mathematically defined tags with value attributes (tag value).
For example: a Tone tag with a prosodic strength value.
Allows user-defined accent shapes, phrase curves, and other speaker-specific parameters.
Generation: tag value → F0
Learning: tag, F0 → tag value
The Rest of the Model.
A model is a sequence of templates (i.e. points representing tone/accent shapes).
Each template has a strength.
For tone languages, there is one template per tone.
Templates are stretched to fit duration.
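Stretching a template to fit a duration can be as simple as interpolation; the template points below are illustrative, not the fitted Mandarin tone templates.

import numpy as np

falling_template = np.array([1.0, 0.9, 0.4, 0.1])        # assumed high-falling shape

def stretch(template, n_frames):
    """Resample a few template points onto n_frames evenly spaced frames."""
    x_old = np.linspace(0.0, 1.0, len(template))
    x_new = np.linspace(0.0, 1.0, n_frames)
    return np.interp(x_new, x_old, template)

print(stretch(falling_template, 9))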
Representing F0 As Tone Strength
Model Fits Over a Range of Speeds.
Fun Applications

Imitation of prosodic styles
Martin Luther King Jr.
TTS simulation
Learning music embellishment
Embellishment from Dinah Shore's Bicycle Built for Two
TTS simulation
Moving embellishment
TTS concert: voice conversion, amplitude modification, embellishment, stylistic rules
Applications of Prosody Modeling and Voice Conversion
• Books on tape
• Entertainment industry
  Reviving old actors
  Voices for animation
  Commercials
• Game developers & toy manufacturers
• Virtual world developers
• Assistive technologies: generating voices for people with difficulty speaking
The Future of TTS
Better DSP
Retain voice quality while changing intonation, duration, amplitude, and source.
Prosody modeling
Modeling natural duration and intonation.
Common theme
Survey language phenomena—study what the speaker wants to communicate.
Find the best parameter to control variations.
Working in all domains: human and machine.