
Speech synthesis


Text-to-speech synthesis

•  The automatic transformation from (electronic) text to speech

•  The speaker is defined by the system design

–  Single speaker

•  In contrast to speech recognition, the aim is not to handle all speakers
and all normal pronunciation variants, but to render one spoken
realization of the text that is perceived as natural and intelligible

•  A text contains orthographic words, numbers, abbreviations,
mnemonics and punctuation

–  Linguistic analysis of the text is necessary to

•  Interpret symbols

•  Analyze the grammatical structure

•  Infer the semantic interpretation of the text


Torbjørn Svendsen

Speech Synthesis

The processing steps of TTS












Text analysis: text normalization; analysis of document structure, linguistic analysis

–  Output: tagged text


Phonemic analysis: homograph disambiguation, morphological analysis,
letter-to-sound mapping

–  Output: tagged phone sequence


Prosodic analysis: intonation; duration; volume

–  Output: control sequence, tagged phones


Speech synthesis: voice rendering

–  Output: synthetic speech


Text-to-speech synthesis









•  Speech synthesis concerns the waveform generation from the
annotated symbol sequence (typically phone sequence)

•  Philosophy: Rule-based vs. data driven synthesis

•  Method: Articulatory synthesis; formant synthesis; concatenative
(waveform) synthesis









•  Different strategies give different quality but also different consistency
of quality

•  No strategy can currently provide consistently high quality (but the
methods are getting closer)

•  Limited domain gives high quality within the application domain



The synthesis space

[Figure: the synthesis space. Labels: speech knowledge, intelligibility, flexibility, naturalness, bit rate, units, processing needs, cost, complexity, vocabulary. Figure adapted from Granström.]

Main methods

•  Formant synthesis
•  Concatenative, or waveform, synthesis
•  Articulatory synthesis

Formant synthesis

[Figure: block diagram with the blocks annotated phones, rule system, formant tracks, pitch contour, formant synthesis and synthetic speech]

•  Normally rule based (knowledge driven) system, but can also be data driven
•  Each formant can be specified with center frequency, bandwidth and, optionally, amplitude
•  E.g. 2nd order filter with resonance in f_i and a bandwidth b_i:

$H_i(z) = \frac{1}{1 - 2 e^{-\pi b_i} \cos(2\pi f_i)\, z^{-1} + e^{-2\pi b_i} z^{-2}}$
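
As a minimal sketch of the second-order formant resonator above, the following Python computes the two denominator coefficients and runs the corresponding recursion. It assumes that f_i and b_i in the formula are normalized by the sampling frequency, so the functions take values in Hz and divide by `fs_hz`. The function names and the 500 Hz / 60 Hz / 16 kHz values are illustrative and not taken from the slides.

```python
import cmath
import math

def resonator_coefficients(f_hz, bw_hz, fs_hz):
    """Return (a1, a2) for H(z) = 1 / (1 + a1*z^-1 + a2*z^-2)."""
    f = f_hz / fs_hz            # assumed normalization of the center frequency
    b = bw_hz / fs_hz           # assumed normalization of the bandwidth
    r = math.exp(-math.pi * b)  # pole radius
    a1 = -2.0 * r * math.cos(2.0 * math.pi * f)
    a2 = r * r                  # equals exp(-2*pi*b)
    return a1, a2

def resonator_filter(x, f_hz, bw_hz, fs_hz):
    """Run the all-pole recursion y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    a1, a2 = resonator_coefficients(f_hz, bw_hz, fs_hz)
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def magnitude_response(f_eval_hz, f_hz, bw_hz, fs_hz):
    """Magnitude of H on the unit circle at frequency f_eval_hz."""
    a1, a2 = resonator_coefficients(f_hz, bw_hz, fs_hz)
    z_inv = cmath.exp(-2j * math.pi * f_eval_hz / fs_hz)
    return abs(1.0 / (1.0 + a1 * z_inv + a2 * z_inv * z_inv))

# Impulse response of one illustrative formant (500 Hz, 60 Hz bandwidth).
impulse = [1.0] + [0.0] * 9
response = resonator_filter(impulse, 500.0, 60.0, 16000.0)
```

A cascade synthesizer would chain several such resonators, one per formant, with the rule system updating the frequencies and bandwidths over time.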

Implementation

•  Cascade or parallel implementation
•  Voiced sounds typically use cascade, unvoiced sounds use parallel implementation
•  LPC filters can also be used
–  Normally poorer quality

Klatt formant synthesizer

[Figure: Klatt formant synthesizer]

LPC synthesis

[Figure: pulse generator and noise generator as excitation sources for the LPC filter G/A(z)]

•  Example:
–  Original
–  LPC-coded, all unvoiced/voiced
–  Speed: halved/doubled
–  Pitch: halved/doubled
–  Melody

Rule-based formant generation

•  Formants are slowly varying
–  Update intervals of 5-10 ms are sufficient
•  Target values describe stationary conditions, {Fi, Bi}
•  Rules describe transition between phones
–  Parameters describe transition shape
–  Specific rules for all transition types

Rule-based formant generation

[Figure: rule-based formant generation]

Formant synthesis

•  Flexible
•  Produces intelligible speech with few parameters
•  Simple implementation
–  Rule derivation is complex, development can be costly
•  Limited naturalness
•  NOTE: Given sufficient training data, formant generation can be data driven, e.g. using an HMM in production mode for generating the formant tracks (Acero et al.)
–  Similar approach for LPC-based synthesis more recently by Tohkura

Articulatory synthesis

•  The waveform production is performed by describing the movement of the articulators
–  Jaw opening, tongue placement and height, lip rounding, …
–  Acoustics, fluid mechanics form basis
•  Limited success
–  Complex theory
–  Computational difficulties and complexity

Synthesis by concatenation

•  Concatenation of stored waveform fragments
–  Optional modification of the fragments (duration, pitch, formants)
•  Dilemma: Use of unmodified fragments will either
–  Produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch)
–  Lead to an enormous database to cover all phonetic and prosodic events
•  How much modification is possible before degradation is audible?

Some central issues

1.  Which unit?
2.  How to design the acoustic "library" (inventory)?
–  Content, recording conditions, reading style
–  Annotation – type, effort, level
–  Segmentation and labeling – consistency, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?


1. Which unit?

•  Longer unit leads to better quality, but
–  Requires more data to be stored
–  Is more context dependent

Unit requirements

•  Low concatenation distortion
–  Longer units -> fewer concatenation points
–  Units containing attractive concatenation points
•  Low prosodic distortion
–  Small inventory means prosodic modification necessary
–  Modification introduces distortion
•  Unit should be generalizable
–  Need to be able to synthesize sequences that were not in the original inventory (except for limited domain synthesis)
•  Unit should be "trainable"
–  Finite training data sufficient to estimate or predict all units

Coverage

•  Complete coverage of all phonetic and prosodic events is impossible
–  Large Number of Rare Events (LNRE)

Some possible unit choices

•  Context independent phonemes
–  Bad concatenation properties
•  Context dependent phonemes
–  Reduces discontinuity problems
–  Large number (~125k), needs to be reduced
    •  E.g. generalized triphones or phonetic decision trees
•  Diphones (dyads)
–  ~2500 possible units
–  Reduces discontinuity problems
–  Widespread use
•  Sub-phonemic units
–  Increased use (e.g. IBM, AT&T)
–  Half-phones (AT&T), phone HMM-state (IBM)
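
The unit counts quoted above follow from simple combinatorics. The sketch below assumes a phoneme set of 50 symbols, a figure that is not stated on the slide but is consistent with its counts of about 2500 diphones and about 125k context dependent phonemes (triphones); the function name is hypothetical.

```python
def inventory_sizes(num_phonemes):
    """Upper bounds on the number of distinct units, ignoring phonotactic limits."""
    return {
        "context independent phonemes": num_phonemes,
        "diphones": num_phonemes ** 2,   # every ordered pair of phonemes
        "triphones": num_phonemes ** 3,  # phoneme with left and right context
    }

sizes = inventory_sizes(50)  # 50 phonemes is an assumption
for name, count in sizes.items():
    print(f"{name}: {count}")
```

Many of these combinations never occur in a given language, which is one reason the real inventories are smaller than the bounds.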

Some possible unit choices (cont.)

•  Syllables, words and phrases
–  Mainly used for limited domain applications
    •  Fixed message repertoire
–  Potentially good quality
–  Demands large storage
    •  And much data collection
–  Computationally demanding
    •  Complex search in large database
–  Syllables or demi-syllables most interesting


Designing the acoustic inventory

•  Recordings from one speaker, appropriately annotated
•  Voice talent very important for resulting quality
•  Design choice: Rely on prosodic modification by signal processing or aim for good coverage of natural prosodic variation in database
•  Prosodic modification at synthesis – PSOLA type synthesis
–  Typically diphone units
–  Normally desirable to have nearly constant (neutral) F0
–  Nonsense words/sentences with (near) full diphone coverage
–  Small database (~5 minutes of speech contains the essential units)

Designing the acoustic inventory (2)

•  Unit selection synthesis – rely on natural prosodic variation
–  Representative speech – speaking style defined by database
–  Many representations of each phonetic unit
    •  Gives prosodic variation
–  Large database
–  Facilitates longer units, variable units
–  Requires search for the best unit sequence
–  Rich phonetic and prosodic context
    •  Typically "real" texts
–  Text selection:
    •  Start with large number of natural sentences
    •  Analyze sentences, predict phonetic and prosodic realization
    •  Use some greedy algorithm to obtain the best coverage possible with a small number of sentences (2000-4000, typically)
    •  Design supplementary sentences to improve coverage
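
The slide only says "some greedy algorithm"; one common reading is greedy set cover, sketched below under that assumption. Each sentence is represented by the set of units its predicted realization contains, and the sentence adding the most uncovered units is picked repeatedly. The function name, the toy corpus and the diphone-like labels are all hypothetical.

```python
def greedy_select(sentences, max_sentences):
    """Pick sentences that add the most not-yet-covered units.

    sentences: dict mapping sentence id -> set of unit labels.
    Returns (chosen ids in selection order, set of covered units).
    """
    covered = set()
    chosen = []
    remaining = dict(sentences)
    while remaining and len(chosen) < max_sentences:
        # Sentence with the largest gain; ties go to the earliest entry.
        best_id = max(remaining, key=lambda sid: len(remaining[sid] - covered))
        if not remaining[best_id] - covered:
            break  # nothing left adds new units
        covered |= remaining.pop(best_id)
        chosen.append(best_id)
    return chosen, covered

# Toy corpus: units are written as diphone-like labels for illustration.
corpus = {
    "s1": {"a-b", "b-c"},
    "s2": {"b-c", "c-d", "d-e"},
    "s3": {"a-b"},
    "s4": {"e-f", "a-b"},
}
chosen, covered = greedy_select(corpus, max_sentences=3)
```

In a real design the gain would usually be weighted, for example by unit frequency, so that common units are covered first; that weighting is left out here.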

Coverage - LNRE

[Figure: P(unit)]

•  Large number of units with small probability of occurrence
•  If database units are selected randomly, the probability of encountering a unit not in the database approaches certainty for a small sequence of randomly selected sentences.
•  Unit inventory must be chosen with care
•  Fall-back solutions must exist for non-covered units
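
A small calculation makes the LNRE effect concrete. The sketch assumes, purely for illustration, that each sentence independently contains at least one unit missing from the database with probability `p_sentence`; the 5 % figure is invented and not taken from the slides.

```python
def prob_unseen(p_sentence, n_sentences):
    """Probability that at least one of n sentences needs a unit not in the database."""
    return 1.0 - (1.0 - p_sentence) ** n_sentences

# With an assumed 5 % chance per sentence, failure quickly becomes near certain.
for n in (1, 10, 50, 100):
    print(n, round(prob_unseen(0.05, n), 3))
```

Even a small per-sentence probability compounds rapidly, which is why fall-back solutions for non-covered units are needed.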

Annotation

•  For small databases, speech can be segmented and annotated manually
–  Phonemic and prosodic annotation can be detailed
•  For unit selection databases automation is necessary
–  Automatic or semi-automatic methods for segmentation in phonemic and prosodic units
–  Annotation can be fairly high-level without loss of quality
–  Annotation level and cost function for unit selection are closely linked


3. Optimal unit string

•  Selection problem arises when there are several possible choices for the unit sequence
•  Traditional diphone synthesis has only one exemplar of each unit
–  Trivial solution
•  Selection is made based on desire for naturalness and minimum discontinuity due to
–  Different phonetic contexts
–  Segmentation errors in the database
–  Acoustic variability
–  Prosodic differences (pitch discontinuity, formant tracks)
•  Search problem
–  Must define an objective function to be minimized

Objective function for search

[Figure: lattice of candidate units aligned with the sequence of target units]

Candidate segment sequence: $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$
Target units: $T = \{t_1, t_2, \ldots, t_N\}$

$d(\Theta, T) = \sum_{j=1}^{N} d_u(\theta_j, t_j) + \sum_{j=1}^{N-1} d_t(\theta_j, \theta_{j+1})$

Unit cost: $d_u(\cdot)$. Transition cost: $d_t(\cdot)$.

$\hat{\Theta} = \arg\min_{\Theta} d(\Theta, T)$

Objective function for search (2)

•  How to choose unit and transition cost functions?
–  Empirical or data driven
•  Empirical strategy:
–  Transition cost:
    •  If two segments were originally spoken in succession, dt(*)=0
    •  Otherwise, cost as sum of prosodic and coarticulatory cost
    •  Prosodic cost proportional to difference in F0 (or logF0) at boundary
    •  Coarticulatory cost based on empirical knowledge of perceived distance
–  Unit cost
    •  Contribution from prosody and context
    •  Prosody cost proportional to difference in F0
    •  Contextual cost by using a unit from a different phonetic context. Based on empirical data.

Objective function for search (3)

•  Data driven cost function
–  Transition cost
    •  Measure of spectral discontinuity, e.g. spectral distance in the transition area (distance between the end frame of the preceding unit and the first frame of the succeeding unit)
    •  (Optional) prosodic cost, e.g. magnitude of log(F0) difference
–  Unit cost
    •  Based on context
    •  Examples:
        –  Same context means no cost, different context gives infinite cost
        –  Generalized triphones (GT): Unit belongs to same GT means no cost, otherwise cost is infinity
        –  Phonetic decision trees, e.g. no cost for units at same leaf node
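
The data driven costs above can be sketched as two small functions. This is only one possible instantiation: the Euclidean distance, the weights, and the dictionary keys (`last_frame`, `first_frame`, `f0_end`, `f0_start`, `leaf`) are assumptions made for the example, not definitions from the slides.

```python
import math

def transition_cost(prev_unit, next_unit, w_spectral=1.0, w_f0=1.0):
    """Spectral distance across the join plus magnitude of the log(F0) difference."""
    spectral = math.sqrt(sum(
        (a - b) ** 2
        for a, b in zip(prev_unit["last_frame"], next_unit["first_frame"])
    ))
    prosodic = abs(math.log(prev_unit["f0_end"]) - math.log(next_unit["f0_start"]))
    return w_spectral * spectral + w_f0 * prosodic

def unit_cost(candidate, target):
    """Zero cost for units at the same decision-tree leaf, infinite cost otherwise."""
    return 0.0 if candidate["leaf"] == target["leaf"] else math.inf

# Two hypothetical units with 2-dimensional spectral frames.
left = {"last_frame": [1.0, 0.0], "f0_end": 100.0, "leaf": 7}
right = {"first_frame": [0.0, 0.0], "f0_start": 200.0, "leaf": 3}
join = transition_cost(left, right)
```

Here the join cost is 1.0 for the spectral part plus log 2 for the octave jump in F0; the relative weights would have to be tuned, typically by listening.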

Optimal unit string selection

•  Given
–  The objective function to be minimized
–  A target sequence from the TTS front end
–  A unit inventory
•  The minimization can be performed using standard dynamic programming techniques (Viterbi-style)
•  Similar to HMM decoding, but no probabilities, only cost values
•  Search can be further simplified by e.g. clustering of units
–  Initial search using units representing each cluster
–  Search refinement by selecting best cluster member as selected unit
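
The dynamic programming search can be sketched as follows: a Viterbi-style pass over the lattice of candidates that minimizes the sum of unit costs and transition costs. The function name, the data layout and the toy F0-based costs are assumptions for illustration; real systems add pruning and the clustering mentioned above.

```python
def select_units(targets, candidates, unit_cost, transition_cost):
    """Return (best candidate sequence, total cost).

    candidates[j] is the non-empty list of candidate units for targets[j].
    """
    # best[j][k]: lowest cost of any path ending in candidates[j][k].
    best = [[unit_cost(c, targets[0]) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for j in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[j]:
            costs = [best[j - 1][k] + transition_cost(p, c)
                     for k, p in enumerate(candidates[j - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + unit_cost(c, targets[j]))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = []
    for j in range(len(targets) - 1, -1, -1):
        path.append(candidates[j][k])
        k = back[j][k]
    path.reverse()
    return path, total

# Toy example: targets are desired F0 values, units are (label, F0) pairs.
targets = [100, 110, 120]
candidates = [
    [("a1", 100), ("a2", 130)],
    [("b1", 90), ("b2", 115)],
    [("c1", 120), ("c2", 100)],
]
path, total = select_units(
    targets, candidates,
    unit_cost=lambda unit, target: abs(unit[1] - target),
    transition_cost=lambda prev, nxt: abs(prev[1] - nxt[1]),
)
```

The cost grows with the number of targets times the square of the number of candidates per target, which is why candidate pruning matters for large databases.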


4. Prosodic modification

•  Techniques for prosodic modification (pitch, duration) mandatory when unit inventory is small
•  Also desirable for unit selection synthesis due to LNRE
•  Main issue: How to be able to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion
•  Example 1: Original - duration - pitch - duration and pitch
•  Example: Original – duration – pitch – duration and pitch

(Synchronous) Overlap and Add

[Figure: overlapping analysis windows, labeled 2N and N]

•  OLA: Time-scale modification, fixed distance between analysis windows. Produces irregular pitch periods
•  SOLA: Analysis window placed at position which gives max correlation between windows

Pitch Synchronous OLA (PSOLA)

•  Window is pitch synchronous
–  Centered around an excitation pulse
–  Duration equal to two pitch periods, 2*T0
•  Allows for simple modification of pitch frequency
•  Can also modify duration
•  Unvoiced sounds:
–  Fixed window length (< 10 ms)
–  Can invert every other repeated segment in order to avoid periodicities when expanding duration
•  Can provide high quality as long as the degree of modification is relatively low (factor < 2)

PSOLA – principle, F0 change

[Figure: F0 change with T=1.25*T0. Panels: original, epochs, shift, re-harmonized signal]

PSOLA – duration and F0 modification

[Figure: PSOLA duration and F0 modification]

PSOLA principle

$e(n) = \sum_{k=-\infty}^{\infty} \delta(n - kT_0)$

•  s(n) determines spectral envelope

$x(n) = e(n) * s(n) = \sum_{k=-\infty}^{\infty} s(n - kT_0)$

•  Using an appropriate, pitch synchronous window, T0 can be changed without changing spectral envelope (exact match at f=k*F0)
•  Window type and degree of change will determine distortion outside pitch harmonics (interpolated values, correctness determined by window sidelobes)

How to determine the synthesis epochs

•  ts(j) – time instance for pitch pulse (epoch) j in the synthesis
•  Ps(t) – desired pitch period at time t
•  If Ps(t) is slowly varying, ts(j+1) - ts(j) = Ps(ts(j))
•  Exact:

$t_s(j+1) - t_s(j) = \frac{1}{t_s(j+1) - t_s(j)} \int_{t_s(j)}^{t_s(j+1)} P_s(t)\, dt$

–  Next pulse offset by mean pitch period within the synthesis interval
–  Iterative calculation
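
A minimal sketch of the slowly varying approximation, where each new epoch is placed one desired pitch period after the previous one. Times are in samples, the function name is hypothetical, and the example uses a constant period of 1.25 times an assumed original period of 80 samples.

```python
def synthesis_epochs(pitch_period, t_start, t_end):
    """Place epochs with t_s(j+1) = t_s(j) + P_s(t_s(j)).

    pitch_period: function returning the desired (positive) pitch period at time t.
    """
    epochs = [t_start]
    while True:
        t_next = epochs[-1] + pitch_period(epochs[-1])
        if t_next > t_end:
            break
        epochs.append(t_next)
    return epochs

# Lengthen an assumed 80-sample period by a factor 1.25, i.e. lower F0.
epochs = synthesis_epochs(lambda t: 1.25 * 80, t_start=0, t_end=400)
```

The exact form would replace `pitch_period(epochs[-1])` by the mean period over the new interval, found by iterating on the interval length.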

Synthesis epoch calculation

[Figure: synthesis epoch calculation]

Pitch modification, no time scaling

Original pitch: $P_a(t) = t_a(i+1) - t_a(i)$
Desired pitch: $P_s(t) = \beta(t) P_a(t)$

$t_s(j+1) - t_s(j) = \frac{1}{t_s(j+1) - t_s(j)} \int_{t_s(j)}^{t_s(j+1)} \beta(t) P_a(t)\, dt$

$P_a(t)$ is piecewise constant
$\beta(t)$ is normally constant or linear

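Changing the time scale

$t_s = D(t_a) = \int_0^{t_a} \alpha(\tau)\, d\tau$

Presume $\alpha(\tau) = \alpha$ (reduce speed when $\alpha > 1$)

Similar derivation to pitch epoch determination:

$t_s(j+1) - t_s(j) = \frac{\alpha}{t_s(j+1) - t_s(j)} \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} P_a(t)\, dt$

If $P_a(t) = P_a$ (constant in interval), the integral equals $P_a \left(t_s(j+1) - t_s(j)\right)/\alpha$, the factor $\alpha$ cancels, and

$t_s(j+1) - t_s(j) = P_a$

so time scaling alone leaves the pitch period unchanged.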

Changing the time scale

[Figure: changing the time scale]

All the modifications

Changing both time and pitch:

$t_s(j+1) - t_s(j) = \frac{\alpha}{t_s(j+1) - t_s(j)} \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} \beta(t) P_a(t)\, dt$
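
The combined modification can be sketched for constant scale factors. The code assumes a constant time scale `alpha`, a constant pitch scale `beta`, a piecewise constant analysis pitch period, and the slowly varying approximation, so each synthesis interval is `beta` times the analysis period at the corresponding analysis time. Pairing each synthesis epoch with the nearest analysis epoch is a common PSOLA step that the slides do not spell out; the function name is hypothetical.

```python
import bisect

def map_epochs(analysis_epochs, alpha, beta):
    """Return (synthesis_time, analysis_index) pairs for constant alpha and beta.

    analysis_epochs: sorted pitch-mark times with at least two entries.
    """
    periods = [b - a for a, b in zip(analysis_epochs, analysis_epochs[1:])]

    def period_at(t_a):
        # Piecewise constant analysis period P_a(t).
        i = bisect.bisect_right(analysis_epochs, t_a) - 1
        return periods[max(0, min(i, len(periods) - 1))]

    t_s = alpha * analysis_epochs[0]
    t_end = alpha * analysis_epochs[-1]
    mapping = []
    while t_s <= t_end:
        t_a = t_s / alpha  # corresponding time on the analysis axis
        nearest = min(range(len(analysis_epochs)),
                      key=lambda k: abs(analysis_epochs[k] - t_a))
        mapping.append((t_s, nearest))
        t_s += beta * period_at(t_a)
    return mapping

marks = [0, 100, 200, 300, 400]                     # pitch marks in samples
stretched = map_epochs(marks, alpha=2.0, beta=1.0)  # twice as long, same pitch
raised = map_epochs(marks, alpha=1.0, beta=0.5)     # same length, period halved
```

In both examples each analysis window ends up being used about twice, once to fill the longer duration and once to fill the denser pulse train.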

Epoch positioning in database

•  Database must be annotated with pitch pulse location
•  Accurate positioning necessary for good performance
•  Automatic methods using pitch estimation techniques give reasonably good results
•  Use of laryngograph (electroglottograph – EGG) during recording is recommended
–  Measures resistance over vocal cords – dependent on glottal opening
–  Peak picking of derivative of EGG signal

Epochs from EGG signal

[Figure: speech signal, EGG signal and detected pulse locations]

•  Peak picking on EGG or its time derivative
•  Accurate epoch and F0 estimation
•  Voiced/unvoiced determination
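
Peak picking on the EGG derivative can be sketched as below. The first difference stands in for the time derivative, and the threshold and signal polarity (epochs marked by a steep rise) are assumptions; real EGG recordings may need the opposite sign and some smoothing first.

```python
def detect_epochs(egg, threshold):
    """Indices of local maxima of the first difference that exceed threshold."""
    diff = [egg[n + 1] - egg[n] for n in range(len(egg) - 1)]
    epochs = []
    for n in range(1, len(diff) - 1):
        if diff[n] > threshold and diff[n] >= diff[n - 1] and diff[n] > diff[n + 1]:
            epochs.append(n)
    return epochs

# Synthetic EGG-like square wave with a period of 10 samples.
egg = [1.0 if n % 10 < 5 else 0.0 for n in range(40)]
epochs = detect_epochs(egg, threshold=0.5)
periods = [b - a for a, b in zip(epochs, epochs[1:])]
```

The spacing between detected epochs gives the pitch period, and stretches with no detected epochs can be treated as unvoiced.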

PSOLA limitations

•  Amplitude mismatch
•  Voiced fricatives
–  Increased buzziness
•  All modification will introduce distortion and unnaturalness
–  Degree dependent on amount of modification
•  Limits on maximum modification

Phase mismatches

•  Wrong positioning of pitch pulses in database
•  Causes glitches in output

Pitch mismatches

•  Correct F0 and pulse position
•  Different F0 in segments cause spectral and waveform discontinuities

HMMs for synthesis

•  In speech recognition, Hidden Markov models are used to model speech production
–  Task of recognizer is to find the model that best explains the observed utterance
•  If the HMM is used for generating observations, the produced feature vector sequence can be used to produce speech from a given unit sequence (phone sequence)
–  The feature vectors must be suitable for speech production
–  Combination of continuous and discrete elements
–  Modifications to HMM theory are necessary to facilitate the generative mode
–  Potential for efficient and flexible synthesis
•  Basis for HMM-based synthesis

HMM-based speech synthesis

•  Training from database
•  Produce excitation and filter parameters for e.g. LPC-type speech generation

Text-to-speech synthesis, a hybrid solution

•  Speech training database
•  HMM based system for prediction and unit selection
•  Experimental system
•  Very good evaluation in international competition (Blizzard Challenge 2010)

[Figure: system diagram. Blocks: training data, input text, TTS frontend, analysis, target model construction, HTS training, state alignment, voice database, candidate list construction, selection & boundary decision, waveform concatenation]

A few examples

•  Diphone synthesis: Festival, Arne, Infovox
•  Unit selection synthesis: Festival, AT&T NextGen
•  HMM synthesis
•  Hybrid HMM/Unit selection
•  Limited domain unit selection synthesis

Summary

•  Data-driven vs synthesis by rule.
•  Current synthesis generation is concatenative – waveform synthesis.
•  Issues in waveform synthesis:
–  Unit definition.
–  Definition and annotation of waveform library.
–  Unit selection.
–  Prosodic modification.
•  Single unit realization, "diphone synthesis", requires units to be prosodically modified.
•  Unit selection synthesis aims to use natural prosody and minimal prosodic modification.