Chapter05 AKE Eng v2.0

CHAPTER 5
Speak
Version 2.0
Copyright © CUHK Jockey Club AI for the Future Project. All rights reserved.
Update(s) in Version 2.0
• All pages: Updated the wording
• Pages 12, 15, 26, 31, 32, 35, 36, 38, 47: Added new slides
Content
Awareness
1. Human speech production and the information encoded in speech signal
2. What text-to-speech (TTS) synthesis is and how it works
3. Various TTS applications and the difficulties of TTS
Knowledge
1. The historical background of speech synthesis
2. Modern TTS approaches and the key processes involved
3. The approaches to measure TTS
Ethics
1. Fairness and security issues that arise from TTS
Awareness
Spoken Language and Intelligence
• Ability to process human language is considered a sign of intelligence
• Technology that can generate output speech is core to AI
• Let’s look at how humans produce speech!
Human Speech Production
Source: Loosejocks
What is Text-to-Speech (TTS) Synthesis ?
• Speech synthesis, more specifically known as text-to-speech (TTS), is a comprehensive technology that involves many disciplines such as acoustics, linguistics, digital signal processing and statistics
• The main task of TTS is to convert written input into speech output. This video shows how TTS works
Source: The ScienceElf
Discussion
Can you suggest some applications of TTS?
TTS Applications and Challenges
Example 1: Helping people with reading disabilities (Source: Understood)
Example 2: Helping people with speaking disabilities (Source: Bloomberg)
Example 3: Automatic announcement system (Source: Ivona)
Example 4: Text-to-visual-speech (TTVS) in pronunciation training (Source: CUHK)
Speech Characteristics
Speech contains rich information such as:
• Meaning
• Intent
• Accent
• Age
• Attitude
• Education level
• Emotions
• Gender
• Health
• Language proficiency
• Personality
What kind of information should we put in machine-generated speech?
Task 1: Characteristics of Speech
Listen and describe the speech characteristics in the following table:
Speech Volume Speed Pitch/Tone
Angry
Elderly
Task 1: Characteristics of Speech
Answers:
Speech Volume Speed Pitch/Tone
Angry Stronger Faster Higher
Elderly Weaker Slower Lower
Summary of Task 1
The ideal TTS is one that can generate speech with
rich information for human communication
Task 2: Difficulties of TTS
Listen to the following recordings and determine:
Which of them is/are recorded by humans? Which is/are recorded by machines?
Test 1 A B
Test 2 A B
Test 3 A B
Test 4 A B
Task 2: Difficulties of TTS
Test 1: “That girl did a video about Star Wars lipstick.”
A: Human B: Machine
Test 2: “She earned a doctorate in sociology at Columbia University.”
A: Machine B: Human
Test 3: “George Washington was the first President of the United States.”
A: Machine B: Human
Test 4: “I'm too busy for romance.”
A: Human B: Machine
Summary of Task 2
For neutral speech, TTS generates very human-like outputs
For expressive speech, we can easily identify the differences between human- and machine-generated outputs
Summary of Awareness
TTS definition
TTS accepts written input and generates spoken output
TTS challenges
The long-term goal of TTS is to be able to synthesise all speech characteristics for effective communication with listeners
Knowledge
Linguistic Hierarchy
Speech synthesis starts from textual input and ends with a synthesised waveform
Sentence: “I love Hong Kong”
Grammar: pronoun (I), verb (love), noun (Hong Kong)
Words / Phrases: I, love, Hong Kong
Phones: aɪ lʌv hɔŋ kɔŋ
Acoustics (the synthesised waveform)
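The levels of this hierarchy can be pictured as nested data. The sketch below is only an illustration: the class names `Word` and `Sentence` and the hand-written entries are assumptions made for this example, not part of any TTS toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str          # the word or phrase as written
    pos: str           # grammatical category (part of speech)
    phones: List[str]  # phone symbols, one string per syllable here


@dataclass
class Sentence:
    words: List[Word]

    def text(self) -> str:
        # Sentence level: the written input.
        return " ".join(w.text for w in self.words)

    def phone_sequence(self) -> List[str]:
        # Phone level: flatten the words into the sequence that the
        # acoustic stage would turn into a waveform.
        return [p for w in self.words for p in w.phones]


sentence = Sentence([
    Word("I", "pronoun", ["aɪ"]),
    Word("love", "verb", ["lʌv"]),
    Word("Hong Kong", "noun", ["hɔŋ", "kɔŋ"]),
])

print(sentence.text())
print(sentence.phone_sequence())
```

Each level keeps a link to the one above it, so a later stage can ask, for example, which word a given phone belongs to.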
Historical Background
Von Kempelen’s speaking machine
• An early example of speech synthesis technology, designed in 1791
• It was created to mimic the windpipe and articulators of the human speech production system
Von Kempelen’s speaking machine is quite similar to the human speech production system. Can you see the similarities between the two voice production processes?
From speaking machines to electrical systems, speech synthesis and TTS have now developed to incorporate statistical algorithms and deep learning
Watch:
https://www.youtube.com/watch?v=oIjkzZGe2I8
Current Development
Speech synthesis today uses statistical algorithms and deep learning
approaches to improve its results
Listen to the two recordings of synthesised speech below. The speech produced with modern TTS is much clearer and more natural
Previous TTS
Modern TTS (using Big Data, Cloud Computing and Machine Learning)
Modern TTS Approaches: Training Process
Modern TTS Approaches: Testing Process
Testing process of an optimised TTS system
Text Analysis
Text analysis is a procedure that involves the translation of input text into
symbolic linguistic forms
Steps of text analysis include:
Text Normalisation
Text-to-Phoneme Mapping
Prosody Prediction
Task 1: Text Normalisation
Convert the following statements into words ONLY, e.g., ~10% → around ten percent
1. The time now is 1:25pm, and the score of the basketball match is 3:2
2. It was in 1998 that he sold 1998 boxes of chocolate and
earned $200 instead of $20 million
3. 我外號｢高人｣，有成150cm高。
4. 她弟弟出生於2014/3/31號。
Task 1: Text Normalisation
1. Convert the statements into words ONLY, e.g., ~10% → around ten percent
• The time now is 1:25pm, and the score of the basketball match is 3:2.
Answer: The time now is one twenty-five pm, and the score of the basketball match is three to two.
• It was in 1998 that he sold 1998 boxes of chocolate and earned $200 instead of $20 million.
Answer: It was in nineteen ninety-eight that he sold one thousand nine hundred and ninety-eight boxes of chocolate and earned two hundred dollars instead of twenty million dollars.
• 我外號｢高人｣，有成150cm高。 (“My nickname is ‘Tall Guy’, and I am a full 150 cm tall.”)
Answer: 我外號｢高人｣，有成一百五十公分高。
• 她弟弟出生於2014/3/31號。 (“Her younger brother was born on 31 March 2014.”)
Answer: 她弟弟出生於二零一四年三月三十一號。
Summary of Task 1
Text normalisation is the procedure that transforms written text into "speak-able" words
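Text normalisation can be sketched as a handful of ordered rewrite rules. The example below is a minimal illustration that covers only the English patterns from this task. The function names and the context rules are assumptions made for this sketch (for example, that a four-digit number after "in" is a year). A real normaliser needs many more rules, or a trained model, to resolve such ambiguities.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out a quantity, 0 <= n < 1,000,000."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = ONES[hundreds] + " hundred"
        return head + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = number_to_words(thousands) + " thousand"
    if rest == 0:
        return head
    return head + (" and " if rest < 100 else " ") + number_to_words(rest)


def year_to_words(year: int) -> str:
    """Read a year as two pairs of digits, e.g. 1998 as 19 + 98."""
    high, low = divmod(year, 100)
    if low < 10:  # years such as 2000 or 1905: fall back to a plain number
        return number_to_words(year)
    return f"{number_to_words(high)} {number_to_words(low)}"


def normalise(text: str) -> str:
    w = number_to_words
    # Order matters: specific patterns must run before the generic ones.
    rules = [
        (r"\bin (\d{4})\b", lambda m: "in " + year_to_words(int(m[1]))),
        (r"\b(\d{1,2}):(\d{2})\s*(am|pm)\b",
         lambda m: f"{w(int(m[1]))} {w(int(m[2]))} {m[3]}"),     # clock time
        (r"\$(\d+) million\b", lambda m: w(int(m[1])) + " million dollars"),
        (r"\$(\d+)", lambda m: w(int(m[1])) + " dollars"),
        (r"\b(\d+):(\d+)\b",
         lambda m: f"{w(int(m[1]))} to {w(int(m[2]))}"),          # score
        (r"~(?=\d)", lambda m: "around "),
        (r"(\d+)%", lambda m: w(int(m[1])) + " percent"),
        (r"\b\d+\b", lambda m: w(int(m[0]))),                     # any other number
    ]
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(normalise("~10%"))
print(normalise("It was in 1998 that he sold 1998 boxes of chocolate"))
```

The same digits "1998" come out differently depending on context, which is exactly why normalisation cannot be done with a simple digit-by-digit lookup.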
Task 2: Text-to-Phoneme Mapping
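Convert the target words (“read” in Sentence 1; “單” and “行” in Sentence 2) into phonetic labels:
Sentence 1:
“There are many books to read but I have read them all.”
Sentence 2:
"當單志堅單獨行街時，看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
(“While 單志堅 was walking down the street alone, he saw the record shop beside HSBC holding a big Christmas sale.”)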
Task 2: Text-to-Phoneme Mapping
1. Convert the target words into phonetic labels:
• Sentence 1: “There are many books to read but I have read them all.”
Answer:
read: (infinitive, first occurrence) /riːd/, (past participle, second occurrence) /rɛd/
• Sentence 2: "當單志堅單獨行街時，看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
Answer:
單︰(surname) /sin6/, (verb/adj) /daan1/
行: (walk) /hang4/, (bank) /hong4/, /hong6/, /haang4/
Summary of Task 2
• Linguistic analysis is needed for the TTS system to select the correct pronunciations
• Text-to-phoneme mapping can be challenging due to:
Prosody (phrasing): e.g. 上海市長江大橋 can be read as 上海市 / 長江大橋 (“Shanghai Yangtze River Bridge”) or 上海市長 / 江大橋 (“Shanghai mayor Jiang Daqiao”)
Alternate pronunciation: e.g. "read" (/riːd/ or /rɛd/), "單" (/daan1/ or /sin6/)
Proper nouns: 費先生 /bei3/ /sin1/ /saang1/，單志堅 /sin6/ /zi3/ /gin1/
Part of speech: e.g. record: reCORD (verb) versus REcord (noun)
• Text-to-phoneme mapping may use pronunciation dictionaries and letter-to-sound rules to facilitate correct mapping
Pronunciation dictionaries
A pronunciation dictionary (lexicon) lists words together with their phoneme sequences and is used to map text to phonemes
Letter-to-sound rules are used for words that are absent from the dictionary, especially proper names or foreign words, e.g. 'Scooby Doo'
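A dictionary lookup with a letter-to-sound fallback can be sketched in a few lines. Everything in the example is a toy assumption: the lexicon entries, the part-of-speech tags (`VB`, `VBD`, `NN`) and the three spelling rules were written by hand for this illustration and are far smaller than anything a real system would use.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy lexicon: (word, part-of-speech tag or None) -> phonemes.
LEXICON: Dict[Tuple[str, Optional[str]], str] = {
    ("read", "VB"): "riːd",       # base form: "books to read"
    ("read", "VBD"): "rɛd",       # past form: "I have read them"
    ("record", "NN"): "ˈrɛkɔːd",  # noun: stress on the first syllable
    ("record", "VB"): "rɪˈkɔːd",  # verb: stress on the second syllable
    ("books", None): "bʊks",
}

# Hypothetical letter-to-sound rules for out-of-dictionary words.
LETTER_RULES: Dict[str, str] = {"sc": "sk", "oo": "uː", "y": "i"}


def letter_to_sound(word: str) -> str:
    """Greedy spelling-to-sound conversion, trying two-letter rules first."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in LETTER_RULES:
                phones.append(LETTER_RULES[chunk])
                i += len(chunk)
                break
        else:
            phones.append(word[i])  # no rule: keep the letter as its own sound
            i += 1
    return "".join(phones)


def to_phonemes(word: str, pos: Optional[str] = None) -> str:
    """Look the word up with its part of speech, then without, then guess."""
    key = word.lower()
    for candidate in ((key, pos), (key, None)):
        if candidate in LEXICON:
            return LEXICON[candidate]
    return letter_to_sound(key)


print(to_phonemes("read", "VB"), to_phonemes("read", "VBD"))
print(to_phonemes("Scooby"), to_phonemes("Doo"))
```

The part-of-speech tag is what lets the lookup choose between the two pronunciations of "read" or "record". Words missing from the lexicon fall through to the spelling rules.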
Prosody
• In linguistics, prosody is concerned with the elements of speech that are not individual phonetic segments (vowels and consonants) but are properties of syllables and larger units of speech
• These include intonation, stress and rhythm
• While words convey what is being said, prosody conveys how it is being said
• Prosodic features include pitch, duration, energy (loudness) and more. They all help convey emotions and emphasis, and allow the same words to be expressed in different ways
Prosodic Control
Example of prosodic control
“We can actually go up there”
Is it a question?
• Yes: intonation rises (“We can actually go up there?”)
• No: intonation falls (“We can actually go up there.”)
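The rising and falling intonation in this example can be imitated with a very crude pitch track. The sketch below is an assumption-laden illustration, not how a real prosody model works: the function name, the 200 Hz baseline, the 30% final change and the rule "a question mark means a rise" are all invented for this example.

```python
from typing import List


def pitch_contour(sentence: str, n_points: int = 10,
                  base_hz: float = 200.0,
                  final_change: float = 0.3) -> List[float]:
    """Return a level pitch track that rises or falls over its last part."""
    # Assumed rule: a final "?" means rising intonation, anything else falls.
    direction = 1.0 if sentence.strip().endswith("?") else -1.0
    tail_start = (2 * n_points) // 3
    contour = []
    for i in range(n_points):
        if i < tail_start:
            contour.append(base_hz)  # level part of the utterance
        else:
            # Move linearly towards the final rise or fall.
            progress = (i - tail_start + 1) / (n_points - tail_start)
            contour.append(round(base_hz * (1 + direction * final_change * progress), 1))
    return contour


question = pitch_contour("We can actually go up there?")
statement = pitch_contour("We can actually go up there.")
print(question)
print(statement)
```

The words are identical in both calls; only the pitch values differ, which is the sense in which prosody describes how something is said.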
Waveform Generation
TTS Evaluation
Objective Measurement
• Objective measurement compares natural and synthesised speech by computing how similar their waveforms are, e.g. as a signal distance
• The figure below compares the waveforms of human and synthesised speech, showing details of natural speech that seem absent in synthesised speech
Human Speech
Synthesised Speech
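One simple signal distance is the root-mean-square difference between two sample sequences of equal length. The sketch below uses made-up sine tones in place of real recordings. The function names, sample rate and frequencies are assumptions for this illustration. Practical objective measures are usually computed on time-aligned spectral features rather than on raw samples.

```python
import math
from typing import List, Sequence, Tuple


def rms_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square difference between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def tone(components: Sequence[Tuple[float, float]],
         n_samples: int = 800, sample_rate: int = 8000) -> List[float]:
    """Sum of sine waves given as (frequency in Hz, amplitude) pairs."""
    return [
        sum(amp * math.sin(2 * math.pi * freq * t / sample_rate)
            for freq, amp in components)
        for t in range(n_samples)
    ]


# Stand-ins for real audio: the "natural" signal has an extra harmonic
# that the "synthesised" signal lacks.
natural = tone([(200, 1.0), (400, 0.5)])
synthesised = tone([(200, 1.0)])

print(rms_distance(natural, natural))      # identical signals give 0.0
print(rms_distance(natural, synthesised))  # the missing detail shows up as distance
```

A smaller distance means the synthesised signal is closer to the natural one, although a small distance alone does not guarantee that listeners will find the speech natural.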
TTS Evaluation
Subjective Measurement
• Subjective measurement compares opinion scores for various factors. To obtain an opinion score for naturalness or quality, a group of people is asked to rate synthesised speech in listening tests, and the ratings are averaged to give the mean opinion score (MOS)
• MOS is the most frequently used method to evaluate the quality of generated speech. MOS ranges from 1 to 5, and real human speech typically scores 4.5 to 4.8
[Li et al., 2018]
Task 3: TTS Evaluation by Subjective Measurement
Subjective Measurement
• Listen to the four synthesised speech audio files below (Speech 1 to Speech 4) and rate each one on its “intelligibility”, “naturalness” and “quality” on a 5-point scale (1 = poor, 5 = excellent)
• Collect the ratings from your peers and calculate the average scores for the four audio files
Task 3: TTS Evaluation by Subjective Measurement
Rating (1 = Poor, 5 = Excellent)
Columns: Intelligibility, Naturalness, Quality
Rows, for each of Speech 1 to Speech 4: your rating, your classmates' ratings, and the average rating
Summary of Task 3
The example below shows how we collect the rating scores from a group of listeners and take the average rating as the subjective evaluation
Example Evaluation:
Speech 1: 3, 2, 2, 3, 3 | Avg = 2.6
Speech 2: 4, 3, 3, 3, 3 | Avg = 3.2
Speech 3: 1, 2, 2, 2, 3 | Avg = 2.0
Speech 4: 4, 4, 3, 3, 4 | Avg = 3.6
[Figure: bar chart of the subjective score (axis from 0 to 4) for Speech 1 to Speech 4]
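The averaging step is simple enough to check by hand or in code. The sketch below computes a mean opinion score from the five ratings listed for each recording in the example evaluation. The function name and the 1 to 5 range check are assumptions added for this illustration.

```python
from typing import Dict, List

# Listener ratings for each recording, taken from the example evaluation.
ratings: Dict[str, List[int]] = {
    "Speech 1": [3, 2, 2, 3, 3],
    "Speech 2": [4, 3, 3, 3, 3],
    "Speech 3": [1, 2, 2, 2, 3],
    "Speech 4": [4, 4, 3, 3, 4],
}


def mean_opinion_score(scores: List[int]) -> float:
    """Average a list of 1-to-5 ratings, rounded to one decimal place."""
    if not scores:
        raise ValueError("at least one rating is required")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("ratings must be on the 1 to 5 scale")
    return round(sum(scores) / len(scores), 1)


mos = {name: mean_opinion_score(scores) for name, scores in ratings.items()}
for name, score in mos.items():
    print(f"{name}: MOS = {score}")

# The recording with the highest average is judged best by this group.
best = max(mos, key=mos.get)
print("Highest rated:", best)
```

With only five listeners the averages are rough. Listening tests normally use more raters so that one unusual rating does not dominate the score.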
Extension of TTS
Voice Conversion
Conversion of TTS synthesised speech to the speech of the target speaker
Demo: Original, Voice Conversion, Target (samples 1 and 2)
Visual Speech Synthesis
• Also known as talking head or avatar
• Focuses on the speech and lip synchronisation of an avatar or talking head
Demo: CUHK’s technology demonstration
Visit this website for an interactive demo: https://ttsdemo.com/
Summary of Knowledge
Speech Synthesis Technology
Current approaches to developing speech synthesis technologies require big data for training
Text Analysis
Text is analysed for normalisation, phoneme mapping, and prosodic prediction to produce a linguistic representation
Waveform Generation
Maps linguistic representation to audio by training on labelled big data
Evaluation
Evaluation of TTS includes objective and subjective evaluation
Ethics
Task 1: Fairness Issues
When you explore various TTS demonstrations, why is Cantonese often unavailable?
Task 1: Fairness Issues
1. When you explore various TTS demonstrations, why is Cantonese often unavailable?
• TTS for Cantonese is often not available because there is a lack of annotated Cantonese data for training TTS
Summary of Task 1
Big data is necessary to train a TTS system
Languages that do not have sufficient data will not be well supported by TTS
For instance, the current release of Microsoft Neural Voices only supports four languages
Task 2: Security Issues
Watch the video and consider:
What security issues can appear if this technology is used to generate fake voices?
Source: Bloomberg Quicktake
Task 2: Security Issues
1. What security issues can appear if this technology is used to generate
fake voices?
• Fake news, phone scams, voice lock overrides, etc.
Summary of Task 2
“A person's voice is part of their identity”
Voice impersonation attacks:
By generating personalised fake voices, criminals can steal other people’s identities to conduct phone scams or break through the voice lock of smartphones, i.e. attacks on voice biometric systems
Summary of Ethics
Fairness Issue
Low-resource languages of minority communities cannot be well-supported by TTS due to the lack of annotated training data
Security Issue
Personalised fake voices are dangerous and potentially harmful if they are misused for identity theft
