
CHAPTER 5

Speak
Version 2.0

Copyright © CUHK Jockey Club AI for the Future Project. All rights reserved.
Update(s) in Version 2.0
• All pages: updated the wording
• Pages 12, 15, 26, 31, 32, 35, 36, 38, 47: added new slides
Content
Awareness
1. Human speech production and the information encoded in speech signal
2. What text-to-speech (TTS) synthesis is and how it works
3. Various TTS applications and the difficulties of TTS

Knowledge
1. The historical background of speech synthesis
2. Modern TTS approaches and the key processes involved
3. Approaches to evaluating TTS

Ethics
1. Fairness and security issues that arise from TTS

Awareness

Spoken Language and Intelligence

• The ability to process human language is considered a sign of intelligence

• Technology that can generate speech output is core to AI

• Let’s look at how humans produce speech!

Human Speech Production

Source: Loosejocks

What is Text-to-Speech (TTS) Synthesis?

• Speech synthesis, more specifically known as text-to-speech (TTS), is a comprehensive technology that involves many disciplines such as acoustics, linguistics, digital signal processing and statistics

• The main task of TTS is to convert written input into speech output. This video shows how TTS works

Source: The ScienceElf

Discussion

Can you suggest some applications of TTS?

TTS Applications and Challenges
Example 1: Helping people with reading disabilities (Source: Understood)
Example 2: Helping people with speaking disabilities (Source: Bloomberg)
Example 3: Automatic announcement system (Source: Ivona)
Example 4: Text-to-visual-speech (TTVS) in pronunciation training (Source: CUHK)

Speech Characteristics
Speech contains rich information such as:
• Meaning
• Intent
• Accent
• Age
• Attitude
• Education level
• Emotions
• Gender
• Health
• Language proficiency
• Personality

What kind of information should we put in machine-generated speech?
Task 1: Characteristics of Speech
Listen and describe the speech characteristics in the following table:

Speech     Volume     Speed     Pitch/Tone
Angry
Elderly

Task 1: Characteristics of Speech
Answers:

Speech     Volume     Speed     Pitch/Tone
Angry      Stronger   Faster    Higher
Elderly    Weaker     Slower    Lower

Summary of Task 1
The ideal TTS is one that can generate speech with
rich information for human communication

Task 2: Difficulties of TTS
Listen to the following recordings and determine:
Which of them is/are recorded by humans? Which is/are recorded by machines?

Test 1 A B

Test 2 A B

Test 3 A B

Test 4 A B

Task 2: Difficulties of TTS
Answers:

Test 1: “That girl did a video about Star Wars lipstick.”
A: Human   B: Machine

Test 2: “She earned a doctorate in sociology at Columbia University.”
A: Machine   B: Human

Test 3: “George Washington was the first President of the United States.”
A: Machine   B: Human

Test 4: “I'm too busy for romance.”
A: Human   B: Machine

Summary of Task 2

For neutral speech, TTS generates very human-like output

For expressive speech, we can easily identify the differences between human- and machine-generated output

Summary of Awareness

TTS definition
TTS accepts written input and generates spoken output

TTS challenges
The long-term goal of TTS is to be able to synthesise all speech characteristics for effective communication with listeners

Knowledge

Linguistic Hierarchy

Speech synthesis starts from textual input and ends with a synthesised waveform

Sentence:          “I love Hong Kong”
Grammar:           pronoun | verb | noun
Words / Phrases:   I | love | Hong Kong
Phones:            aɪ | lʌv | hɔŋ kɔŋ
Acoustics:         [Figure: synthesised waveform]

Historical Background
Von Kempelen’s speaking machine
• An early example of speech synthesis technology, designed in 1791
• It was created to mimic the windpipe and articulators of the human speech production system

Von Kempelen’s speaking machine is quite similar to the human speech production system. Can you see the similarities between both voice production processes?

From speaking machines to electrical systems, speech synthesis and TTS have now developed to incorporate statistical algorithms and deep learning

Watch: https://www.youtube.com/watch?v=oIjkzZGe2I8

Current Development
Speech synthesis today uses statistical algorithms and deep learning
approaches to improve its results

Listen to the two recordings of synthesised speech below. The speech produced with modern TTS is much clearer and more natural

Recording 1: Previous TTS
Recording 2: Modern TTS (using Big Data, Cloud Computing and Machine Learning)

Modern TTS Approaches: Training Process

Modern TTS Approaches: Testing Process

Testing process of an optimised TTS system

Text Analysis
Text analysis converts the input text into a symbolic linguistic representation
Steps of text analysis include:
1. Text Normalisation
2. Text-to-Phoneme Mapping
3. Prosody Prediction

Task 1: Text Normalisation

Convert the following statements into words ONLY, e.g. ~10% → around ten percent

1. The time now is 1:25pm, and the score of the basketball match is 3:2
2. It was in 1998 that he sold 1998 boxes of chocolate and earned $200 instead of $20 million
3. 我外號「高人」,有成150cm高。 (My nickname is “Tall Guy”; I am all of 150cm tall.)
4. 她弟弟出生於2014/3/31號。 (Her younger brother was born on 2014/3/31.)

Task 1: Text Normalisation
1. Convert the statements into words ONLY, e.g. ~10% → around ten percent

• The time now is 1:25pm, and the score of the basketball match is 3:2.
Answer: The time now is one twenty-five pm, and the score of the basketball match is three to two.
• It was in 1998 that he sold 1998 boxes of chocolate and earned $200 instead of $20 million.
Answer: It was in nineteen ninety-eight that he sold one thousand nine hundred and ninety-eight boxes of chocolate and earned two hundred dollars instead of twenty million dollars.
• 我外號「高人」,有成150cm高。
Answer: 我外號「高人」,有成一百五十公分高。 (“150cm” is read out as “one hundred and fifty centimetres”.)
• 她弟弟出生於2014/3/31號。
Answer: 她弟弟出生於二零一四年三月三十一號。 (“2014/3/31” is read out as “year two-zero-one-four, month three, day thirty-one”.)

Summary of Task 1

Text normalisation is the procedure that transforms written text into "speak-able" words
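
To make the idea concrete, here is a minimal sketch of rule-based text normalisation for a few of the English patterns in Task 1. It is only an illustration under stated assumptions: the function names (`number_to_words`, `year_to_words`, `normalise`) and the small set of regular-expression rules are hypothetical, not part of any particular TTS system, and real normalisers need far richer context handling.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999999 (British style with 'and')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " and " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest == 0:
        return words
    return words + (" and " if rest < 100 else " ") + number_to_words(rest)


def year_to_words(year):
    """Read a year as two pairs of digits, e.g. 1998 -> nineteen ninety-eight."""
    high, low = divmod(year, 100)
    if low < 10:  # e.g. 2005: fall back to the plain cardinal reading
        return number_to_words(year)
    return number_to_words(high) + " " + number_to_words(low)


def normalise(text):
    """Apply a few hypothetical rewrite rules, most specific first."""
    def words(match, group=1):
        return number_to_words(int(match.group(group)))

    text = re.sub(r"~(\d+)%", lambda m: "around " + words(m) + " percent", text)
    text = re.sub(r"(\d+)%", lambda m: words(m) + " percent", text)
    text = re.sub(r"\$(\d+)( million)?",
                  lambda m: words(m) + (m.group(2) or "") + " dollars", text)
    # A colon followed by am/pm is read as a clock time ...
    text = re.sub(r"(\d{1,2}):(\d{2}) ?(am|pm)",
                  lambda m: words(m, 1) + " " + words(m, 2) + " " + m.group(3), text)
    # ... any other colon between numbers is read as a score.
    text = re.sub(r"(\d+):(\d+)", lambda m: words(m, 1) + " to " + words(m, 2), text)
    # "in" followed by four digits is assumed to be a year.
    text = re.sub(r"\bin (\d{4})\b",
                  lambda m: "in " + year_to_words(int(m.group(1))), text)
    # Every remaining number is read as a quantity.
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group(0))), text)


print(normalise("~10%"))
print(normalise("The time now is 1:25pm, and the score of the basketball match is 3:2"))
```

The order of the rules matters: the same digits ("1998", "3:2") are read differently depending on context, which is exactly why text normalisation is harder than a simple digit-to-word lookup.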

Task 2: Text-to-Phoneme Mapping

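Convert the target words in the following sentences into phonetic labels:

Sentence 1 (target word: "read"):
“There are many books to read but I have read them all.”

Sentence 2 (target characters: 單 and 行):
"當單志堅單獨行街時,看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
(When Sin Chi Kin was out walking alone, he saw that the record shop next to HSBC was holding a big Christmas sale.)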

Task 2: Text-to-Phoneme Mapping

1. Convert the target words in the following sentences into phonetic labels:

• Sentence 1: “There are many books to read but I have read them all.”
Answer:
Read: (present tense) /riːd/, (past tense) /rɛd/

• Sentence 2: "當單志堅單獨行街時,看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
Answer:
單: (surname) /sin6/, (verb/adj) /daan1/
行: (walk) /hang4/, (bank) /hong4/, /hong6/, /haang4/

Summary of Task 2
• Linguistic analysis is needed for the TTS system to select the correct pronunciations
• Text-to-phoneme mapping can be challenging due to:
    Prosodic phrasing: e.g. 上海市長江大橋 can be read as 上海市 / 長江大橋 (Shanghai City, Yangtze River Bridge) or as 上海 / 市長 / 江大橋 (Shanghai, Mayor, Jiang Daqiao)
    Alternate pronunciations: e.g. "read" (/riːd/ or /rɛd/), "單" (/daan1/ or /sin6/)
    Proper nouns: e.g. 費先生 /bei3/ /sin1/ /saang1/, 單志堅 /sin6/ /zi3/ /gin1/
    Part of speech: e.g. record: rec’ord (verb) versus r’ecord (noun)
• Text-to-phoneme mapping may use pronunciation dictionaries and letter-to-sound rules to facilitate correct mapping
    Pronunciation dictionaries are the lexicons used to map known words to their phonemes
    Letter-to-sound rules are used for words that are absent from the dictionary, especially proper names or foreign words, e.g. 'Scooby Doo'
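
The dictionary-plus-rules strategy described above can be sketched as follows. This is a toy illustration, not a real pronunciation lexicon: the `LEXICON` entries, the tag names ("present", "past", "verb", "noun") and the three letter-to-sound rules are assumptions chosen only to cover the examples on this slide.

```python
# Hypothetical mini-lexicon: word -> {context tag: phoneme string}.
LEXICON = {
    "read": {"present": "riːd", "past": "rɛd"},
    "record": {"verb": "rɪˈkɔːd", "noun": "ˈrɛkɔːd"},
    "love": {"any": "lʌv"},
}

# Hypothetical letter-to-sound rules; letters without a rule are copied as-is.
LETTER_TO_SOUND = {"sc": "sk", "oo": "uː", "y": "i"}


def letter_to_sound(word):
    """Fallback for out-of-dictionary words: greedy match, two letters first."""
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in LETTER_TO_SOUND:
            phones.append(LETTER_TO_SOUND[pair])
            i += 2
        else:
            phones.append(LETTER_TO_SOUND.get(word[i], word[i]))
            i += 1
    return "".join(phones)


def to_phonemes(word, tag="any"):
    """Look the word up in the lexicon; fall back to letter-to-sound rules."""
    word = word.lower()
    entry = LEXICON.get(word)
    if entry is None:
        return letter_to_sound(word)
    if tag in entry:
        return entry[tag]
    if len(entry) == 1:
        return next(iter(entry.values()))
    # Several pronunciations and no usable tag: linguistic analysis is needed.
    raise ValueError(f"'{word}' is ambiguous; supply one of {sorted(entry)}")


print(to_phonemes("read", "present"), to_phonemes("read", "past"))
print(to_phonemes("Scooby"), to_phonemes("Doo"))
```

Calling `to_phonemes("read")` without a tag raises an error on purpose: the lexicon alone cannot choose between the two pronunciations, so an earlier analysis step has to supply the tense or part of speech.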
Prosody
• In linguistics, prosody is concerned with the elements of speech that are not individual phonetic segments (vowels and consonants) but are properties of syllables and larger units of speech
• These include features such as intonation, stress and rhythm
• While words convey what is being said, prosody conveys how it is being said
• Prosodic features include pitch, duration, energy (loudness) and more. They all help to convey emotions and emphasis, and to express words in different ways

Prosodic Control
Example of prosodic control

Sentence: We can actually go up there

Is it a question?
• Yes: intonation rises → We can actually go up there?
• No: intonation falls → We can actually go up there.
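
The decision in the example above can be written as a very small rule. The sketch below is an illustration only: the function names, the 120 Hz base pitch and the 20% change on the final word are made-up values, and the rule looks at punctuation alone, which is a simplification (many real questions do not end with a rise).

```python
def intonation(sentence):
    """Return 'rise' for a question and 'fall' otherwise (punctuation-based)."""
    return "rise" if sentence.strip().endswith("?") else "fall"


def pitch_targets(sentence, base_hz=120.0, change=0.2):
    """Assign a flat pitch to every word, then raise or lower the last one."""
    words = sentence.strip().rstrip("?.!").split()
    factor = 1 + change if intonation(sentence) == "rise" else 1 - change
    targets = [(word, base_hz) for word in words]
    if targets:
        last_word, _ = targets[-1]
        targets[-1] = (last_word, base_hz * factor)
    return targets


for text in ("We can actually go up there?", "We can actually go up there."):
    print(intonation(text), pitch_targets(text)[-1])
```

The same words produce two different pitch targets, which is the point of prosody prediction: the text analysis stage has to pass this information on so that waveform generation can realise it.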

Waveform Generation

TTS Evaluation
Objective Measurement
• Objective measurement is the comparison of waveform similarities between natural and synthesised speech
• The figure below compares the waveforms of human and synthesised speech. Details that are present in natural speech seem to be absent from synthesised speech, and the difference can be quantified, e.g. as a signal distance

[Figure: waveforms of human speech (top) and synthesised speech (bottom)]
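
As a rough illustration of a signal distance, the sketch below computes the root-mean-square difference between two waveforms given as equal-length lists of samples. The function name `rms_distance` and the toy sine-wave signals are assumptions for this example; practical systems usually compare spectral features after time alignment rather than raw samples.

```python
import math


def rms_distance(reference, synthesised):
    """Root-mean-square difference between two equal-length sample sequences."""
    if len(reference) != len(synthesised):
        raise ValueError("waveforms must have the same number of samples")
    if not reference:
        raise ValueError("waveforms must not be empty")
    squared = sum((r - s) ** 2 for r, s in zip(reference, synthesised))
    return math.sqrt(squared / len(reference))


# Toy signals: a sine wave stands in for natural speech, and a copy at half
# the amplitude stands in for a synthesised version that lost some detail.
natural = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
flattened = [0.5 * x for x in natural]

print(rms_distance(natural, natural))    # identical signals give 0.0
print(rms_distance(natural, flattened))  # a larger value means less similar
```

A smaller distance means the synthesised waveform is closer to the natural one, so this kind of score can be computed automatically without any listeners.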

TTS Evaluation
Subjective Measurement
• Subjective measurement is the comparison of opinion scores for various factors. To get the opinion score for naturalness or quality, a group of people is asked to rate the synthesised speech in listening tests, and the ratings are averaged into a mean opinion score (MOS)
• MOS is the most frequently used method to evaluate the quality of generated speech. Ratings are typically given on a scale of 1 to 5, and real human speech typically scores a MOS of 4.5 to 4.8

[Li et al., 2018]

Task 3: TTS Evaluation by Subjective Measurement

Subjective Measurement

• Listen to the four synthesised speech audio files below and rate each one on its “intelligibility”, “naturalness” and “quality” on a 5-point scale (1 = poor, 5 = excellent)
• Collect the ratings from your peers and calculate the average scores for the four audio files

Audio files: Speech 1, Speech 2, Speech 3, Speech 4

Task 3: TTS Evaluation by Subjective Measurement
Rating (1 = Poor, 5 = Excellent)

Speech     Rater                    Intelligibility   Naturalness   Quality
Speech 1   Your rating
           Rating of classmate 1
           Rating of classmate 2
           Average rating
Speech 2   Your rating
           Rating of classmate 1
           Rating of classmate 2
           Average rating
Speech 3   Your rating
           Rating of classmate 1
           Rating of classmate 2
           Average rating
Speech 4   Your rating
           Rating of classmate 1
           Rating of classmate 2
           Average rating
Summary of Task 3
The example below shows how we collect the rating scores from a group of listeners and take the average rating as the subjective evaluation

Example Evaluation:
Speech 1   3, 2, 2, 3, 3 | Avg = 2.6
Speech 2   4, 3, 3, 3, 3 | Avg = 3.2
Speech 3   1, 2, 2, 2, 3 | Avg = 2.0
Speech 4   4, 4, 3, 3, 4 | Avg = 3.6

[Figure: bar chart of the subjective score (axis from 0 to 4) for Speech 1 to Speech 4]
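
The averaging step in this example can be reproduced with a few lines of code. The sketch below uses the listener ratings listed in the example evaluation; the variable and function names are chosen for illustration only.

```python
from statistics import mean

# Listener ratings from the example evaluation (five listeners per file).
ratings = {
    "Speech 1": [3, 2, 2, 3, 3],
    "Speech 2": [4, 3, 3, 3, 3],
    "Speech 3": [1, 2, 2, 2, 3],
    "Speech 4": [4, 4, 3, 3, 4],
}


def mean_opinion_scores(ratings):
    """Average each file's ratings, rounded to one decimal place."""
    return {name: round(mean(scores), 1) for name, scores in ratings.items()}


mos = mean_opinion_scores(ratings)
best = max(mos, key=mos.get)  # the file that listeners rated highest

for name, score in mos.items():
    print(name, score)
print("Highest MOS:", best)
```

With more listeners the average becomes a more reliable estimate, which is why formal listening tests use larger groups than a few classmates.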

Extension of TTS
Voice Conversion
Conversion of TTS synthesised speech to the speech of the target speaker
[Audio demos 1 and 2: Original | Voice Conversion | Target]

Visual Speech Synthesis
• Also known as talking head or avatar
• Focuses on the speech and lip synchronisation of an avatar or talking head
Demo: CUHK’s technology

Visit this website for an interactive demonstration: https://ttsdemo.com/

Summary of Knowledge

Speech Synthesis Technology
Current approaches to developing speech synthesis technologies require big data for training

Evaluation
Evaluation of TTS includes objective and subjective evaluation

Text Analysis
Text is analysed for normalisation, phoneme mapping and prosody prediction to produce a linguistic representation

Waveform Generation
Maps the linguistic representation to audio by training on labelled big data
Ethics

Task 1: Fairness Issues

When you explore various TTS demonstrations, why is Cantonese often unavailable?

Task 1: Fairness Issues
1. When you explore various TTS demonstrations, why is Cantonese often unavailable?

• TTS for Cantonese is often not available
• There is a lack of annotated data in Cantonese for training TTS

Summary of Task 1

Big data is necessary to train a TTS system

Languages that do not have sufficient data will not be well supported by TTS

For instance, the current release of Microsoft Neural Voices only supports four languages

Task 2: Security Issues

Watch the video and consider:

What security issues can appear if this technology is used to generate fake voices?

Source: Bloomberg Quicktake

Task 2: Security Issues
1. What security issues can appear if this technology is used to generate
fake voices?

• Fake news, phone scams, voice lock overrides, etc.

Summary of Task 2
“A person's voice is part of their identity”
Voice impersonation attacks:
By generating personalised fake voices, criminals can steal other people’s identities to conduct phone scams or break through the voice lock of smartphones, i.e. attacks on voice biometric systems

Summary of Ethics

Fairness Issue
Low-resource languages of minority communities cannot be well supported by TTS due to the lack of annotated training data

Security Issue
Personalised fake voices are dangerous and potentially harmful if they are misused for identity theft
