Chapter05 AKE Eng v2.0
Speak
Version 2.0
Copyright © CUHK Jockey Club AI for the Future Project. All rights reserved. 1
Update(s) in Version 2.0
• All pages: Updated the wording
• Pages 12, 15, 26, 31, 32, 35, 36, 38, 47: Added new slides
Content
Awareness
1. Human speech production and the information encoded in the speech signal
2. What text-to-speech (TTS) synthesis is and how it works
3. Various TTS applications and the difficulties of TTS
Knowledge
1. The historical background of speech synthesis
2. Modern TTS approaches and the key processes involved
3. The approaches to evaluating TTS
Ethics
1. Fairness and security issues that arise from TTS
Awareness
Spoken Language and Intelligence
Human Speech Production
Source: Loosejocks
What is Text-to-Speech (TTS) Synthesis?
Discussion
TTS Applications and Challenges
Example 1: Helping people with reading disabilities
Example 2: Helping people with speaking disabilities
Example 3: Automatic announcement system
Example 4: Text-to-visual-speech (TTVS) in pronunciation training
Speech Characteristics
Speech contains rich information such as:
• Meaning
• Intent
• Accent
• Age
• Attitude
• Education level
• Emotions
• Gender
• Health
• Language proficiency
• Personality
What kind of information should we put in machine-generated speech?
Task 1: Characteristics of Speech
Listen and describe the speech characteristics in the following table:
Angry
Elderly
Task 1: Characteristics of Speech
Answers:
Summary of Task 1
The ideal TTS is one that can generate speech with
rich information for human communication
Task 2: Difficulties of TTS
Listen to the following recordings and determine:
Which of them is/are recorded by humans? Which is/are recorded by machines?
Test 1 A B
Test 2 A B
Test 3 A B
Test 4 A B
Task 2: Difficulties of TTS
Test 1: “That girl did a video about Star Wars lipstick”
A: Human B: Machine
Test 3: “George Washington was the first President of the United States.”
A: Machine B: Human
Summary of Task 2
Summary of Awareness
TTS definition
TTS accepts written input and generates spoken output
TTS challenges
The long-term goal of TTS is to be able to synthesise all speech characteristics for effective communication with listeners
Knowledge
Linguistic Hierarchy
Acoustics
Historical Background
Von Kempelen’s speaking machine
• An early example of speech synthesis technology, designed in 1791
• It was created to mimic the windpipe and articulators of the human speech production system
Von Kempelen’s speaking machine is quite similar to the human speech production system
Can you see the similarities between the two voice production processes?
Current Development
Speech synthesis today uses statistical algorithms and deep learning
approaches to improve its results
Modern TTS Approaches: Training Process
Modern TTS Approaches: Testing Process
Text Analysis
Text analysis is a procedure that involves the translation of input text into
symbolic linguistic forms
Steps of text analysis include:
Text Normalisation
Text-to-Phoneme Mapping
Prosody Prediction
Task 1: Text Normalisation
1. The time now is 1:25pm, and the score of the basketball match is 3:2
2. It was in 1998 that he sold 1998 boxes of chocolate and
earned $200 instead of $20 million
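3. 我外號「高人」,有成150cm高。 (“My nickname is ‘Tall Guy’, and I am a full 150 cm tall.”)
4. 她弟弟出生於2014/3/31號。 (“Her younger brother was born on 31 March 2014.”)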
Task 1: Text Normalisation
1. Convert the statements into words ONLY, e.g. “~10%” becomes “around ten percent”
• The time now is 1:25pm, and the score of the basketball match is 3:2.
Answer: The time now is one twenty-five pm, and the score of the basketball match is
three to two.
• It was in 1998 that he sold 1998 boxes of chocolate and earned $200 instead of $20
million.
Answer: It was in nineteen ninety-eight that he sold one thousand nine hundred and ninety-eight boxes
of chocolate and earned two hundred dollars instead of twenty million dollars.
• 我外號「高人」,有成150cm高。
Answer: 我外號「高人」,有成一百五十公分高。
• 她弟弟出生於2014/3/31號。
Answer: 她弟弟出生於二零一四年三月三十一號。
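The year-versus-quantity ambiguity in the second statement can be sketched in code. This is a minimal illustration, not a production normaliser: the function names (`number_to_words`, `year_to_words`, `normalise`) and the rule that a four-digit number after "in" is read as a year are assumptions made for this example. It only covers whole numbers up to 9999 and plain dollar amounts, so inputs such as "$20 million" or "1:25pm" are out of scope.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def number_to_words(n):
    """Read n (0..9999) as a quantity, e.g. 1998 boxes."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n < 100:
        return below_hundred(n)
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    words = " ".join(parts)
    if tail:
        words += " and " + below_hundred(tail)
    return words


def year_to_words(y):
    """Read a four-digit number as a year, e.g. 1998 -> nineteen ninety-eight."""
    if 2000 <= y <= 2009:
        return number_to_words(y)
    high, low = divmod(y, 100)
    if low == 0:
        return f"{below_hundred(high)} hundred"
    if low < 10:
        return f"{below_hundred(high)} oh {ONES[low]}"
    return f"{below_hundred(high)} {below_hundred(low)}"


def normalise(text):
    """Toy normaliser: dollars first, then years (assumed rule), then quantities."""
    text = re.sub(r"\$(\d+)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars", text)
    text = re.sub(r"\bin (\d{4})\b",
                  lambda m: "in " + year_to_words(int(m.group(1))), text)
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group(0))), text)


print(normalise("It was in 1998 that he sold 1998 boxes of chocolate and earned $200."))
# prints: It was in nineteen ninety-eight that he sold one thousand nine hundred
# and ninety-eight boxes of chocolate and earned two hundred dollars.
```

The same digits "1998" produce two different readings, which is why a normaliser needs context rather than a simple digit-to-word table.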
Summary of Task 1
Task 2: Text-to-Phoneme Mapping
Sentence 1:
“There are many books to read but I have read them all.”
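Sentence 2:
"當單志堅單獨行街時,看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
(“When 單志堅 was out walking alone, he saw the record shop beside HSBC holding a big Christmas sale.”)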
Task 2: Text-to-Phoneme Mapping
• Sentence 1: “There are many books to read but I have read them all.”
Answer:
Read: (present tense) /riːd/, (past tense) /rɛd/
• Sentence 2: "當單志堅單獨行街時,看見匯豐銀行旁的唱片行正在進行聖誕大特賣"
Answer:
單: (surname) /sin6/, (verb/adj) /daan1/
行: (walk) /hang4/, (bank) /hong4/, /hong6/, /haang4/
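One way to picture how a TTS system picks between such pronunciations is a lexicon keyed by both the word and a context tag. The sketch below is only an illustration under assumptions: the `LEXICON` table, the tag names ("present", "past", "surname", "other") and the `pronounce` function are hypothetical, and the tags are supplied by hand here, whereas a real system would obtain them from linguistic analysis.

```python
# Hypothetical mini-lexicon: (word, context tag) -> pronunciation.
# English entries use IPA; Cantonese entries use Jyutping.
LEXICON = {
    ("read", "present"): "riːd",
    ("read", "past"): "rɛd",
    ("單", "surname"): "sin6",
    ("單", "other"): "daan1",
}


def pronounce(word, tag):
    """Look up the pronunciation of a word given its context tag."""
    try:
        return LEXICON[(word, tag)]
    except KeyError:
        raise KeyError(f"no pronunciation for {word!r} with tag {tag!r}") from None


# "There are many books to read (present) but I have read (past) them all."
print([pronounce("read", tag) for tag in ("present", "past")])

# 單志堅 (surname) versus 單獨 (other)
print([pronounce("單", tag) for tag in ("surname", "other")])
```

The lookup itself is trivial; the hard part, as the answers above suggest, is deciding which tag applies to each occurrence.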
Summary of Task 2
• Linguistic analysis is needed for the TTS system to select the correct
pronunciations
• Text-to-phoneme mapping can be challenging due to:
Prosody: e.g. 上海市長江大橋 can be phrased as 上海市 / 長江大橋 (“Shanghai City, Yangtze River Bridge”) or as 上海 / 市長 / 江大橋 (“Shanghai, Mayor, Jiang Daqiao”)
Alternate pronunciation: e.g. "read" (/riːd/ or /rɛd/), "單" (/daan1/ or /sin6/)
Proper nouns: e.g. 費先生 /bei3/ /sin1/ /saang1/, 單志堅 /sin6/ /zi3/ /gin1/
Part of speech: e.g. "record" is stressed as reCORD (verb) versus REcord (noun)
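The phrasing ambiguity in 上海市長江大橋 can be reproduced with a small dictionary-based segmenter. This is a sketch under assumptions: the `segmentations` function and the five-entry `VOCAB` set are made up for this example, and real systems use much larger dictionaries plus statistical models to choose among the candidates.

```python
def segmentations(text, vocab):
    """Return every way to split text into words that all appear in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical vocabulary containing only the words needed for this example.
VOCAB = {"上海", "上海市", "市長", "長江大橋", "江大橋"}

for words in segmentations("上海市長江大橋", VOCAB):
    print(" / ".join(words))
# prints:
# 上海 / 市長 / 江大橋
# 上海市 / 長江大橋
```

Both splits are valid according to the dictionary, so the system needs extra knowledge to pick the intended one before it can place pauses correctly.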
Prosodic Control
Example of prosodic control
Is it a question?
Yes No
Waveform Generation
TTS Evaluation
Objective Measurement
• Objective measurement compares the waveforms of natural and synthesised speech
to quantify how similar they are, e.g. by computing a signal distance
• The figure below compares the waveforms of human and synthesised speech,
showing details of natural speech that seem absent in synthesised speech
Human Speech
Synthesised Speech
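A signal distance can be as simple as the root-mean-square difference between two waveforms of equal length. The sketch below is an assumption-laden illustration: the function name `rms_distance` and the two artificial signals are invented for this example. Practical evaluations usually align the recordings in time first and compare spectral features rather than raw samples.

```python
import math


def rms_distance(a, b):
    """Root-mean-square difference between two equal-length waveforms."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Artificial signals standing in for real recordings (assumption):
# the "human" signal carries an extra harmonic that the "synthesised" one lacks.
sample_rate = 8000
t = [n / sample_rate for n in range(800)]
human = [math.sin(2 * math.pi * 200 * x) + 0.3 * math.sin(2 * math.pi * 600 * x)
         for x in t]
synthesised = [math.sin(2 * math.pi * 200 * x) for x in t]

print(f"distance to itself:      {rms_distance(human, human):.3f}")
print(f"distance to synthesised: {rms_distance(human, synthesised):.3f}")
```

A distance of zero means the two waveforms are identical; the larger the value, the more detail of the natural signal is missing from the synthesised one.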
TTS Evaluation
Subjective Measurement
• Subjective measurement compares opinion scores for various factors. To obtain an
opinion score for naturalness or quality, a group of people is asked to rate the
synthesised speech in listening tests, e.g. the mean opinion score (MOS)
• MOS is the most frequently used method to evaluate the quality of generated speech.
MOS is typically rated on a scale of 1 to 5, and real human speech typically scores
between 4.5 and 4.8
Task 3: TTS Evaluation by Subjective Measurement
Subjective Measurement
Task 3: TTS Evaluation by Subjective Measurement
Intelligibility Naturalness Quality
Rating of classmate 1
Rating of classmate 2
Average Rating
Rating of classmate 1
Rating of classmate 2
Average Rating
Rating of classmate 1
Rating of classmate 2
Average Rating
Rating of classmate 1
Rating of classmate 2
Average Rating
Summary of Task 3
The figure below shows an example of how we collect the rating
scores from a group of listeners and get the average rating as the
subjective evaluation
Example Evaluation:
Speech 1: 3, 2, 2, 3, 3 | Avg = 2.6
Speech 2: 4, 3, 3, 3, 3 | Avg = 3.2
[Figure: subjective scores for Speech 1, Speech 2 and Speech 3]
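The averaging in the example evaluation can be written out in a few lines. This is a minimal sketch using the ratings listed above; the function name `mean_opinion_scores` and the dictionary layout are assumptions made for illustration.

```python
from statistics import mean

# Ratings from five listeners for each speech sample (taken from the example above).
ratings = {
    "Speech 1": [3, 2, 2, 3, 3],
    "Speech 2": [4, 3, 3, 3, 3],
}


def mean_opinion_scores(ratings):
    """Average the listener ratings for each speech sample."""
    return {name: mean(scores) for name, scores in ratings.items()}


for name, score in mean_opinion_scores(ratings).items():
    print(f"{name}: MOS = {score:.1f}")
# prints:
# Speech 1: MOS = 2.6
# Speech 2: MOS = 3.2
```

With only five listeners a single unusual rating shifts the average noticeably, so larger listening panels give more reliable scores.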
Extension of TTS
Voice Conversion
• Conversion of TTS synthesised speech to the speech of the target speaker
• Original → Voice Conversion → Target
Visual Speech Synthesis
• Also known as talking head or avatar
• Focuses on synchronising the speech and lip movements of an avatar or talking head
Demo 1, Demo 2
Demo: CUHK’s technology demonstration
Visit this website for an interactive demo: https://ttsdemo.com/
Summary of Knowledge
Speech Synthesis Technology
Current approaches to developing speech synthesis technologies require big data for training
Evaluation
Evaluation of TTS includes objective and subjective evaluation
Text Analysis
Text is analysed for normalisation, phoneme mapping, and prosodic prediction to produce a linguistic representation
Waveform Generation
Maps linguistic representation to audio by training on labelled big data
Ethics
Task 1: Fairness Issues
Task 1: Fairness Issues
1. When you explore various TTS demonstrations, why is Cantonese often
unavailable?
Summary of Task 1
Task 2: Security Issues
Task 2: Security Issues
1. What security issues can appear if this technology is used to generate
fake voices?
Summary of Task 2
“A person's voice is part of their identity”
Voice impersonation attacks:
By generating personalised fake voices, criminals can steal other
people’s identities to conduct phone scams or break through the
voice lock of smartphones, i.e. attacks on voice biometric systems
Summary of Ethics
Fairness Issue
Low-resource languages of minority communities cannot be well-supported by TTS due to the lack of annotated training data
Security Issue
Personalised fake voices are dangerous and potentially harmful if they are misused for identity theft