
HG3052: Speech Synthesis and Recognition

Course Outline for Semester 1, AY2019/20


Division of Linguistics and Multilingual Studies

Course code/title: HG3052 Speech Synthesis and Recognition


Location: Tutorial Room +109 (Block SS4, SS4-01-37)
Time: 10:30–13:30, Fridays

Pre-requisite: HG2003 Phonetics & Phonology

Course Coordinator: Dr Scott Reid Moisik
E-mail: scott.moisik@ntu.edu.sg
Office: HSS-03-51
Office hours: 15:30–17:30, Thursdays
Office phone: (+65) 6316 8791

1. Course Description
Speech synthesis and recognition deal with the design and implementation of computer systems
that can “produce” and “perceive” speech signals. The goal of this course is to equip you with the
knowledge and skills necessary to understand, evaluate, design, and implement a wide range of
speech synthesis and recognition systems. These two fields are highly relevant to the linguistics-
related sector of the information and communications technology industry. Taking this course will
open up opportunities for you by preparing you to enter positions related to these topics or to
pursue advanced degrees dealing with speech synthesis and/or recognition. For those of you who
are more interested in linguistics research, the course will benefit your study of phonetics and
phonology, since many of the topics discussed reinforce and extend issues in these fields,
particularly in acoustic and perceptual phonetics. (This course does not assume any background
in computer programming, and it will not require you to learn to program beyond working with
scripts in Praat; it is thus a very gentle entry point into the fields of speech synthesis and/or
recognition. Be advised, however, that computer programming is an essential skill to master for
advanced studies in these fields.)
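
For a sense of what scripted phonetic analysis looks like, here is a minimal Python sketch using the parselmouth library (third-party Python bindings to Praat) to run two standard Praat analyses. This is an illustration only, not the course's required workflow (the course uses Praat's own scripting language), and the file name "utterance.wav" is an invented example.

    # Minimal sketch of scripted acoustic analysis with parselmouth
    # (Python bindings to Praat); "utterance.wav" is a placeholder file.
    import parselmouth

    snd = parselmouth.Sound("utterance.wav")
    pitch = snd.to_pitch()                    # Praat's pitch (F0) analysis
    formant = snd.to_formant_burg()           # Burg-method formant tracking

    f0 = pitch.selected_array["frequency"]    # F0 track in Hz (0 = unvoiced)
    print("Mean F0 of voiced frames (Hz):", f0[f0 > 0].mean())
    print("F1 at t = 0.5 s (Hz):", formant.get_value_at_time(1, 0.5))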

2. Learning Objectives
• Describe the different types of speech synthesis and recognition systems and identify their uses
in industry and in research
• Explain the inner workings of each type of system
• Evaluate the performance of these systems
• Engage with the literature on these systems

3. Course Content
The course begins with an overview of natural language processing (NLP) to situate speech synthesis
and recognition within industry, and reviews familiar applications of these systems. Necessary topics
in phonetics and phonology are introduced or reviewed, with a particular focus on acoustic
phonetics. Speech synthesis is then covered, with a focus on concatenative and statistical
parametric systems; formant-based and articulatory synthesis are also covered in detail. In relation
to these latter topics, you will learn about modeling the structures of the vocal tract to produce
articulatory-based systems and about the applications these have in speech research. Next, speech
recognition is covered. You will learn about the design parameters of these systems and their
tie-ins with other topics in NLP (such as language models and syntactic and semantic parsers). The
Hidden Markov Model (HMM) is discussed in detail, along with its application to the task of
recognizing speech in conjunction with acoustic analysis and modeling of the speech signal.
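
As a concrete preview of the HMM material, the following toy Python sketch shows Viterbi decoding over a discrete two-state HMM. All probabilities are invented for illustration; real recognizers use acoustic models (e.g., Gaussian mixtures or neural networks) over features such as MFCCs rather than hand-set tables.

    import numpy as np

    # Toy HMM: two phone-like states emitting three discrete acoustic symbols.
    initial = np.array([0.7, 0.3])            # P(state at t = 0)
    transition = np.array([[0.8, 0.2],        # P(next state | current state)
                           [0.3, 0.7]])
    emission = np.array([[0.6, 0.3, 0.1],     # P(symbol | state)
                         [0.1, 0.3, 0.6]])

    def viterbi(observations):
        """Most likely state sequence for a list of symbol indices."""
        T, n = len(observations), len(initial)
        score = np.zeros((T, n))              # best log-probability ending in each state
        back = np.zeros((T, n), dtype=int)    # backpointers for path recovery
        score[0] = np.log(initial) + np.log(emission[:, observations[0]])
        for t in range(1, T):
            for s in range(n):
                cand = score[t - 1] + np.log(transition[:, s])
                back[t, s] = np.argmax(cand)
                score[t, s] = cand[back[t, s]] + np.log(emission[s, observations[t]])
        path = [int(np.argmax(score[-1]))]    # trace back from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    print(viterbi([0, 0, 1, 2, 2]))           # -> [0, 0, 0, 1, 1] for these tables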

4. Course Outline

Week  Date    Topics                                Suggested Readings
1     16 Aug  Introduction to NLP                   Huckvale 2003 (x2)
2     23 Aug  Speech Articulation                   Browman & Goldstein 1989
3     30 Aug  Speech Prosody                        Ladefoged & Johnson 2014; MIT OCW (ToBI)
4     6 Sep   Acoustic Phonetics                    Johnson 2011
5     13 Sep  Bottom-up Synthesis                   Mermelstein 1973
6     20 Sep  Quiz 1
7     27 Sep  TTS & Unit-Selection Synthesis        Sproat & Olive 1995
–     4 Oct   Recess
8     11 Oct  HMMs                                  Fosler-Lussier 1998
9     18 Oct  Statistical-Parametric Synthesis      Zen, Tokuda, & Black 2009
10    25 Oct  Automatic Speech Recognition (ASR)    Jurafsky & Martin 2009, Ch. 8 (pp. 285–295)
11    1 Nov   ASR: Acoustic Analysis                Jurafsky & Martin 2009, Ch. 8 (pp. 295–314)
12    8 Nov   ASR: Language Model and Training      Jurafsky & Martin 2009, Ch. 8 (pp. 314–334)
13    15 Nov  ASSIGN 1, ASSIGN 2, Quiz 2

5. Learning Outcomes
• Discuss the state of the art and the major challenges that remain in improving the naturalness of
speech synthesis and the accuracy of speech recognition
• Demonstrate knowledge of how to go about developing such systems
• Work with basic tools that assist in developing speech synthesis and recognition systems

6. Course Assessment Components

Assessments and weighting:

Assessment #1: QUIZ 1 (CQ) – 20%
This is an in-class, closed-book quiz comprising MCQs and short-answer questions, to be held in
Week 6 (see course outline).
Expected preparation time: 10 hours

Assessment #2: QUIZ 2 (CQ) – 20%
This is an in-class, closed-book quiz comprising MCQs and short-answer questions, to be held in
Week 13.
Expected preparation time: 10 hours

Assessment #3: INDIVIDUAL ASSIGNMENT 1 (WA) – 25%
You will explore some basic speech synthesis systems and approaches. You must attempt to create
simple utterances using concatenative (unit-selection), formant, and articulatory synthesis with the
help of computational tools such as Praat and VocalTractLab (for a taste of formant synthesis, see
the sketch following this section). A written component will give you a chance to make critical
observations about the difficulties you faced.
Expected preparation time: 17 hours
Due Week 8
Assessment #4: INDIVIDUAL ASSIGNMENT 2 (WA) – 25%
You will gain hands-on experience in designing (on paper) a basic speech recognition system. You
will create a proposal, or “pitch”, for a system: specify its purpose and limitations, and produce a
document that outlines the design of your recognition system. Through the document, you must
demonstrate a solid understanding of the technical details of how speech recognition systems work.
You will need to draw on the literature to make your “pitch” more attractive by offering
state-of-the-art features explained in simple language.
Expected preparation time: 17 hours
Due Week 13
Assessment #5: CLASS PARTICIPATION (CP) – 10%
You will be expected to participate in class activities. It is strongly
recommended that you bring a laptop to class in order to gain practical
experience working with various tools related to speech synthesis and
recognition (primarily in Praat).
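
For a taste of the formant-synthesis component of Assignment 1, here is a minimal Python sketch in the spirit of a Klatt-style cascade synthesizer: an impulse-train source is passed through one second-order resonator per formant. The formant frequencies and bandwidths roughly suggest an [a]-like vowel and, like the output file name, are invented for illustration; the assignment itself uses tools such as Praat and VocalTractLab.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.io import wavfile

    fs, f0, dur = 16000, 120, 0.5                     # sample rate (Hz), F0 (Hz), seconds
    formants = [(700, 130), (1220, 70), (2600, 160)]  # (centre frequency, bandwidth) in Hz

    # Glottal source: a crude impulse train at roughly f0.
    source = np.zeros(int(fs * dur))
    source[::fs // f0] = 1.0

    # Cascade of second-order resonators, one per formant.
    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs                 # pole angle from centre frequency
        a = [1.0, -2 * r * np.cos(theta), r ** 2]     # complex-conjugate pole pair
        b = [sum(a)]                                  # scale for unity gain at DC
        signal = lfilter(b, a, signal)

    # Normalize and write a 16-bit WAV file ("vowel.wav" is a placeholder name).
    wavfile.write("vowel.wav", fs,
                  (0.9 * signal / np.abs(signal).max() * 32767).astype(np.int16))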

7. Suggested Reading Materials

Textbooks

Allen, J. (1995). Natural Language Understanding (2nd ed.). Benjamin/Cummings.

Dickinson, M., Brew, C., & Meurers, D. (2013). Language and Computers. Wiley-Blackwell.

Johnson, K. (2011). Acoustic and Auditory Phonetics (3rd Edition). Wiley.


Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing (2nd ed.). Prentice Hall.

Ladefoged, P., & Johnson, K. (2014). A Course in Phonetics (7th ed.). Wadsworth.

Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
Cambridge, MA: MIT Press.

Articles

Birkholz, P. (2011). A survey of self-oscillating lumped-element models of the vocal folds. In B. J.
Kröger & P. Birkholz (Eds.), Studientexte zur Sprachkommunikation: Elektronische
Sprachsignalverarbeitung 2011 (pp. 47–58). Dresden: TUDPress.

Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology,
6(2), 201–251.

Charpentier, F., & Stella, M. (1986). Diphone synthesis using an overlap-add technique for
speech waveforms concatenation. In ICASSP ’86. IEEE International Conference on
Acoustics, Speech, and Signal Processing (Vol. 11, pp. 2015–2018).

Fant, G., Liljencrants, J., & Lin, Q. (1985). A four-parameter model of glottal flow. Quarterly
Progress and Status Report, Speech Transmission Laboratory, Royal Institute of
Technology, 26(4), 1–13.

Flanagan, J. L., Coker, C. H., Rabiner, L. R., Schafer, R. W., & Umeda, N. (1970). Synthetic voices
for computers. IEEE Spectrum, 7(10), 22–45.

Fosler-Lussier, E. (1998). Markov Models and Hidden Markov Models: a brief tutorial.
International Computer Science Institute, Berkeley, California.

Ishizaka, K., & Flanagan, J. L. (1972). Synthesis of voiced sounds from a two-mass model of the
vocal cords. The Bell System Technical Journal, 51(6), 1233–1268.

Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech
representations using a pitch-adaptive time–frequency smoothing and an
instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in
sounds. Speech Communication, 27(3), 187–207.

King, S. (2011). An introduction to statistical parametric speech synthesis. Sadhana, 36(5), 837–
852.

King, S. (2015). A reading list of recent advances in speech synthesis. Proceedings of the 18th
International Congress of Phonetic Sciences. Retrieved June 12, 2017 from
https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS1043.pdf

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. The Journal of the
Acoustical Society of America, 67(3), 971–995.

Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations
among female and male talkers. The Journal of the Acoustical Society of America, 87(2),
820–857.

Laroche, J., Stylianou, Y., & Moulines, E. (1993). HNM: A simple, efficient harmonic+noise model
for speech. In Applications of Signal Processing to Audio and Acoustics (pp. 169–172).

Mermelstein, P. (1973). Articulatory model for the study of speech production. The Journal of
the Acoustical Society of America, 53(4), 1070–1082.

Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for
text-to-speech synthesis using diphones. Speech Communication, 9(5), 453–467.

Rabiner, L. R. (2009). A tutorial on Hidden Markov Models. UCSC.

Sproat, R. W., & Olive, J. P. (1995). Text-to-speech synthesis. AT&T Technical Journal, 74(2), 35–
44.

Story, B. H. (1995). Physiologically-based speech simulation using an enhanced wave-reflection
model of the vocal tract (Doctoral dissertation). The University of Iowa.

Story, B. H., & Titze, I. R. (1995). Voice simulation with a body-cover model of the vocal folds.
The Journal of the Acoustical Society of America, 97(2), 1249–1260.

Valbret, H., Moulines, E., & Tubach, J. P. (1992). Voice transformation using PSOLA technique.
Speech Communication, 11(2), 175–187.

van Santen, J. P. H. (1994). Assignment of segmental duration in text-to-speech synthesis.
Computer Speech & Language, 8(2), 95–128.

Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech
Communication, 51(11), 1039–1064.

Essays

Huckvale, M. (2003). Conversational machines in science fiction and science fact. University
College London. Retrieved June 12, 2017 from
http://markhuckvale.com/research/essay/sf.htm

Huckvale, M. (2003). Why are machines less proficient than humans at recognising words?
University College London. Retrieved June 12, 2017 from
http://markhuckvale.com/research/essay/proficient.php

Electronic Resources

iPA Phonetics, free iOS app for iPhone: search “iPA Phonetics” in the App Store

Voice Quality: The Laryngeal Articulator Model, Resources (General Resources > Resources),
https://www.cambridge.org/ca/academic/subjects/languages-linguistics/phonetics-and-phonology/voice-quality-laryngeal-articulator-model?format=HB

MIT OpenCourseWare on ToBI Annotation,
https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-911-transcribing-prosodic-structure-of-spoken-utterances-with-tobi-january-iap-2006/index.htm

Praat: Doing phonetics by computer, http://www.fon.hum.uva.nl/praat/
