1. Course Description
Speech synthesis and recognition deals with the design and implementation of computer systems
that can “produce” and “perceive” speech signals. The goal of this course is to equip you with the
knowledge and skills necessary to understand, evaluate, design, and implement a wide range of
speech synthesis and recognition systems. These two fields are highly relevant in the linguistics-
related sector of the information and communications technology industry. Taking this course will
open up opportunities for you by preparing you to enter positions related to these topics or to
pursue advanced degrees dealing with speech synthesis and/or recognition. For those of you who
are more interested in linguistics research, the course will benefit the study of phonetics and
phonology, since many of the topics discussed reinforce and extend issues in these fields,
particularly concerning acoustic and perceptual phonetics. (This course does not assume any
background in computer programming, and it will not require you to learn how to program
beyond working with scripts in Praat; thus, it is a very gentle starting point for entering the fields of
speech synthesis and/or recognition. However, be advised that computer programming is an
essential skill to master for advanced studies in these fields.)
2. Learning Objectives
• Describe the different types of speech synthesis and recognition systems and identify their uses
in industry and in research
• Explain the inner workings of each type of system
• Evaluate the performance of these systems
• Engage with the literature about these systems
3. Course Content
The course begins with an overview of natural language processing (NLP) to situate speech synthesis
and recognition within industry and reviews familiar applications of these systems. Necessary topics
in phonetics and phonology are introduced or reviewed, with a particular focus on acoustic
phonetics. The topic of speech synthesis is then covered, with a focus on concatenative and
statistical parametric systems, though formant-based and articulatory synthesis are also covered
in detail. In relation to these latter topics, you will learn how the structures of the vocal tract are
modeled to produce articulatory-based systems and what applications such systems have in speech
research. Next, the topic of speech recognition is covered. You will learn about the design
parameters of these systems and their tie-ins with other topics in NLP (such as language models and
syntactic and semantic parsers). The Hidden Markov Model is discussed in detail, along with its
application to the task of recognizing speech in conjunction with acoustic analysis and modeling of
the speech signal.
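To preview the kind of computation an HMM-based recognizer performs, the sketch below runs Viterbi decoding over a toy two-state model. All states, probabilities, and observations here are invented for illustration; a real recognizer would use phone-level states and continuous acoustic features rather than a handful of quantized symbols.

```python
# A minimal sketch of Viterbi decoding for a Hidden Markov Model,
# the core search step in HMM-based speech recognition.
# All values below are toy numbers chosen for illustration only.
import numpy as np

states = ["sil", "vowel"]          # hypothetical phone-like states
obs = [0, 1, 1, 0]                 # toy quantized acoustic observations

start = np.array([0.8, 0.2])       # P(state at t = 0)
trans = np.array([[0.7, 0.3],      # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],       # P(observation | state)
                 [0.2, 0.8]])

def viterbi(obs, start, trans, emit):
    """Return the most likely state sequence for the observations."""
    n_states = len(start)
    T = len(obs)
    delta = np.zeros((T, n_states))            # best path probability so far
    psi = np.zeros((T, n_states), dtype=int)   # backpointers
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * emit[j, obs[t]]
    # Trace the best path back from the final timestep.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path))

best = viterbi(obs, start, trans, emit)
print([states[s] for s in best])   # prints ['sil', 'vowel', 'vowel', 'sil']
```

In a full recognizer, the same dynamic-programming idea is applied over much larger state spaces, with emission probabilities computed from an acoustic model of the speech signal and transition structure supplied by pronunciation dictionaries and language models.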
4. Course Outline
5. Learning Outcomes
• Discuss the state-of-the-art and the major challenges still faced to improve the naturalness of
speech synthesis and the accuracy of speech recognition
• Demonstrate a knowledge of how to go about developing such systems
• Be able to work with some basic tools which assist in developing speech synthesis and
recognition systems
6. Course Assessment Components
Assessments %
Assessment #1: QUIZ 1 (CQ) 20
This is an in-class, closed-book quiz consisting of multiple-choice and short-answer
questions, to be held in Week 13.
Expected preparation time: 10 hours
Textbooks
Dickinson, M., Brew, C., & Meurers, D. (2013). Language and Computers. Wiley-Blackwell.
Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
Cambridge, MA: MIT Press.
Articles
Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology,
6(2), 201–251.
Charpentier, F., & Stella, M. (1986). Diphone synthesis using an overlap-add technique for
speech waveforms concatenation. In ICASSP ’86. IEEE International Conference on
Acoustics, Speech, and Signal Processing (Vol. 11, pp. 2015–2018).
Fant, G., Liljencrants, J., & Lin, Q. (1985). A four-parameter model of glottal flow. Quarterly
Progress and Status Report, Speech Transmission Laboratory, Royal Institute of
Technology, 26(4), 1–13.
Flanagan, J. L., Coker, C. H., Rabiner, L. R., Schafer, R. W., & Umeda, N. (1970). Synthetic voices
for computers. IEEE Spectrum, 7(10), 22–45.
Fosler-Lussier, E. (1998). Markov Models and Hidden Markov Models: a brief tutorial.
International Computer Science Institute, Berkeley, California.
Ishizaka, K., & Flanagan, J. L. (1972). Synthesis of voiced sounds from a two-mass model of the
vocal cords. The Bell System Technical Journal, 51(6), 1233–1268.
King, S. (2011). An introduction to statistical parametric speech synthesis. Sadhana, 36(5), 837–
852.
King, S. (2015). A reading list of recent advances in speech synthesis. Proceedings of the 18th
International Congress of Phonetic Sciences. Retrieved June 12, 2017 from
https://www.internationalphoneticassociation.org/icphs-
proceedings/ICPhS2015/Papers/ICPHS1043.pdf
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. The Journal of the
Acoustical Society of America, 67(3), 971–995.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations
among female and male talkers. The Journal of the Acoustical Society of America, 87(2),
820–857.
Laroche, J., Stylianou, Y., & Moulines, E. (1993). HNM: a simple, efficient harmonic+noise model
for speech. In Applications of Signal Processing to Audio and Acoustics, 169–172.
Mermelstein, P. (1973). Articulatory model for the study of speech production. The Journal of
the Acoustical Society of America, 53(4), 1070–1082.
Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for
text-to-speech synthesis using diphones. Speech Communication, 9(5), 453–467.
Sproat, R. W., & Olive, J. P. (1995). Text-to-speech synthesis. AT&T Technical Journal, 74(2), 35–
44.
Story, B. H., & Titze, I. R. (1995). Voice simulation with a body-cover model of the vocal folds.
The Journal of the Acoustical Society of America, 97(2), 1249–1260.
Valbret, H., Moulines, E., & Tubach, J. P. (1992). Voice transformation using PSOLA technique.
Speech Communication, 11(2), 175–187.
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech
Communication, 51(11), 1039–1064.
Essays
Huckvale, M. (2003). Conversational machines in science fiction and science fact. University
College London. Retrieved June 12, 2017 from
http://markhuckvale.com/research/essay/sf.htm
Huckvale, M. (2003). Why are machines less proficient than humans at recognising words?
University College London. Retrieved June 12, 2017 from
http://markhuckvale.com/research/essay/proficient.php
Electronic Resources
iPA phonetics, free iOS app for iPhone: search “ipa phonetics” in the app store
Voice Quality: The Laryngeal Articulator Model, Resources (General Resources > Resources),
https://www.cambridge.org/ca/academic/subjects/languages-linguistics/phonetics-and-
phonology/voice-quality-laryngeal-articulator-model?format=HB