Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It’s
sometimes called “read aloud” technology.
With a click of a button or the touch of a finger, TTS can take words on a computer or other
digital device and convert them into audio.
The goal of TTS is the automatic conversion of written text into corresponding speech. The
speech synthesis field has witnessed much advancement in the past few decades.
TTS allows users to see text and hear it read aloud simultaneously. Many apps offer this; typically, text is spoken as it appears on the screen. Some software uses a computer-generated voice while other software uses a recorded human voice, and very often the user has a choice of gender and accent as well.
Tablets and smartphones usually have built-in text-to-speech features. The software reads text files, announces the names of programs or folders when they are pointed at on the screen, and can read certain web pages aloud.
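As an illustration of driving such a built-in speech engine, the following is a minimal sketch using the third-party pyttsx3 Python library, which wraps the operating system's native synthesizer (SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux); the spoken sentence and rate value are just examples.

    # Minimal sketch: speak a sentence through the OS speech engine.
    import pyttsx3

    engine = pyttsx3.init()

    # Many engines expose several voices; picking one changes gender/accent.
    voices = engine.getProperty("voices")
    if voices:
        engine.setProperty("voice", voices[0].id)

    engine.setProperty("rate", 160)  # speaking rate, roughly words per minute

    engine.say("Text to speech converts written words into audio.")
    engine.runAndWait()              # block until the utterance finishes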
TTS particularly benefits people with learning disabilities, such as dyslexia, who have difficulty reading large amounts of text, offering them an easier option for experiencing website content.
People who have literacy issues and those trying to learn another language often get frustrated
trying to browse the internet because so much text is confusing. Many people have difficulty
reading fluently in a second language even though they may be able to read content with a
basic understanding. TTS technology allows them to understand information in a way that
makes content easier to retain.
TTS synthesizer
A general TTS synthesizer comprises a text analysis module and a digital signal processing (DSP) module.
The text analysis module produces a phonetic transcription of the input text, together with the desired intonation and rhythm (i.e., prosody). The DSP module then produces synthetic speech from this specification.
The text analysis stage is extremely difficult, as it must derive all the information the DSP module needs to produce speech from mere text, and text alone does not contain all of that information. The first block of the text analysis module is the Text-to-Phonetics (T2P) block, which converts the input text into a phonetic transcription; the second block is the Text-to-Prosody block, which produces prosodic information.
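To make this division of labour concrete, here is a structural sketch of the two-module pipeline in Python. Every name and body below is a hypothetical placeholder chosen for illustration, not part of any real synthesizer.

    # Structural sketch of the two-module TTS pipeline described above.
    from dataclasses import dataclass

    @dataclass
    class AnalysisResult:
        phonemes: list   # phonetic transcription from the T2P block
        prosody: dict    # accenting, phrasing, duration, intonation targets

    def text_to_phonetics(text):
        # Placeholder: a real T2P block would normalize the text and look
        # each word up in a pronunciation lexicon.
        return [ch for ch in text.lower() if ch.isalpha()]

    def text_to_prosody(text, phonemes):
        # Placeholder: a real block would assign accents, phrase breaks,
        # segment durations and a pitch contour.
        return {"duration_ms": [80] * len(phonemes)}

    def text_analysis(text):
        phonemes = text_to_phonetics(text)
        return AnalysisResult(phonemes, text_to_prosody(text, phonemes))

    def dsp_module(analysis):
        # Placeholder: a real DSP module renders a waveform from the
        # phonetic and prosodic specification, by rule or by concatenation.
        return b"\x00" * sum(analysis.prosody["duration_ms"])

    waveform = dsp_module(text_analysis("Open the door."))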
Text-to-Phonetics: The Text-to-Phonetics block can be further broken down into a text normalization module and a word pronunciation module. Following is a brief description of each:
Text Normalization: A text normalization module organizes the input text into manageable lists of words. It identifies numbers, abbreviations, acronyms and idiomatic expressions and expands them into full words, as sketched below.
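Here is a minimal sketch of what such a normalization step might do; the abbreviation and digit tables are tiny illustrative samples, not a production lexicon, and a real module would also handle full numbers, dates, currency and so on.

    # Minimal sketch of text normalization: expand digits and abbreviations.
    ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
    DIGITS = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]

    def normalize(text):
        words = []
        for token in text.split():
            key = token.lower()
            if key in ABBREVIATIONS:
                words.append(ABBREVIATIONS[key])
            elif token.isdigit():
                # Spell out digit by digit; real modules read whole numbers.
                words.extend(DIGITS[int(d)] for d in token)
            else:
                words.append(key.strip(".,!?"))
        return words

    print(normalize("Dr. Smith lives at 221 Baker St."))
    # ['doctor', 'smith', 'lives', 'at', 'two', 'two', 'one', 'baker', 'street']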
Word Pronunciation: Once the sequence of words has been generated by the text normalization module, the pronunciation of each word is determined. Because pronunciation can depend on a word's grammatical role in the sentence, a morphosyntactic analyzer may be used in this case. A morphosyntactic analyzer tags the speech with various identities, such as prefixes, roots and suffixes, and organizes the sentences into syntactically related groups of words, such as nouns, verbs, and adjectives. The pronunciation of each word is then typically obtained from a pronunciation lexicon, supplemented by letter-to-sound rules for words not in the lexicon.
Text-to-Prosody
The term prosody refers to certain properties of the speech signal, such as audible changes in pitch (i.e., intonation), loudness, tempo, duration, stress and rhythm. The naturalness of speech can be described mainly in terms of prosody. Prosodic events are also referred to as suprasegmental features, since they span more than one phonetic segment.
The pattern of prosody is used to communicate the meaning of sentences. The Text-to-Prosody block produces prosodic information using the text and the output of the word pronunciation module. This block can be further broken down into smaller processes that determine accenting, phrasing, duration, and intonation for each sentence. Following is a brief description of these four processes:
Accenting: Accent or stress assignment is based on the category of the word. For example, content words (such as nouns, adjectives and verbs) tend to be accented, while function words (such as prepositions and auxiliary verbs) are usually not accented. This information is used for deciding which words receive pitch accents in the utterance (a minimal sketch of this heuristic follows the four descriptions).
Phrasing: Sentences are broken down into phrasal units and phrase boundaries are assigned to
the text. These boundaries indicate pauses and the resetting of intonation contours.
Intonation: Intonation clarifies the type and meaning of the sentence (neutral, imperative or
question). In addition, intonation also conveys information about the speaker's characteristics
(such as gender and age) and emotions. The intonation module generates a pitch contour for
the sentences. For example, the sentences “Open the door.” and “Open the door?” have very different prosody. In terms of intonation contour (defined as the rise and fall of pitch throughout the utterance), the first sentence is imperative and has a relatively flat, falling pitch contour, whereas the second is a question and exhibits a rise in pitch at the end of the phrase.
Duration: Segmental duration is an essential aspect of prosody. It affects the overall rhythm of
the speech, stress and emphasis, the syntactic structure of the sentence, and the speaking rate.
There are many factors which contribute to the duration of a speech segment. Some of them
are the identity of the phone itself, the identity and characteristics of neighboring phones, the
accent status of the syllable containing the phone, its phrase position and the speaking rate and
dialect of the speaker.
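As noted under Accenting, here is a minimal sketch of the content-word/function-word heuristic. The function-word list is a small illustrative sample; a real system would use a proper part-of-speech tagger.

    # Minimal sketch of accent assignment by word category.
    FUNCTION_WORDS = {
        "a", "an", "the", "of", "to", "in", "on", "at", "and", "or",
        "is", "are", "was", "were", "has", "have", "will", "would",
    }

    def assign_accents(words):
        # Content words are accented; function words are not.
        return [(w, w.lower() not in FUNCTION_WORDS) for w in words]

    print(assign_accents("the cat sat on the mat".split()))
    # [('the', False), ('cat', True), ('sat', True),
    #  ('on', False), ('the', False), ('mat', True)]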
The DSP module uses the phonetic transcription and prosodic information produced by the text analysis module to produce speech. This can be done in two ways, viz.,
by using a series of rules which formally describe the influence of one phoneme on another, or
by storing numerous instances of each speech sound unit and using them as they are, as in concatenative synthesis.
Based on these two ways, two main classes of TTS systems have emerged, namely, rule-based and concatenation-based synthesizers.
Previously, speech synthesis techniques were classified into articulatory, formant and concatenative speech synthesis, with the concatenative method being the most popular. With the advancement of statistical modelling of the speech production mechanism, statistical approaches emerged, first using hidden Markov models (HMMs) and later deep neural networks (DNNs) for speech synthesis. Next, we give a brief description of each of these methods.
Articulatory synthesis attempts to model the complete set of human vocal organs (i.e., the articulators and vocal folds) that produce speech as faithfully as possible. The articulatory control parameters include lip aperture, lip protrusion, tongue tip position, tongue tip height, tongue position and tongue height. It has the advantage of accurately modelling transients due to abrupt area changes arising from continuous vocal tract movement. Therefore, ideally, it should be the most adequate method to produce high-quality synthetic speech. On the other hand, it is also one of the most difficult methods to implement.
The first articulatory model was based on a table of vocal tract area functions from the larynx to the lips for each phonetic segment (Klatt 1987). The articulators are usually modeled with a set of area functions between the glottis and the mouth. For rule-based synthesis, the articulatory control parameters may be, for example, lip aperture, lip protrusion, tongue tip height, tongue tip position, tongue height, tongue position and velic aperture. Phonatory or excitation parameters may include glottal aperture, cord tension and lung pressure.
During the process of speech generation, the vocal tract muscles cause the articulators to move and hence the shape of the vocal tract changes, which results in the production of different sounds. The data on the movement of the vocal tract during speaking that drives an articulatory model is generally derived from X-ray analysis of natural speech. However, this data is usually only 2-D while the real vocal tract is naturally 3-D, so rule-based articulatory synthesis is very difficult to optimize due to the unavailability of sufficient data on the motions of the articulators during the production of speech. Another disadvantage of articulatory synthesis is that X-ray data do not describe the masses or degrees of freedom of the articulators. In addition, the movements of the tongue are so complicated that it is almost impossible to model them precisely. Articulatory synthesis is quite rarely used in present systems; however, since the analysis methods are developing fast and computational resources are increasing rapidly, it might be a potential synthesis method in the future.
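To make the area-function view concrete, the following sketch treats the vocal tract as a chain of uniform tube sections from glottis to lips and computes a reflection coefficient at each junction, in the style of the Kelly-Lochbaum tube model. The area values are illustrative, not measured data, and the sign convention varies across the literature.

    # Minimal sketch: reflection coefficients of a lossless tube model.
    import numpy as np

    # Cross-sectional areas (cm^2) of successive sections, glottis -> lips.
    areas = np.array([0.6, 1.2, 2.4, 3.0, 2.0, 1.0, 1.8])

    def reflection_coefficients(areas):
        # k_i at the junction of sections i and i+1.
        a, b = areas[:-1], areas[1:]
        return (b - a) / (b + a)

    print(reflection_coefficients(areas).round(3))
    # Moving the articulators changes the area function over time, which
    # changes these coefficients and hence the resonances of the tract.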
Sinewave Synthesis
Sinewave synthesis is a method for synthesizing speech by replacing the formants (the resonances of the vocal tract) with pure tone whistles. It is based on the assumption that the speech signal can be represented in terms of sine waves of varying frequencies and amplitudes. Therefore, the speech signal s(n) can be modelled as the sum of N sinusoids:

s(n) = \sum_{i=1}^{N} A_i \sin(\omega_i n + \phi_i)

where A_i, ω_i and φ_i represent the amplitude, frequency and phase of the i-th sinusoidal component.
These parameters are estimated from the discrete Fourier transform (DFT) of windowed signal frames: the peaks of the spectral magnitude are selected from each frame. The basic model is also known as the McAulay/Quatieri sinusoidal model, and it has modifications such as ABS/OLA (Analysis-by-Synthesis / Overlap-Add) and Hybrid Sinusoidal/Noise models. However, such a model works well only for representing periodic signals, such as vowels and voiced consonants, and does not work for representing unvoiced speech.
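The following sketch illustrates the analysis/resynthesis idea on a single windowed frame: estimate amplitudes, frequencies and phases from the peaks of the DFT magnitude and rebuild the frame as a sum of sinusoids. The peak picking and amplitude correction here are deliberately crude and for illustration only.

    # Minimal sketch of the McAulay/Quatieri-style sinusoidal model.
    import numpy as np

    fs = 8000
    n = np.arange(512)
    frame = (0.8 * np.sin(2 * np.pi * 200 * n / fs)
             + 0.4 * np.sin(2 * np.pi * 400 * n / fs))

    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    mag = np.abs(spectrum)

    # Crude peak picking: local maxima of the magnitude, largest first.
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:2]  # N = 2

    # Resynthesize s(n) = sum_i A_i cos(w_i n + phi_i); the cosine form is
    # the sine form of the text shifted by pi/2.
    resynth = np.zeros(len(frame))
    for k in peaks:
        A = 2 * mag[k] / window.sum()    # approximate window-gain correction
        w = 2 * np.pi * k / len(frame)   # bin k -> w rad/sample
        resynth += A * np.cos(w * n + np.angle(spectrum[k]))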
Formant Synthesis
Formant synthesis is based on the source-filter model of speech: the human vocal tract has several resonances, and these resonances give speech sounds their various characteristics. The vocal tract can be modelled as a cascade of several resonators when producing the various periodic (oral) sounds. However, for nasalized sounds the nasal cavity comes into the picture, and the tract can be modelled as a parallel resonant structure. Therefore, there are two basic structures in general, parallel and cascade, but for better performance some combination of the two is usually used.
Formant synthesis can also produce an unlimited variety of sounds, which makes it more flexible than concatenative methods. The lower formants are known to carry the information of the sound itself, while the higher formants carry speaker information. At least three formants are generally required to produce intelligible speech, and up to five formants to produce high-quality speech. Each formant is usually modelled by a second-order (two-pole) resonator, which allows both the formant frequency and its bandwidth to be specified: the pole angle corresponds to the angular formant frequency, and the distance of the pole from the unit circle (in the z-plane) corresponds to the specified -3 dB bandwidth.
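A minimal sketch of one such resonator follows: a second-order two-pole filter whose pole angle is set by the formant frequency and whose pole radius is set by the bandwidth. Cascading several of them gives a basic formant synthesizer; the formant frequencies and bandwidths below are illustrative values only.

    # Minimal sketch: a second-order formant resonator and a cascade.
    import numpy as np

    def resonator(x, freq_hz, bw_hz, fs):
        r = np.exp(-np.pi * bw_hz / fs)      # pole radius from bandwidth
        theta = 2 * np.pi * freq_hz / fs     # pole angle from frequency
        b, c = 2 * r * np.cos(theta), -r * r
        a = 1 - b - c                        # unity gain at DC
        y = np.zeros(len(x))
        for i in range(len(x)):
            y[i] = a * x[i] + b * y[i - 1] + c * y[i - 2]
        return y

    fs = 16000
    source = np.zeros(fs // 10)
    source[::160] = 1.0                      # 100 Hz impulse train as source

    # Cascade three formants, roughly an /a/-like vowel (illustrative).
    out = source
    for f, bw in [(700, 130), (1220, 70), (2600, 160)]:
        out = resonator(out, f, bw, fs)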
Concatenation Synthesis
Concatenation synthesis produces speech by connecting natural, prerecorded speech sound units. These sound units can be words, syllables, half-syllables, phonemes, half-phones, diphones or triphones. Joining natural speech utterances helps to achieve very natural-sounding speech. However, the joints between speech sound units need to be smooth to avoid abrupt changes or glitches in the synthetic speech signal. The main challenges in this approach are the prosodic modification of speech units and resolving discontinuities at unit boundaries or joints. Prosodic modification introduces artefacts into the speech, which in turn make the speech sound unnatural. Unit-selection synthesis, a prominent variant of this approach, additionally requires a large speech corpus to be labeled, which is very tedious and time-consuming. These types of TTS systems are more natural sounding; however, the memory requirement of such synthesis techniques is very large, and hence they may be difficult to port to hand-held devices.
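The following sketch illustrates the joining problem on synthetic stand-in units: a short linear crossfade (overlap-add) at the joint avoids an audible click. Real systems apply far more sophisticated smoothing; this shows only the basic idea.

    # Minimal sketch: splice two 'recorded' units with a linear crossfade.
    import numpy as np

    def crossfade_join(unit_a, unit_b, overlap):
        fade = np.linspace(0.0, 1.0, overlap)
        blended = unit_a[-overlap:] * (1 - fade) + unit_b[:overlap] * fade
        return np.concatenate([unit_a[:-overlap], blended, unit_b[overlap:]])

    fs = 16000
    t = np.arange(fs // 20) / fs                 # 50 ms per unit
    unit_a = np.sin(2 * np.pi * 220 * t)         # stand-in for a stored unit
    unit_b = np.sin(2 * np.pi * 180 * t)         # stand-in for a stored unit

    speech = crossfade_join(unit_a, unit_b, overlap=80)  # 5 ms crossfade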
Statistical Parametric Synthesis
In this method, instead of storing the whole speech database, statistically meaningful model parameters are stored and used to synthesize speech waveforms. Building a unit-selection system on one speaker's data requires about 8–10 hours of speech, and the syllable or phoneme coverage in such a database is very uneven. Apart from this, in a concatenative approach the system can only recreate units from what has been recorded. Furthermore, the concatenative approach effectively memorizes the speech data, whereas the statistical approach attempts to learn the general properties of the speech data.
The Hidden Markov Model (HMM)-based speech synthesis system (HTS) comes under the category of statistical parametric speech synthesis (SPS) methods. The basic idea is to generate an average of similar-sounding speech segments. Spectrum and excitation parameters are first extracted from the speech database: Mel-frequency cepstral coefficients (MFCCs) and their dynamic features are generally taken as the spectrum (i.e., vocal tract system) parameters, and log(F0) and its dynamic features are taken as the excitation (i.e., speech source) parameters. These features are then modeled by context-dependent HMMs, with spectrum, excitation and duration modeled in a unified framework. After training, at synthesis time, the given text is first converted into a context-dependent phoneme sequence. Then, according to the phoneme sequence, an utterance HMM is constructed by concatenating context-dependent HMMs, and the state durations of the HMMs are determined. Next, using the speech parameter generation algorithm, spectrum and excitation parameters are generated. Finally, the speech waveform is generated using the Mel log spectrum approximation (MLSA) filter.
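As a sketch of the feature-extraction step described above, the following uses the librosa library to compute MFCCs (spectrum parameters) and a log F0 track (excitation parameter) from a recording. The file name is a placeholder, and the pYIN pitch-range bounds are assumptions.

    # Minimal sketch: extract spectrum and excitation features for SPS.
    import numpy as np
    import librosa

    y, sr = librosa.load("speaker_utterance.wav", sr=16000)  # placeholder

    # Spectrum (vocal tract) parameters: MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Excitation (source) parameter: F0 via the pYIN tracker, then log F0.
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    log_f0 = np.where(voiced, np.log(f0), np.nan)  # NaN marks unvoiced

    # In HTS, these streams plus their delta features would be modeled by
    # context-dependent HMMs.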
Recent papers in TTS are heading in this direction, utilizing a single acoustic model. One publicly available dataset for such work, containing audio in 10 Indian languages, is:
https://www.kaggle.com/datasets/hbchaitanyabharadwaj/audio-dataset-with-10-indian-languages