Abhijit Pradhan, Aswin Shanmugam S, Anusha Prakash, Kamakoti Veezhinathan, Hema Murthy
phone based system. In Section 5, we discuss the experiments and results. Finally, in Section 6, we summarise the fallouts of this proposed approach in the context of Indian languages.

2. IMPORTANCE OF A SYLLABLE AS A UNIT

The basic linguistic unit with a correspondence to a sound unit is the phoneme. Although the phoneme is a distinct acoustic unit, the basic unit of production and cognition is the syllable [6, 7, 8, 9]. A syllable can be represented as C*VC*, where C is a consonant and V is a vowel. A syllable captures intra-syllable co-articulation between phonemes well. Although co-articulation is also present across syllables, it is weaker, owing to the acoustic energy at the boundary between syllables being significantly lower [10] than that at the middle of a syllable.

Our brain learns and stores a dictionary of syllables. Whenever a new word is encountered, it is split into syllable-like units (syllable priming) based on some context [8, 9]. For example, /protect/ is syllabified as /pro/-/tect/, while /protected/ is syllabified as /pro/-/tec/-/ted/. Clearly, the final /t/ in /tect/ is acoustically quite different from the /t/ in /ted/. Experiments conducted in [11] indicate that knowledge of the writing system is essential to determine and identify phones or sounds. A number of efforts in the literature attempt to use the syllable as a basic unit. Especially in the context of Indian languages, it has been shown that syllables can be used for both synthesis and recognition [12]. Although the role of syllables in speech segmentation cannot be questioned [11], appropriate acoustic cues are required to segment the speech signal at the syllable level. In [13], it is shown that an utterance can be quite accurately segmented at the syllable level using acoustic cues. The body of work presented in this paper exploits this.

3. PHONE BASED HTS

HTS generates synthetic speech from HMMs using a speech parameter generation algorithm [3] and a source-filter based speech production model. For each phone, two types of HMMs are created. Spectrum and pitch together are modeled as a multi-stream HMM; we henceforth refer to this as the composite model. Duration is modeled using another HMM; we henceforth refer to this as the duration model. The feature vector in a composite model has two streams: one for the spectral parameter vector and another for the pitch parameter vector [14]. Each duration model is represented by a multi-dimensional Gaussian distribution. The dimension of the duration density is N, where N is the number of states in the corresponding composite model, and the n-th dimension corresponds to the n-th state of the composite model [15]. Both HMMs are fullcontext phone models.

3.1. Training

A detailed phone training process can be found in [16]. The sequence of steps in the training process is:
(1) Monophone composite models are built.
(2) Fullcontext composite and duration models are built from the monophone models.
(3) A context tree is built from these composite models using a Question set.

3.2. Synthesis

Training produces fullcontext composite and fullcontext duration models. Synthesis involves the following sequence of steps:
(1) Create fullcontext phone labels from the input text.
(2) For an unseen model, find the closest model from the context tree.
(3) Obtain the state sequence from the duration models.
(4) Form a sentence/utterance from the state sequence and the fullcontext composite models.
(5) Apply the parameter generation algorithm [3] to obtain the speech observation vector that maximizes the output probability.

4. SYLLABLE BASED HTS

4.1. Training

We now discuss the development of a syllable based HTS. When building a syllable based HTS, three different issues must be addressed, namely, mapping from text to syllables, building syllable level HMMs, and falling back when a syllable is not found in the database. Each of these tasks is described in this subsection.

4.1.1. Letter To Sound Rules (Parser)

In building a TTS system, the first task is to establish a mapping between the written form and the spoken form. A parser is the component that performs this mapping. The conversion process is heavily language specific. For instance, in English, grapheme to phoneme conversion cannot be performed using a set of rules; instead, the conversions are learnt from data or a dictionary. In Indian languages too, the conversion is language specific. A finite number of rules that are independent of the type of word are required [17, 18]. We briefly describe two rules, one for each of the languages Hindi and Tamil. In Hindi, one such rule is schwa deletion: for syllables ending with a unit of the CV type, the V part is replaced with a unit called halant. For example, the word /chinatana/ without schwa deletion would be syllabified as /china/ and /tana/, but with schwa deletion it is syllabified as /chin/ and /tan/. In Tamil, one such rule is voiced-unvoiced classification. The voiced and unvoiced stop consonants have the same orthographic representation. The Tamil character க can be pronounced as both /ka/ (unvoiced) and /ga/ (voiced). It is classified as unvoiced when the character occurs at the beginning of a word or appears as a geminate, and classified as voiced when it occurs between vowels, after the nasals (ங, ஞ, ண, ந, ன, ம), or after the consonants (/ya/, /ra/, /la/, /La/, /zha/). For example, in the word அக்கா /akkaa/, the character occurs as a geminate and is thus unvoiced. In the word /aNAgam/, the character க occurs after the nasal /NAa/ and is thus voiced.

4.1.2. Building Syllable Models

A semiautomatic labeling tool is used to obtain labels at the syllable level. Group delay segmentation [19] is used to find the syllable boundaries. Figure 1 illustrates the process to [...]

[...] the number of states for each syllable can be obtained. The number of states in the corresponding HMM is (number of phones) x 5. Although syllables capture co-articulation well, the acoustic realisation of a syllable depends upon its position in a word; the prosody differs significantly based on the syllable's position. Based on experiments with syllable-based unit selection synthesis (USS) for Indian languages, positional context seems to improve USS performance [20]. Syllables are therefore classified into three categories, begin, middle, and end, based on their positions in a word, and BEG, MID, and END are added as suffixes to the text transcriptions and modelled as separate HMMs. Once the positional context and the number of states for a syllable are known, an appropriate HMM prototype is chosen to build the syllable models. Thus, for a language having N syllables, there can be at most 3N composite HMMs and 3N duration HMMs. Unseen syllables are modelled using fallback units; this is discussed in the next section.
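The positional-context labelling and the states-per-syllable rule above can be sketched as follows; this is a minimal illustration assuming the parser supplies each syllable together with its phone count, and the function name and example word are our own, not part of the paper's toolchain:

```python
def positional_labels(syllables):
    """Attach BEG/MID/END suffixes based on position in the word and
    compute the HMM state count as (number of phones) x 5."""
    n = len(syllables)
    labels = []
    for i, (syl, phone_count) in enumerate(syllables):
        if i == 0:
            pos = "BEG"   # first syllable of the word
        elif i == n - 1:
            pos = "END"   # last syllable of the word
        else:
            pos = "MID"   # everything in between
        labels.append((f"{syl}_{pos}", phone_count * 5))
    return labels

# A hypothetical three-syllable word with 2, 2 and 3 phones per syllable:
print(positional_labels([("chi", 2), ("na", 2), ("tan", 3)]))
# -> [('chi_BEG', 10), ('na_MID', 10), ('tan_END', 15)]
```

A one-syllable word is labelled BEG in this sketch; the paper does not spell out that corner case.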
[...] /pa/ (BeginTime=0.464, EndTime=0.549) and [...]_end (BeginTime=0.549, EndTime=0.700).

5. EXPERIMENTS AND RESULTS
1643 Tamil sentences and 945 Hindi sentences are used for our experiments. All sentences are spoken by a single native female speaker of the respective language and recorded in a studio environment at 16000 Hz, 16 bits per sample. After parsing these sentences and removing syllables with very few examples (syllables with fewer than 5 examples are removed), 984 and 1087 syllables (with positional contexts) are obtained for Tamil and Hindi, respectively. For the experiments in this paper, HTK-3.4 + HTS-2.1 is used and the Global Variance (GV) enhancement [21] is enabled. Degradation [...]
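The pruning of under-represented syllables described above can be sketched as a simple frequency filter; the function name and the toy label list are our own illustration, not the paper's code:

```python
from collections import Counter

def prune_rare_syllables(corpus_labels, min_examples=5):
    """Keep only syllable labels (with positional context) that occur at
    least `min_examples` times in the parsed corpus; rarer syllables are
    left to be handled by fallback units."""
    counts = Counter(corpus_labels)
    return {syl for syl, count in counts.items() if count >= min_examples}

labels = ["ka_BEG"] * 7 + ["ri_MID"] * 3 + ["tan_END"] * 5
print(sorted(prune_rare_syllables(labels)))
# -> ['ka_BEG', 'tan_END']
```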
6. CONCLUSION

[...]
7. REFERENCES

[1] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in ICASSP. IEEE, 1996, pp. 373-376.

[2] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Comm., vol. 51, pp. 1039-1064, 2009.

[3] K. Tokuda, T. Yoshimura, T. Masuko, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP. IEEE, 2000, pp. 1315-1318.

[4] J. J. Odell, P. C. Woodland, and S. J. Young, "Tree-based state clustering for large vocabulary speech recognition," in International Conference on Speech, Image Processing and Neural Networks, 1994.

[5] "HMM-based speech synthesis system (HTS)," http://hts.sp.nitech.ac.jp/, [Last online access by authors; 01-Feb-2013].

[6] Steven Greenberg, "Speaking in shorthand - a syllable-centric perspective for understanding pronunciation variation," Speech Communication, vol. 29, pp. 159-176, 1999.

[7] Osamu Fujimura, "Syllable as a unit of speech recognition," in ICASSP. IEEE, February 1975, vol. 23, pp. 82-87.

[8] Joana Cholin, Niels O. Schiller, and Willem J. M. Levelt, "The preparation of syllables in speech production," Journal of Memory and Language, vol. 50, pp. 47-61, 2004.

[...]

[13] Hema A. Murthy and B. Yegnanarayana, "Group delay functions and its application to speech processing," Sadhana, vol. 36, no. 5, pp. 745-782, November 2011.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Eurospeech, Sept 1999, pp. 2347-2350.

[15] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Duration modeling for HMM-based speech synthesis," in ICSLP, 1998, pp. 29-32.

[16] "HTS tutorial at Interspeech, 2009," http://www.sp.nitech.ac.jp/~tokuda/tokuda_interspeech09_tutorial.pdf, [Last online access by authors; 01-Feb-2013].

[17] M. Choudhury, "Rule based grapheme to phoneme mapping for Hindi speech synthesis," in 90th Indian Science Congress of the International Speech Communication Association (ISCA), 2003.

[18] K. Ghosh, R. V. Reddy, N. P. Narendra, S. Maity, S. G. Koolagudi, and K. S. Rao, "Grapheme to phoneme conversion in Bengali for festival based TTS framework," in 8th International Conference on Natural Language Processing (ICON), 2010.

[19] T. Nagarajan and Hema A. Murthy, "Group delay based segmentation of spontaneous speech into syllable-like units," EURASIP Journal of Applied Signal Processing, vol. 17, pp. 2614-2625, 2004.

[20] Vinodh M. V., Ashwin Bellur, Badri Narayan K., Deepali Thakare M., Anila Susan, Suthakar N. M., and Hema A. Murthy, "Using polysyllabic units for text to speech synthesis in Indian languages," in National Conference on Communication (NCC), 2010, pp. 1-5.

[...]