You are on page 1of 135

NAIST-IS-DT0161027

Doctoral Thesis

High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion
Tomoki Toda

March 24, 2003 Department of Information Processing Graduate School of Information Science Nara Institute of Science and Technology

Doctoral Thesis submitted to Graduate School of Information Science, Nara Institute of Science and Technology in partial ful

llment of the requirements for the degree of DOCTOR of ENGINEERING Tomoki Toda Thesis committee: Kiyohiro Shikano, Professor Yuji Matsumoto, Professor Nick Campbell, Professor Hiroshi Saruwatari, Associate Professor

we propose a novel segment selection algorithm based on both phoneme and diphone units that does not avoid concatenation of Doctoral Thesis. Department of Information Processing. This thesis addresses two problems in speech synthesis. The other is how to improve control of speaker individuality in order to achieve more exible speech synthesis. e. no general-purpose TTS has been developed that can consistently synthesize suciently natural speech. One is how to improve the naturalness of synthetic speech in corpus-based TTS. and e-mail reading. Graduate School of Information Science. there is not yet enough exibility in corpusbased TTS. Corpus-based TTS makes it possible to dramatically improve the naturalness of synthetic speech compared with the early TTS. Moreover. 3 i . In order to address this problem. It can be utilized for various purposes. NAIST-IS-DT0161027. we focus on a voice conversion technique to control speaker individuality to deal with the latter problem. Since various vowel sequences appear frequently in Japanese. and (2) an evaluation measure for selecting the synthesis units. Nara Institute of Science and Technology. 2003.High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion Tomoki Toda Abstract 3 Text-to-Speech (TTS) is a useful technology that converts any text into a speech signal. car navigation. it is not realistic to prepare long units that include all possible vowel sequences to avoid vowel-to-vowel concatenation. Furthermore. To deal with the former problem.g. response services in telecommunications. announcements in railway stations. which often produces auditory discontinuity. we focus on two factors: (1) an algorithm for selecting the most appropriate synthesis units from a speech corpus. March 24. However.

Experiments testing concatenation of vowel sequences clarify that better segments can be selected by considering concatenations not only at phoneme boundaries but also at vowel centers. Moreover.vowel sequences but alleviates the resulting discontinuity. A cost is established as a measure for selecting the optimum waveform segments from a speech corpus. In order to achieve high-quality segment selection for concatenative TTS. it is important to utilize a cost that corresponds to perceptual characteristics. the results of perceptual experiments show that speech synthesized using the proposed algorithm has better naturalness than that using the conventional algorithms. We .

From the results of perceptual experiments.rst clarify the correspondence of the cost to the perceptual scores and then evaluate various functions to integrate local costs capturing the degradation of naturalness in individual segments. we .

To overcome this problem. We improve the voice conversion algorithm based on the Gaussian Mixture Model (GMM).nd a novel cost that takes into account not only the degradation of naturalness over the entire synthetic speech but also the local degradation. which is a conventional statistical voice conversion algorithm. Segment selection. Keywords: Text-to-Speech. The GMM-based algorithm can convert speech features continuously using the correlations between source and target features. However. we propose a novel voice conversion algorithm that incorporates Dynamic Frequency Warping (DFW) technique. the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by the statistical averaging operation. speaker individuality. The experimental results reveal that the proposed algorithm can synthesize speech with a higher quality while maintaining equal conversionaccuracy for speaker individuality compared with the GMM-based algorithm. Naturalness. Voice conversion ii . Measure for selection. We also clarify that the naturalness of synthetic speech can be slightly improved by utilizing this cost and investigate the e ect of using this cost for segment selection. Synthesis unit.

NAIST-IS- iii .素片選択と声質変換を用いた高品質かつ 柔軟な音声合成3 戸田 智基 内容梗概 音声合成は機械から人間へ情報を伝達する上で重要な役割を担う技術である. その中の一つであるテキスト音声合成( TTS )は任意のテキストから音声を合成 れている.現在の TTS の主流は大規模な音声データ処理に基づくコーパスベー ス方式と呼ばれるものであり,初期のルールベース方式と比較すると劇的な進歩 する技術であり,その実用面における有効性の高さから研究開発が勢力的に行わ を遂げた.しかしながら,安定して自然性の高い音声を合成できるまでには至っ ておらず,様々な話者の音声を容易に合成するなどといった柔軟性も持ち合わせ ていない. 本論文では音声合成における二つの問題に取り組む.一つはコーパスベース TTS における合成音声の自然性の改善であり,もう一つは話者性制御の改善であ る.自然性に重要な影響を及ぼす要因として,音声コーパスからの最適な音声素 片選択方法とその際に用いる選択尺度に着目する.話者性制御に関しては,少量 の音声データから多様な話者の音声合成を可能とする声質変換技術に着目する. 波形接続型音声合成において,母音間での接続はしばしば不連続感を生み出 す要因となる.母音連鎖を一つの合成単位に含める方法が提案されているが,日 本語では様々な母音連鎖が頻繁に出現するため,母音間での接続を完全に防ぐこ とは不可能である.この問題に対して,接続を防ぐのではなく不連続感を緩和す るという観点から,音素単位とダ イフォン単位に基づく新たな素片選択法を提案 する.スペクトルの不連続性に関する評価実験を行うことで,音素境界だけでな く母音中心での接続も許容して素片を選択することの有効性を明らかにする.ま 3 奈良先端科学技術大学院大学 情報科学研究科 情報処理学専攻 博士論文 DT0161027. . 2003 年 3 月 24 日.

自然性. 合成単位,選択尺度.た,知覚実験結果から,提案法は従来法と比較し,より自然な音声を合成できる ことを示す. 音声コーパスから最適な素片を選択する際に用いる尺度をコストと呼ぶ.安 定して自然性の高い音声を合成するためには,知覚特性との対応が良いコストを 用いることが重要である.そこで,本論文ではまずコストと知覚特性との対応を 明らかにする.その際に,個々の素片に対するコストを統合して素片系列に対す るコストへと変換するための関数に着目する.知覚実験結果から,素片系列全体 に対する自然性劣化だけでなく局所的な自然性劣化も捉えることのできるコスト が,より知覚特性との対応が良いことを示す.本論文では,全体に対する劣化と 局所的な劣化を捉えるコストとして RMS( Root Mean Square )コストを用いる. 全体に対する劣化のみを捉える従来の平均コストとの比較により,RMS コストは 合成音声の自然性改善に有効であることを示す.さらに,素片選択において RMS 従来の統計的処理に基づく声質変換法の一つである混合正規分布モデルに基 コストを用いることにより生じる影響を明らかにする. づく変換法は,各話者の特徴量間の相関を用いることで連続的かつ精度の高い特 徴量の変換を可能とする.しかし,統計的処理により変換スペクトルは過剰に平 滑化されたものとなり,その結果変換音声の品質劣化が生じる.この過剰な平滑 化を防ぐために,周波数軸伸縮によるスペクトル変換処理を導入した新たな声質 変換法を提案する.評価実験結果から,提案法は従来法と比較し,同等の話者性 変換精度を維持しながら変換音声の品質を改善できることを示す. キーワード テキスト音声合成. 声質変換 iv . 素片選択. 話者性.

Hisashi Kawai. Supervisor of ATR Spoken Language Translation Research Laboratories. of Nara Institute of Science and Technology. President of ATR. and Associate Professor Hiroshi Saruwatari. Seiichi Yamamoto. for his constant guidance and encouragement through my master's course and doctoral course.Acknowledgments I would like to express my deepest appreciation to Professor Kiyohiro Shikano of Nara Institute of Science and Technology. for his continuous support and valuable advice through the doctoral course. for their bene. I would also like to express my gratitude to Professor Yuji Matsumoto. I could learn many lessons from his attitude toward study. my thesis advisor. Professor Nick Campbell. for their invaluable comments to the thesis. which led me to a new research idea. for giving me the opportunity to work for ATR Spoken Language Translation Research Laboratories as an Intern Researcher. The core of this work originated with his pioneering ideas in speech synthesis. Nobuyoshi Fugono. Director of ATR Spoken Language Translation Research Laboratories. and Dr. This work could not have been accomplished without his direction. I would like to thank Assistant Professor Hiromichi Kawanami and Assistant Professor Akinobu Lee of Nara Institute of Science and Technology. I would sincerely like to thank Dr. I have always been happy to carry out research with him. I would especially like to express my sincere gratefulness to Dr.

who is currently Head of Department 1 at ATR Spoken Language Translation Research Laboratories. for their helpful discussions.cial comments. and former Assistant Professor Jinlin Lu. who is currently an Associate Professor at Aichi Prefectural University. I would also like to thank former Associate Professor Satoshi Nakamura. I want to thank all members of the Speech and Acoustics Laboratory and Applied Linguistics Laboratory in Nara Institute of Science and v .

guidance. Associate Professor Keiichi Tokuda of Nagoya Institute of Technology. Senior Researcher of Arcadia Inc. for his support in the laboratories. Professor Hideki Kawahara of Wakayama University. for their support. for providing thoughtful advice and discussions on speech synthesis techniques. Jinfu Ni. I greatly appreciate Dr. I owe a great deal to Ryuichi Nishimura.. and having recommended that I enter Nara Institute of Science and Technology. I would like to acknowledge my family and friends for their support. of ATR Spoken Language Translation Research Laboratories. Also. for their valuable advice and discussions. and Dr. Finally. the Head of Department 4 at ATR Spoken Language Translation Research Laboratories. for his encouragement. I would also like to thank my many other colleagues at ATR Spoken Language Translation Research Laboratories. I would especially like to express my gratitude to Dr. I am indebted to many Researchers and Professors. and Professor Yoshinori Sagisaka of Waseda University. I would sincerely like to thank Minoru Tsuzaki. Toshio Hirai. Senior Research Engineer of NTT Cyber Space Laboratories. of Nagoya University. a Researcher. Senior Researcher. Associate Manager. vi . Masanobu Abe. I would especially like to thank Dr. for providing lively and fruitful discussions about speech synthesis. Hideki Tanaka. doctoral candidate of Nara Institute of Science and Technology. I would also like to express my gratitude to Professor Fumitada Itakura. Associate Professor Syoji Kajita. and Associate Professor Mikio Ikeda of Yokkaichi University. and Research Associate Hideki Banno. Associate Professor Kazuya Takeda.Technology for providing fruitful discussions.

1 Background and Problem De.Contents Abstract Japanese Abstract Acknowledgments List of Figures List of Tables 1 Introduction ii iv v xiv xv 1 : : : : : : : : : : : : 1.

2.2 Thesis Scope 1.3 Statistical Voice Conversion Algorithm vii : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 6 8 8 10 11 15 17 19 6 .nition 1.2.4 Waveform synthesis 2.2 Prosody generation 2.2.2.2.1 Improvement of naturalness of synthetic speech 1.2.2 Improvement of control of speaker individuality 1.2.2 Structure of Corpus-Based TTS 2.1 Introduction 2.3 Thesis Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 2 2 4 4 2 Corpus-Based Text-to-Speech and Voice Conversion 2.3 Unit selection 2.5 Speech corpus 2.1 Text analysis 2.

2.2.4 Segment Selection Algorithm Based on Both Phoneme and Diphone Units 3.3.2 Proposed algorithm 3.5.3.7 Integrated cost 3.2 Experimental results 3.5 Sub-cost on spectral discontinuity: spec 3.2 Experiment allowing substitution of phonetic environment 3.6 Summary : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : C : : : : : : : : : : : : : : : : : : C F : : : : : : : : : : : : : : C : : : : : : : : : : C : : : : : : : : : : : : : : : : : : : C : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 An Evaluation of Cost Capturing Both Total and Local Degradation of Naturalness for Segment Selection 49 4.3.1 Experimental conditions 3.4 Sub-cost on phonetic environment: env 3.4.2 Cost Function for Segment Selection 3.5.2.1 Conventional algorithm 3.3 Sub-cost on 0 discontinuity: F0 3.1 Local cost 3.2.4.2 Conversion algorithm based on Gaussian Mixture Model 2.3.3 Experiment prohibiting substitution of phonetic environment 3.3 Comparison with segment selection based on half-phoneme units 3.1 Experimental conditions 3.6 Sub-cost on phonetic appropriateness: app 3.1 Conversion algorithm based on Vector Quantization 2.4 Summary : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 20 22 23 25 27 28 29 30 31 32 32 33 34 35 36 37 38 39 39 41 45 46 46 47 47 50 3 A Segment Selection Algorithm for Japanese Speech Synthesis Based on Both Phoneme and Diphone Units 26 3.3 Comparison of mapping functions 2.4.1 Introduction 3.2.1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : viii .3.2.5 Experimental Evaluation 3.2 Sub-cost on prosody: pro 3.3.2.3 Concatenation at Vowel Center 3.2.

3 Relationship between e ectiveness of RMS cost and corpus size 4.4.6 Summary ix : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 77 77 79 79 80 82 83 84 85 86 86 89 90 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : .2 E ect of RMS cost on selected segments 4.5.3 Preference tests on speech quality 5.4 Evaluation of segment selection by estimated perceptual score 4.1 E ect of RMS cost on various costs 4.1 Introduction 5.4.2 Preference tests on speaker individuality 5.5.2 Shortcomings of GMM-based conversion algorithm 5.1 Evaluation of spectral conversion-accuracy of GMM-based conversion algorithm 5.2 Various Integrated Costs 4.3 Correspondence of RMS cost to perceptual score in lower range of RMS cost 4.4 E ectiveness of Mixing Converted Spectra 5.4 Segment Selection Considering Both Total Degradation of Naturalness and Local Degradation 4.3.2 Subjective evaluation of speech quality 5.2.4.3.2 GMM-Based Conversion Algorithm Applied to STRAIGHT 5.5 Experimental Evaluation 5.1 E ect of mixing-weight on spectral conversion-accuracy 5.2 Mixing of converted spectra 5.3.1 Correspondence of cost to perceptual score 4.4.4.4.4.2 Preference test on naturalness of synthetic speech 4.3 Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping 5.3.3 Perceptual Evaluation of Cost 4.5 Summary : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 51 52 52 57 60 62 63 66 67 70 71 74 75 5 A Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping 73 5.4.1 Subjective evaluation of speaker individuality 5.3.1 Dynamic Frequency Warping 5.2.

1 Summary of the Thesis 6.6 Conclusions Appendix A B C 6.2 Future Work 91 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 91 93 96 97 Frequency of Vowel Sequences De.

s and p. on Mismatch of Phonetic Environment P S S 96 99 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : References List of Publications 100 113 x .nition of the Nonlinear Function Sub-Cost Functions.

1 2. Schematic diagram of waveform concatenation.1 Problems addressed in this thesis. Schematic diagram of text analysis.3 2.5 2.List of Figures 1. Schematic diagram of segment selection. Schematic diagram of HMM-based prosody generation.2 2.6 : : : : : : : : : : : : : : : : : : 3 8 9 12 15 18 18 Structure of corpus-based TTS.4 2. Schematic diagram of speech synthesis with prosody modi. 2.

3. The contour line denotes frequency distribution of training data in the joint feature space.2 Targets and segments used to calculate each sub-cost in calculation of the cost of a candidate segment i for a target i.5 Statistical characteristics of static feature and dynamic feature of spectrum in vowels.1 Schematic diagram of cost function. i and i show phonemes considered target and candidate segments. 3. 3.7 Various mapping functions. 3. respectively.cation by STRAIGHT.3 Schematic diagram of function to integrate local costs . : : : : : : : : : : : : : : : : : u t t u LC : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 37 xi . \Normalized time" shows the time normalized from 0 (preceding phoneme boundary) to 1 (succeeding phoneme boundary) in each vowel segment. 2. E y x x : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 24 29 31 34 36 3.4 Spectrograms of vowel sequences concatenated at (a) a vowel boundary and (b) a vowel center. \x" denotes the conditional expectation [ j ] calculated in each value of original feature .

\Vfh " and \Vlh " show the .3. \V*" shows all vowels.6 Concatenation methods at a vowel boundary and a vowel center.

Concatenation at C-V boundaries is prohibited.10 Targets and segments used to calculate each sub-cost in calculation l of the cost of candidate segments f i i for a target i .5 Correlation between maximum cost and perceptual score.u t : : : : : : : : : : : : : : : : : 38 40 40 41 43 44 45 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 48 53 53 54 55 55 56 58 4. 4. 4. B"). u .D. \S.9 Example of segment selection based on phoneme units. 3. 3. V-S.4 Correlation between average cost and perceptual score. and V-N sequences (\Exp. " shows standard deviation.12 Example of segment selection based on half-phoneme units.3 Correlation coecient between norm cost and perceptual score as a function of power coecient. 4.7 Correlation coecient between RMS cost and normalized opinion score for each listener. 3. A") and those of comparison with the segment selection allowing only concatenation at vowel center in V-V. 4. 3.1 Distribution of average cost and maximum cost for all synthetic utterances. Concatenation at C-V boundaries and selection of isolated half-vowels are prohibited. 4.13 Results of comparison with the segment selection based on phoneme units (\Exp. respectively. 3.rst half-vowel and the last half-vowel.2 Scatter chart of selected test stimuli. The input sentence is \tsuiyas" (\spend" in English). 4. . 3. 3. p : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : xii .7 Frequency distribution of distortion caused by concatenation between vowels in the case of allowing substitution of phonetic environment.6 Correlation between RMS cost and perceptual score.8 Frequency distribution of distortion caused by concatenation between vowels that have the same phonetic environment.11 Example of segment selection based on phoneme units and diphone units. The RMS cost can be converted into a perceptual score by utilizing the regression line.

4.8 Best correlation between RMS cost and normalized opinion score (left .

gure) and worst correlation between RMS cost and normalized opinion score (right .

2 Example of spectrum converted by GMM-based voice conversion algorithm (\GMM-converted spectrum") and target speaker's spectrum (\Target spectrum").3 GMM-based voice conversion algorithm with Dynamic Frequency Warping.gure) in results of all listeners. Each dot denotes a stimulus pair. Mean and standard deviation are shown.5 Variations of mixing-weights that correspond to the di erent parameters . 4. 4. a : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : xiii . 5.1 Mel-cepstral distortion.12 Correlation between RMS cost and perceptual score in lower range of RMS cost.17 Segment length in number of syllables as a function of corpus size." and \RMS" show the average and the root mean square of local costs. 4. 4.14 Target cost as a function of corpus size. 4. \*" denotes any phoneme. 5.11 Preference score. 5.18 Increase rate in the number of concatenations as a function of corpus size. \Av.4 Example of frequency warping function.21 Estimated perceptual score as a function of corpus size. Mean and standard deviation are shown. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 58 59 60 61 62 64 65 65 68 68 69 69 70 71 78 78 79 81 82 5.19 Concatenation cost in each type of concatenation.13 Local costs as a function of corpus size.16 Segment length in number of phonemes as a function of corpus size. The corpus size is 32 hours.15 Concatenation cost as a function of corpus size. 4. 4. 4. 4. 4. 4.10 Scatter chart of selected test stimuli. 4. respectively.20 Di erences in costs as a function of corpus size. 4. 5.9 Examples of local costs of segment sequences selected by the average costs and by the RMS cost.

5.6 Example of converted spectra by the GMM-based algorithm (\GMM"), the proposed algorithm without the mix of the converted spectra (\GMM & DFW"), and the proposed algorithm with the mix of the converted spectra (\GMM & DFW & Mix of spectra"). 83 5.7 Mel-cepstral distortion as a function of parameter of mixingweight. \Original speech of source" shows the mel-cepstral distortion before conversion. 84 5.8 Relationship between conversion-accuracy for speaker individuality and parameter of mixing-weight. A preference score of 50% shows that the conversion-accuracy is equal to that of the GMM-based algorithm, which provides good performance in terms of speaker individuality. 86 5.9 Relationship between converted speech quality and parameter of mixing-weight. A preference score of 50% shows that the converted speech quality is equal to that of the GMM-based algorithm with DFW, which provides good performance in terms of speech quality. 87 88 5.10 Correct response for speaker individuality. 5.11 Mean Opinion Score (\MOS") for speech quality. 89
: : : : a : : : : : : : : : : : : : : : : : : : : : : : : a : : : : : : : : : : : : : : : : : : : : : : : : : : : : : a : : : : : : : : : : : : : : : : : : : : : :

B.1 Nonlinear function for sub-cost on prosody.
P

: : : : : : : : : : :

98

xiv

List of Tables
3.1 Sub-cost functions 3.2 Number of concatenations in experiment comparing proposed algorithm with segment selection based on phoneme units. \S" and \N" show semivowel and nasal. \Center" shows concatenation at vowel center 3.3 Number of concatenations in experiment comparing proposed algorithm with segment selection allowing only concatenation at vowel center in V-V, V-S, and V-N sequences A.1 Frequency of vowel sequences
: : : : : : : : : : : : : : : : : : : : : : : : : :

30

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

47 47 96

: : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : :

xv

Chapter 1 Introduction
1.1 Background and Problem De

Thus.nition Speech is the ordinary way for most people to communicate. In general. it is important to realize a man-machine interface to facilitate communication between people and computers. One is speech recognition. Therefore. e. two technologies for processing speech are needed. attitude. and the other is speech synthesis. and useful means of communication. Therefore. Naturally. Necessary information. Moreover. it is said that speech is the most natural. message information. In recent years. and speaker individuality. speech is focused on as a medium for such communication. it is important to . Speech recognition is a technique for information input. speech can convey other information such as emotion. is extracted from input speech that includes diverse information. computers have come into common use as computer technology advances. convenient.g.

speech synthesis is a technique for information output. Moreover. e. and is generated from input information. other information such as speaker individuality and emotion is needed in order to realize smoother communication. On the other hand. sound information and prosodic information. it is important to . This procedure is the reverse of speech recognition.nd a method to extract only useful information.g. Output speech includes various types of information. Thus.

TTS is 1 .nd a method to generate the various types of paralinguistic information that are not processed in speech recognition. Text-to-Speech (TTS) is one of the speech synthesis technologies.

Therefore. However. there is not yet enough exibility in corpus-based TTS. announcements in railway stations. car navigation.g. and research and development on TTS has been progressing. Corpus-based TTS can be used for practical purposes under limited conditions [15]. This approach makes it possible to dramatically improve the naturalness of synthetic speech compared with the early TTS. corpus-based TTS can synthesize only speech having the speci. and it is very useful in many practical applications. The current trend in TTS is based on a large amount of speech data and statistical processing.a technique to convert any text into a speech signal [67]. response services in telecommunications. no general-purpose TTS has been developed that can synthesize sucient natural speech consistently for any input text. e. Furthermore. This type of TTS is generally called corpus-based TTS. and e-mail reading. it is desirable to realize TTS that can synthesize natural and intelligible speech. In general.

and it requires an enormous amount of time and expensive costs. Therefore. speech of various speakers. and (3) an evaluation measure to select the synthesis units. Speech recording is hard work. One is how to improve the naturalness of synthetic speech in corpus-based TTS.1 Improvement of naturalness of synthetic speech In corpus-based TTS. and other speaking styles. 1. The other is how to improve control of speaker individuality in order to achieve more exible speech synthesis. 2 . e.g. large-sized speech corpora are needed to synthesize speech with sucient naturalness. in order to synthesize other types of speech.c style included in a speech corpus.2 Thesis Scope This thesis addresses two problems in speech synthesis shown in Figure 1. it is necessary to improve the performance of corpus-based TTS. Therefore. various speech samples need to be recorded in advance. 1. emotional speech.1. Moreover. three main factors determine the naturalness of synthetic speech: (1) a speech corpus. (2) an algorithm for selecting the most appropriate synthesis units from the speech corpus. We focus on the latter two factors.2.

In order to alleviate this discontinuity. i.1. portions of speech utterances included in the corpus. In a speech synthesis procedure. and the synthetic speech is generated by concatenating the selected waveform segments. the optimum set of waveform segments. we propose a novel selection algorithm based on two synthesis unit de. In Japanese speech synthesis. syllable units cannot avoid vowel-to-vowel concatenation. Various units.e. are selected. and so on have been proposed.Improvement of naturalness of synthetic speech Naturalness Current level Synthesis of specific speakers’ speech Improvement of control of speaker individuality Synthesis of various speakers’ speech Large amount of speech data Small amount of speech data Flexibility Figure 1. This selection is performed based on synthesis units. phonemes. which often produces auditory discontinuity. diphones. syllables. because various vowel sequences appear frequently in Japanese. Problems addressed in this thesis. However. syllable units are often used since the number of Japanese syllables is small and transition in the syllables is important for intelligibility.

Therefore.nitions. Although a measure based on acoustic measures is often used. in order to realize high and consistent quality of synthetic speech. Moreover. we clarify 3 . the correspondence of such a measure to the perceptual characteristics is indistinct. it is important to use an evaluation measure that corresponds to perceptual characteristics in the selection of the most suitable waveform segments.

Therefore. 1. From the results of perceptual experiments. In this technique. The training of the conversion rules is performed based on statistical methods. we introduce Dynamic Frequency Warping (DFW) technique into the statistical voice conversion. Then some techniques in each module are reviewed. We describe the basic structure of the corpus-based TTS system. However. some conventional voice conversion algorithms are reviewed and conversion functions of the algorithms are compared. the performance of conventional voice conversion techniques is inadequate. 1. In Chapter 2. we show that the proposed voice conversion algorithm can synthesize converted speech more naturally while maintaining equal conversion-accuracy on speaker individuality compared with a conventional voice conversion algorithm. we improve this measure based on the results of these experiments. Moreover. conversion rules between two speakers are extracted in advance using a small amount of training speech data. Once training has been performed. Moreover. Although accurate conversion rules can be extracted from a small amount of training data. a corpus-based TTS system and conventional voice conversion techniques are described. any utterance of one speaker can be converted to sound like that of another speaker. 4 .2 Improvement of control of speaker individuality We focus on a voice conversion technique to control speaker individuality.the correspondence of the measure utilized in our TTS by performing perceptual experiments on the naturalness of synthetic speech. important information in uencing speech quality is lost.2. In order to avoid quality degradation caused by losing the information.3 Thesis Overview The thesis is organized as follows. we can easily synthesize speech of various speakers from only a small amount of speech data of the speakers by using the voice conversion technique. and we brie y introduce the techniques applied to the TTS system under development in ATR Spoken Language Translation Research Laboratories.

In Chapter 4. We clarify correspondence of the measure to the perceptual scores determined from the results of perceptual experiments.In Chapter 3. we . the measure is evaluated based on perceptual characteristics. we propose a novel segment selection algorithm for Japanese speech synthesis. Not only the segment selection algorithms but also our measure for selection of optimum segments are described. Moreover. Results of perceptual experiments show that the proposed algorithm can synthesize speech more naturally than conventional algorithms.

control of speaker individuality by voice conversion is described. In Chapter 6. We also show the e ectiveness of increasing the size of a speech corpus.nd a novel measure having better correspondence and investigate the e ect of using this measure for segment selection. The results of experiments show that the proposed algorithm has better performance compared with a conventional algorithm. We propose a novel voice conversion algorithm and perform an experimental evaluation on it. 5 . we summarize the contributions of this thesis and o er suggestions for future work. In Chapter 5.

In this chapter. parameters characterizing a speech production model are adjusted by performing iterative feedback control so that the error between the observed value and that produced by the model is minimized. 2.Chapter 2 Corpus-Based Text-to-Speech and Voice Conversion Corpus-based TTS is the main current direction in work on TTS.1 Introduction The early TTS was constructed based on rules that researchers determined from their objective decisions and experience [67]. Therefore. Moreover. we review conventional voice conversion algorithms that are useful for exibly synthesizing speech of various speakers. In general. The researcher extracts the rules for speech production by the Analysis-by-Synthesis (A-b-S) method [6]. Such rule determination needs professional expertise since it is dicult to extract consistent and reasonable rules. we describe the basic structure of corpus-based TTS and the various techniques used in each module. In the A-b-S method. the rulebased TTS systems developed by researchers usually have di erent performances. and then we compare various conversion functions. The naturalness of synthetic speech has been improved dramatically by the transition from the early rule-based TTS to corpus-based TTS. synthetic speech by rule-based TTS has an unnatural quality because a 6 . Moreover. this type of TTS is called rule-based TTS.

On the other hand.g. e.speech waveform is generated by a speech production model.g. which generally needs some approximations in order to model the complex human vocal mechanism [67]. optimum speech units are selected from the speech corpus. e. a mismatch of phonetic environments. An output speech waveform is synthesized by concatenating the selected units and then modifying their prosody. terminal analog speech synthesizer. the current TTS is constructed based on a large amount of data and a statistical process [43][89]. In corpusbased TTS. If the selected units need little modi. Corpus-based TTS can synthesize speech more naturally than rule-based TTS because the degradation of naturalness in synthetic speech can be alleviated by selecting units satisfying certain factors. This approach has been developed through the dramatic improvements in computer performance. and discontinuity produced by concatenating units. this type of TTS is called corpus-based TTS in contrast with rule-based TTS. di erence in prosodic information. In synthesis. In general. a large amount of speech data are stored as a speech corpus.

not only quality and intelligibility but also speaker individuality are important for smooth and full communication.cation. corpus-based TTS can synthesize high-quality and intelligible speech of the speaker. While. Furthermore. Therefore. If a large-sized speech corpus of a certain speaker can be used. since the corpus-based approach has hardly any dependency on the type of language. we can apply the approach to other languages more easily than the rule-based approach. natural speech can be synthesized by concatenating speech waveform segments directly. it is important to synthesize the speech of various speakers as well as the speech of a speci.

One of approaches for exibly synthesizing speech of various speakers is speech modi.c speaker.

2. This problem is associated with a mapping between features. This chapter is organized as follows. In voice conversion. In Section 2. it is important to extract accurate conversion rules from a small amount of training data. we describe the basic 7 . an extraction method of conversion rules is based on statistical processing. In general.cation by a voice conversion technique used to convert one speaker's voice into another speaker's voice [68]. and it is often used in speaker adaptation for speech recognition.

Power.. Finally. we summarize this chapter in Section 2.) Prosody generation Prosodic information (F0. corpus-based TTS is comprised of . structure of corpus-based TTS and review various techniques in each module. 2.1.) Speech corpus Unit selection Unit information Synthetic speech Waveform synthesis Figure 2.. . Accent. conventional voice conversion algorithms and comparison of mapping functions of the algorithms are described. Duration.4.. In Section 2.. Structure of corpus-based TTS.3. .Text TTS Text analysis Contextual information (Pronunciation.2 Structure of Corpus-Based TTS In general.

accent type. unit selection. 2. part-of-speech. and so on. pronunciation. The contextual information plays an important role in the quality and intelligibility of synthetic speech because prediction accuracy on this information a ects all of the subsequent procedures.1. and speech corpus. The structure of corpus-based TTS is shown in Figure 2.ve modules: text analysis. by natural language processing [91][96]. prosody generation. i.1 Text analysis In the text analysis. 8 .e. waveform synthesis.2. an input text is converted into contextual information.

Reading rules and accentual rules for word concatenation are often applied to the determination of this information [77][88]. the accent information is crucial to achieving high-quality synthetic speech. The normalized text is then divided into morphemes.g. various obstacles. and a syntactic analysis is performed. boundaries of prosodic clauses. e.2. First. Then. In some TTS systems. are removed if the input text includes these obstacles. Especially in Japanese.Input text Text normalization Morphological analysis Syntactic analysis Phoneme generation Accent generation Contextual information Figure 2.2. accent nucleus. and syntactic structure. 9 . accentual phrases. These morphemes are tagged with their parts of speech. the module determines phoneme and prosodic symbols. such as unreadable marks like HTML tags and e-mail headings. This processing is called text normalization. Schematic diagram of text analysis. ToBI (Tone and Break Indices) labels [97] or Tilt parameters [108] are predicted [17][38]. which are minimum units of letter strings having linguistic meaning. especially English TTS systems. A schematic diagram of text analysis is shown in Figure 2.

a phrase component that decreases gradually toward the end of a sentence and an accent component that increases and decreases rapidly at each accentual phrase. prosodic features such as 0 contour. [56]. In the 0 contour control model proposed by Kagoshima et al. The representative vectors are selected from an 0 contour codebook with contextual information.2. which are generated by modifying vectors that are representative of typical 0 contours. Many data-driven algorithms for prosody generation have been proposed. The codebook is designed so that the approximation error between 0 contours generated by this model and real 0 contours extracted from a speech corpus is minimized. particularly in Japanese TTS [45][61]. proposed using not the representative vectors but natural 0 contours selected from a speech corpus in order to generate an 0 contour of a sentence [50]. Fujisaki's model has been proposed as one of the models that can represent the 0 contour e ectively [39]. Isogai et al. This model decomposes the 0 contour into two components. Then the rules arranged by experts are applied.2 Prosody generation In the prosody generation. i. power contour.e. Fujisaki's model is often used to generate the 0 contour from the contextual information in rule-based TTS. the 0 contour is selected and used without modi. if there is an 0 contour having equal contextual information to the predicted contextual information in the speech corpus. automatic extraction algorithms of control parameters and rules from a large amount of data with statistical methods have been proposed [41][46].2. an 0 contour of a whole sentence is produced by concatenating segmental 0 contours. In this algorithm. In recent years. and phoneme duration are predicted from the contextual information output from the text analysis. This prosodic information is important for the intelligibility and naturalness of synthetic speech.

In all other cases. the 0 contour that most suits the predicted contextual information is selected and used with modi.cation.

algorithms for predicting the 0 contour from the ToBI labels or Tilt parameters have been proposed [13][37]. [111][112][117]. Moreover. the mel-cepstrum sequence including the power contour. As a powerful data-driven algorithm.cation. and phoneme duration are generated directly from HMMs trained by a F F F F F F F F F F F F F F F F F F 10 . the 0 contour. In this method. HMM-based (Hidden Markov model) speech synthesis has been proposed by Tokuda et al.

F F 2. and the duration is modeled by multi-dimensional Gaussian distribution HMMs in which each dimension shows the duration in each state of the HMM. The mel-cepstrum is modeled by either multi-dimensional Gaussian distribution HMMs or multi-dimensional Gaussian mixture distribution HMMs. All segments of a given phoneme in the speech corpus are clustered in advance into equivalence classes according to their preceding and succeeding phoneme contexts. The centroid segment of each cluster is saved as a synthesis unit.decision-tree based on a context clustering technique. prosodic di erence. an optimum set of units is selected from a speech corpus by minimizing the degradation of naturalness caused by various factors.2. A schematic diagram of HMM-based prosody generation is shown in Figure 2. and a mismatch of phonetic environments [52][89]. In these systems. the optimum synthesis units are generated or selected from a speech corpus of a single speaker in advance in order to alleviate degradation caused by spectral di erence. Decision-trees are constructed for each feature. The decision-tree for the 0 and that for the mel-cepstrum are constructed in each state of the HMM. spectral di erence. unit selection. the optimum synthesis units are selected from leaf clusters that most suit given pho11 . The decision trees that perform the clustering are constructed automatically by minimizing the acoustic di erences within the equivalence classes. contextual information is used instead of prosody information for the next procedure. In our corpus-based TTS under development. proposed an automatic procedure called Context Oriented Clustering (COC) [80]. Various types of units have been proposed to alleviate such degradation.3.g. Nakajima et al.3 Unit selection In the unit selection. are generated from the HMMs by maximizing the likelihood criterion while considering the dynamic features of speech [112]. the smooth parameter contours. HMM-based speech synthesis is applied to a prosody generation module. In this technique. As for the duration. In the speech synthesis phase. In synthesis. Some TTS systems do not perform the prosody generation [24]. The 0 is modeled by multi-space probability distribution HMMs [111]. which are static features. one decision-tree is constructed. e. All training procedures are performed automatically.

and mel-cepstrum sequence including power contour Figure 2. Schematic diagram of HMM-based prosody generation. either spectral parameter sequences [40] or waveform segments [51] are utilized. As the synthesis units.3.Contextual information Decision tree for state duration model Context dependent duration models Decision trees for spectrum Context dependent HMMs for spectrum and F0 Decision trees for F0 1st state 2nd state 3rd state Sentence HMM State duration sequence. Kagoshima and Akamine proposed an automatic generation method of synthesis units with Closed-Loop Training (CLT) [5][55]. F0 contour. netic contexts. an optimum set of synthesis units is selected or generated from a speech corpus in advance to minimize the degradation caused by synthesis processing such as prosodic modi. In this approach.

cation. A measure capturing this degradation is de.

ned as the di erence between a natural speech segment prepared as training data cut o from the speech corpus and a synthesized speech segment with prosodic modi.

cation. The selection or generation of the best synthesis unit is performed on the basis of the evaluation 12 .

the speci. In the VCV units. While. and CVC units [93]. proposed Non-Uniform Units (NUU) [52][89][106]. in the CVC units. In this approach. Sagisaka et al. the concatenation between phonemes that often produces perceptual discontinuity can be avoided by utilizing these units. speech with natural and smooth sounding quality can be synthesized. syllable. V: Vowel) units are often used since nearly all Japanese syllables consists of CV or V. Although the number of diphone waveform segments used as synthesis units is very small (only 302 segments). Therefore. diphone. phoneme. There are many types of basic synthesis units. e. CV (C: Consonant. VCV units [92].g. The units comprised of more than two phonemes can preserve transitions between phonemes. In Japanese. concatenation points are vowel centers in which formant trajectories are stabler and clearer than those in consonant centers [92]. The diphone units have unit boundaries at phoneme centers [32][84][87]. concatenation is performed at the consonant centers in which waveform power is often smaller than that in the vowel centers [93].and minimization of the measure in each unit cluster represented by a diphone [87]. In order to use stored speech data e ectively and exibly.

In this approach. An optimum set of synthesis units is selected by minimizing a cost capturing the degradation caused by spectral di erence. One is a target cost. and concatenation between units in a synthesis procedure. the selected units.c units are not selected or generated from a speech corpus in advance. di erence in phonetic environment. NUU. Campbell et al. proposed that not only factors related to spectrum and phonetic environment but also prosodic di erence are considered in selecting the optimum synthesis units [44]. which captures the degradation caused by concatenating  13 . Hirokawa et al. which captures the degradation caused by prosodic di erence and di erence in phonetic environment. The ATR -talk speech synthesis system is based on the NUU represented by a spectral parameter sequence [90]. Based on these works. and the other is a concatenation cost. Since it is possible to use all phoneme subsequences as synthesis units. i. Black et al. speech is synthesized by concatenating the selected waveform segments and then modifying their prosody. also proposed utilization of prosodic information in selecting the synthesis units [19][20]. have variable lengths.e. proposed a general algorithm for unit selection by using two costs [11][22][47].

If the size of a corpus is large enough and it's possible to select waveform segments satisfying target prosodic features predicted by the prosody generation.units. CHATR is constructed as a generic speech synthesis system [10][12][21]. By introducing these techniques to -talk. a larger-sized speech corpus is utilized than that of -talk. the sum of the two costs is minimized using a dynamic programming search based on phoneme units. In this algorithm. Since the number of considered factors increases. it is not necessary to perform prosody modi.

This waveform segment selection has become the main current of corpus-based TTS systems for any language. natural speech without degradation caused by signal processing can be synthesized by concatenating the waveform segments directly. Therefore. In recent years. and Takano et al. Therefore. The units are stored in a speech corpus as sequences of phonemes with phonetic environments. optimum waveform segments can be selected while considering both the degradation caused by concatenation and that caused by prosodic modi. in order to alleviate the perceptual discontinuity caused by concatenation.e. i.. Conkie proposed a waveform segment selection based on half-phoneme units to improve the robustness of the selection [28]. These units can preserve the important transitions for Japanese. A stored unit can have multiple waveform segments with di erent 0 or phoneme duration. V-V transitions. respectively.cation [23]. CV* units [60] and multiform units [105] have been proposed as synthesis units by Kawai et al.

Shorter units have also been proposed. the number of concatenations becomes smaller by utilizing longer units. the longer units cannot always synthesize natural speech. decision-tree state-clustered HMMs are trained automatically with a speech corpus in advance. [117]. the waveform segment selection technique is applied   F 14 . In our corpus-based TTS. In this approach. Donovan et al. In the HMM-based speech synthesis proposed by Yoshimura et al. However. proposed HMM statebased units [33][34]. In order to determine the segment sequence to concatenate. since the number of candidate units becomes small and the exibility of prosody synthesis is lost.cation. In general. the optimum HMM sequence is selected from decision-trees by utilizing phonetic and prosodic context information. a dynamic programming search is performed over all waveform segments aligned to each leaf of the decision-trees in synthesis.

2.4 Waveform synthesis An output speech waveform is synthesized from the selected units in the last procedure of TTS. 2. A schematic diagram of the segment selection is shown in Figure 2. In general.Predicted phonetic and prosodic information Target sequence e r a b u Candidate segments in speech corpus e r a b u Selected sequence e r a b u Segment sequence causing least degradation of naturalness Figure 2. Schematic diagram of segment selection. two approaches to waveform synthesis have been used.4.4. to a unit selection module. One is waveform concatenation without speech modi.

and the other is speech synthesis with speech modi.cation.

cation. In the waveform concatenation. In this case. instead of performing prosody modi. speech is synthesized by concatenating waveform segments selected from a speech corpus using prosodic information to remove need for signal processing [23].

raw waveform segments are used.cation. However. synthetic speech has no degradation caused by signal processing. degradation is caused by the prosodic di erence [44]. if the prosody of the selected waveform segments is di erent from the predicted target prosody. Therefore. In order to alleviate the degradation. it is necessary to prepare a large-sized speech corpus that contains abundant wave15 .

form segments. the naturalness is not always consistent. In the speech synthesis. The Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) algorithm is often used for prosody modi. signal processing techniques are used to generate a speech waveform with the target prosody. Although synthetic speech by waveform concatenation sounds very natural.

Speech is synthesized from a mel-cepstrum sequence generated directly from the selected HMMs and the excitation source by utilizing a Mel Log Spectrum Approximation (MLSA) .cation [79]. In the HMM-synthesis method. a mel-cepstral analysis-synthesis technique is performed [117]. Speech analysis-synthesis methods can also modify the prosody. TD-PSOLA does not need any analysis algorithm except for determination of pitch marks throughout the segments.

lter [49]. A vocoder type algorithm such as this can modify speech easily by varying speech parameters, i.e. spectral parameter and source parameter [36]. However, the quality of the synthetic speech is often degraded. As a high-quality vocoder type algorithm, Kawahara et al. proposed the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) analysis-synthesis method [58]. STRAIGHT uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region to remove signal periodicity and designs an excitation source based on phase manipulation. Moreover, STRAIGHT can manipulate such speech parameters as pitch, vocal tract length, and speaking rate while maintaining high reproductive quality. Stylianou proposed the Harmonic plus Noise Model (HNM) as a high-quality speech modi

cation technique [100]. In this model, speech signals are represented as a time-varying harmonic component plus a modulated noise component. Speech synthesis with these modi

cation algorithms is very useful in the case of a small-sized speech corpus. Synthetic speech by this speech synthesis sounds very smooth, and the quality is consistent. However, the naturalness of the synthetic speech is often not as good as that of synthetic speech by waveform concatenation. In our corpus-based TTS, both the waveform concatenation technique and STRAIGHT synthesis method are applied in the waveform synthesis module. In the waveform concatenation, we control waveform power in each phoneme segment by multiplying the segment by a certain value so that average power in a 16

phoneme segment selected from a speech corpus becomes equal to average target power in the phoneme. When the segments modi.

speech waveforms in voiced phonemes are synthesized with STRAIGHT by using a concatenated spectral sequence.ed by this power are concatenated. we use original waveforms modi. In unvoiced phonemes. A schematic diagram of the waveform concatenation is shown in Figure 2. and target prosodic features. an overlap-add technique is applied in the frame-pair with the highest correlation around a concatenation boundary between the segments. a concatenated aperiodic energy sequence.5. In the other synthesis method based on STRAIGHT.

A schematic diagram of the speech synthesis with prosody modi.ed only by power.

2. phonetic environments. 2.cation by STRAIGHT is shown in Figure 2. which should be recorded while maintaining high quality. Abe et al. In order to realize a consistently high quality of synthetic speech.5 Speech corpus A speech corpus directly in uences the quality of synthetic speech in corpusbased TTS. it is important to prepare a speech corpus containing abundant speech segments with various phonemes.6. developed a Japanese sentence set in which phonetic coverage is controlled [1]. and prosodies. This sentence set is often used not only in the .

i. is decided in advance. (1) the distributions of 0 and phoneme duration predicted by the prosody generation and (2) perceptual degradation of naturalness due to the prosody modi. The size of the sentence set.e. Kawai et al. The coverage measure captures two factors. This method selects the optimum sentence set from a large number of sentences by maximizing the measure of coverage. proposed an e ective method for designing a sentence set for utterances by taking into account prosodic coverage as well as phonetic coverage [62]. the number of sentences.eld of speech synthesis but also in speech recognition. i.e.

However. the degradation of naturalness caused by a mismatch of phonetic environments and prosodic di erence can be alleviated by increasing the size of the speech corpus.cation. In general. Concatenation between speech segments with di erent voice qualities produces F 17 . variation in voice quality is caused by recording the speech of a speaker for a long time in order to construct the large-sized corpus [63].

6. Schematic diagram of waveform concatenation. Schematic diagram of speech synthesis with prosody modi. Target prosody Waveform segments in unvoiced phonemes Power modification Parameter segments in voiced phonemes Concatenated parameters STRAIGHT synthesis Synthesized waveform Waveform concatenation Synthetic speech Figure 2.5.Waveform segments Target power Power modification Search for optimum frame-pairs around concatenative boundaris Overlap-add Synthetic speech Figure 2.

cation by STRAIGHT. 18 .

In our corpus-based TTS. A sentence set for utterances is extracted from TV news articles. These VQ-based algorithms are basic to statistical voice conversion. To deal with this problem. Abe et al. newspaper articles. Moreover. As a similar conversion method to the DFW algorithm. 2. a large-sized corpus of speech spoken by a Japanese male who is a professional narrator is under construction. In the LMR algorithm.audible discontinuity. a spectrum of a source speaker is converted with a simple linear transformation for each class. [2]. In this approach. speech segments of a source speaker selected by HMM are replaced by the corresponding segments of a target speaker. [114]. Both static and dynamic characteristics of speaker individuality can be preserved in the segment. and this idea has been proposed as a method for speaker adaptation by Shikano et al. In the algorithm using the Dynamic Frequency Warping (DFW) [114]. previous studies have proposed using a measure capturing the di erence in voice quality to avoid concatenation between such segments [63] and normalization of power spectral densities [94]. the fuzzy VQ-based algorithm using di erence vectors between mapping code vectors and input code vectors has been proposed in order to represent various spectra beyond the limitation caused by the codebook size [75][76]. has also proposed a segment-based approach [4]. modi. phrase books for foreign tourists. This algorithm has been improved by introducing the fuzzy Vector Quantization (VQ) algorithm [81]. In this method. The maximum size of the corpus used in this thesis is 32 hours.3 Statistical Voice Conversion Algorithm The code book mapping method has been proposed as a voice conversion method by Abe et al. [95]. voice conversion is formulated as a mapping problem between two speakers' codebooks. a frequency warping algorithm is performed in order to convert the spectrum. A voice conversion algorithm using Linear Multivariate Regression (LMR) has been proposed by Valbret et al. and so on by taking into account prosodic coverage as well as phonetic coverage.

an algorithm using Neural Networks has been proposed [82]. Moreover. 19 .cation of formant frequencies and spectral intensity has been proposed [78].

In these algorithms.j 1 Cj(y) . VFS/MAP [104]. we focus on a voice conversion algorithm based on Gaussian Mixture Model (GMM) proposed by Stylianou et al. HMM parameters among representative speakers' HMM sets are interpolated [118]. Although the quality of the synthetic speech is not adequate. and both training procedures and conversion procedures are performed automatically. Speech with the voice quality of various speakers can be synthesized from the interpolated HMMs directly [49][112]. In this algorithm.3. the converted spectrum is represented by a mapping codebook. a Gaussian mixture model. the average voice is often used in place of the source speaker's voice. The converted spectrum is synthesized by interpolating the spectra of multiple speakers. A code vector Ci(map) of class in the mapping codebook for a code vector Ci(x) of class in the source speaker's codebook is generated as follows: i i C (map) i = X m j =1 wi. MAP (Maximum A Posteriori) [69]. can be applied to HMM-based speech synthesis [74][107]. Utilization of correlation between features of two speakers is indeed characteristic of this algorithm.A voice conversion algorithm by speaker interpolation has been proposed by Iwahashi and Sagisaka [53]. 2.1 Conversion algorithm based on Vector Quantization In the codebook mapping method. The VQ-based algorithms mentioned above and the GMM-based algorithm are described in the following section. In HMM-based speech synthesis. only the speech data of the source speaker and target speaker are needed.1) 20 . (2. some speaker adaptation methods. e. the feature space can be represented continuously by multiple distributions. and MLLR (Maximum Likelihood Linear Regression) [71]. i. In this thesis. VFS (Vector Field Smoothing) [83]. Moreover.g. which is calculated as a linear combination of the target speaker's code vectors [2]. this attractive approach has the ability to synthesize various speakers' speech exibly.e. [98][99]. In the voice conversion by HMM-based speech synthesis.

(2.j = X m j hi.j hi.k .j denotes a histogram for the frequency of the correspondence between the code vector Ci(x) and the code vector Cj(y) in the training data. the source speaker's speech features are vectorquantized into a code vector sequence with the source speaker's codebook. In the conversion-synthesis step. i. quantization errors are caused by vector quantization. The vectors are represented not as one code vector but as a linear combination of several code vectors. The quantization errors are decreased by introducing a fuzzy VQ technique that can represent various kinds of vectors beyond the limitation caused by the codebook size [81]. Therefore. and then each code vector is mapped into a corresponding code vector in the mapping codebook. the number of code vectors. i. the converted speech is synthesized from the mapped code vector sequence having characteristics of the target speaker.wi. Since representation of the features is limited by the codebook size. the converted speech includes unnatural sounds.e. Finally. The fuzzy VQ is de.2) k=1 where Cj(y) denotes a code vector of class in the target speaker's codebook having code vectors.

u i ui X ! k di dj 1 .6) where denotes fuzziness.5) 0 j =1 ( i )f k ( j )f u u .ned as follows: m h x = 0 X k i=1 wi (f ) 1 Ci(x) . The conversion is performed by replacing the source speaker's code vectors with the mapping code vectors. f j =1 21 . (2. f 1 0 = jjx 0 Ci(x) jj (2. i denotes a fuzzy membership function of class and is given by 1 = (2. The converted vector di .3) (2.4) k wi (f ) = X where x denotes a decoded vector of an input vector x. denotes the number of the nearest code vectors to the input vector.

7) Furthermore. i.8) (2.3. X X m i=1 m i=1 i N (x. In the GMM algorithm.12) 2 (x 0 u) 6 (x 0 u) m 0 N  0   : 22 . u. 6i) . (2. i . the probability distribution of acoustic features x can be written as p x . 6 ) = (2 )p=2 exp 0 1 (2. (x. x 0 y .4) and Di denotes the di erence vector as Di = Ci(map) 0 Ci(x) : (2.  6 ) denotes the normal distribution with the mean vector  and the covariance matrix 6 and is given by j6 j 1=2 1 T (x. .x(map) is given by x(map) = X k i=1 wi (f ) 1 Ci(map) : (2. y .2 Conversion algorithm based on Gaussian Mixture Model We assume that -dimensional time-aligned acoustic features xf[ 0 1 . it is possible to represent various additional input vectors by introducing di erence vectors between the mapping code vectors and the code vectors in the fuzzy VQ-based voice conversion algorithm [75]. . . where T denotes transposition of the vector. . i =1  0. (2. In this algorithm. and denotes the total number of Gaussian mixtures. i=1 where follows: wi (f ) is given by Equation (2. .9) . p 1]Tg (target speaker's) are determined by Dynamic Time Warping (DTW). y 0 p (x) = subject to i N . the converted vector x(map) is given by x(map) = D(map) + x k X (f ) D(map) = i 1 Di w . . x .11) where i denotes a weight of class . p 1]Tg (source speaker's) and yf[ 0 1 .10) 2.

3 2 3 : (2. (iz). and 6 (iyx). (2. (ix) . p (z ) = subject to X X m i=1 m i=1 i N (z . In order to estimate parameters i. 6i(z)) . we assume that the covariance matrices. i =1  0. respectively. that in the fuzzy VQ mapping algorithm (\Fuzzy VQ"). (iy). (ix) 6 (ixx) ) (x) 6 (jxx)) j (x.15) where 6i(z) denotes the covariance matrix of class for the joint vectors and (iz) denotes the mean vector of class for the joint vectors.16) These parameters are estimated by the EM algorithm [30]. 6 (ixx). i . These are given by 6 (z ) i 26 =4 6 i (xx) i (yx) i 6i(xy) 5 (z) = 4 (ix) 5 i 6i(yy) (iy) .14) hi P m j =1 (x.3. i (2.The mapping function [98][99] converting acoustic features of the source speaker to those of the target speaker is given by F (x) = = (x) = X E m i=1 [yjx] (y ) (yx) 6 (ixx) i (x)[i + 6 i h i N   .3 Comparison of mapping functions . that 23 Figure 2. algorithm (\VQ").7 shows various mapping functions: that in the codebook mapping 2. and the cross-covariance matrices. 6 (iyx) denotes the cross-covariance matrix of class for the source and target speakers. In this thesis. 0 1 (x 0 (ix) )] . 6 (ixx) and 6 (iyy). j . where (ix) and (iy) denote the mean vector of class for the source and that for the target speakers. N . 6 (ixy) and 6 (iyx) . are diagonal. 6 (ixx) denotes the covariance matrix of class for the source speaker. the probability distribution of the joint vectors z = [xT yT]T for the source and target speakers is represented by the GMM [57] as follows: i i i .13) (2.

001 0. the accuracy is not high enough.4 5.2 5. Various mapping functions.4 4.6 4.2 5 4.7. E y x x in the fuzzy VQ mapping algorithm using the di erence vector (\Fuzzy VQ using di erence vector").6 5. The contour line denotes frequency distribution of training data in the joint feature space.6 5. We also show values of the conditional expectation [ j ] calculated directly from the training samples.8 4.2 Figure 2. x 6 6. y 5.6 5. Although the mapping function nearly approximates the conditional expectation by introducing the difference vector.005 Fuzzy VQ using difference vector GMM 5.8 Original feature.6 4. The mapping function in the codebook mapping algorithm is discontinuous because hard decision clustering is performed.4 5. it is shown that the mapping function in the GMM-based algorithm is close to the conditional expectation because the correlation between the source feature and the target feaE y x 24 . On the other hand. The mapping function becomes continuous by performing fuzzy clustering in the fuzzy VQ algorithm. \x" denotes the conditional expectation [ j ] calculated in each value of original feature . However.8 Target feature.0001 VQ Fuzzy VQ 0. and that in the GMM-based algorithm (\GMM"). The number of classes or the number of Gaussian mixtures is set to 2. the accuracy of the mapping function is bad because the mapping function seems to be di erent from the conditional expectation values.8 5 0.

conventional voice conversion algorithms are described. The GMM-based algorithm can convert spectrum more smoothly and synthesize converted speech with higher quality compared with the codebook mapping algorithm [109]. Therefore. its naturalness is still inadequate.4 Summary This chapter described the basic structure of corpus-based Text-to-Speech (TTS) and reviewed the various techniques in each module. Moreover. since the covariance can be considered in the GMM-based algorithm. the mapping function of the voice conversion algorithm based on the Gaussian Mixture Model (GMM) is the most practical and has the highest conversion-accuracy among the conventional algorithms.ture can be utilized. However. the mapping function in the GMM-based algorithm is the most reasonable and has the highest conversion-accuracy among the conventional algorithms. 2. 25 . Corpus-based TTS improve the naturalness of synthetic speech dramatically compared with rule-based TTS. We also introduced some techniques applied to the corpus-based TTS under development. Gaussian mixtures can represent the probability distribution of features more accurately than the VQ-based algorithms. From the results of comparing various mapping functions. and exible synthesis has not yet been achieved. Moreover.

e. In order to alleviate such discontinuities. longer units. we propose a novel segment selection algorithm that incorporates not only phoneme units but also diphone units. and these correspond to symbols in the Japanese `Kana' syllabary. To address this problem. The advantage of considering both types of units is examined by experiments on concatenation of vowel sequences. 26 . Since Japanese syllables consist of CV (C: Consonant or consonant cluster. except when a vowel is devoiced. However.g. The concatenation in the proposed algorithm is allowed at the vowel center as well as at the phoneme boundary.Chapter 3 A Segment Selection Algorithm for Japanese Speech Synthesis Based on Both Phoneme and Diphone Units This chapter describes a novel segment selection algorithm for Japanese TTS systems. Moreover. CV units are often used in concatenative TTS systems for Japanese. have been proposed. since various vowel sequences appear frequently in Japanese. However. the results of perceptual evaluation experiments clarify that the proposed algorithm outperforms the conventional algorithms. CV* units. V: Vowel or syllabic nasal /N/) or V. it is not realistic to prepare long units that include all possible vowel sequences. speech synthesized with CV units sometimes has discontinuities due to V-V or V-semivowel concatenation.

since various vowel sequences appear frequently in Japanese. It is also well known that transitions from C to V. Formant transitions are more stationary at vowel centers than at vowel boundaries.1 Introduction In Japanese. However.000) [67]. extended the CV unit to the CV* unit [60]. an English TTS system based on CHATR has been adapted for diphone units by AT&T [7]. it is not realistic to construct a corpus that includes all possible vowel sequences. Therefore. and this system has proved to be an improvement over the previous system. other units are often used in TTS systems for English because the number of syllables is enormous (over 10. As typical Japanese TTS systems that utilize the concatenation at the vowel centers. Kawai et al. V: Vowel or syllabic nasal /N/) syllables as synthesis units. Therefore. Furthermore. On the other hand. the concatenation between V and V is unavoidable. concatenation at the vowel centers tends to reduce audible discontinuities compared with that at the vowel boundaries. a speech corpus can be constructed eciently by considering CV (C: Consonant or consonant cluster. In recent years. CV units are often used in concatenative TTS systems for Japanese. optimum units are selected from a speech corpus to minimize the total cost calculated as the sum of some sub-costs [52][90][106]. TOS Drive TTS (Totally Speaker Driven Text-to-Speech) has been constructed 27 . various sized sequences of phonemes are selected [11][22][47]. In Japanese TTS. since Japanese syllables consist of CV or V except when a vowel is devoiced. are very important in auditory perception [89][105]. Sagisaka proposed non-uniform units to use stored speech data e ectively and exibly [89]. As a result of dynamic programming search based on phoneme units. VCV units are based on this view [92]. In order to alleviate these discontinuities.3. or from V to V. CV syllables correspond to symbols in the Japanese `Kana' syllabary and the number of the syllables is small (about 100). Therefore. If the coverage of prosody is also to be considered. speech synthesized with CV units has discontinuities due to V-V or V-semivowel concatenation. the NextGen TTS system based on half-phoneme units has been constructed [8][28][102]. The frequency of vowel sequences is described in Appendix A. the corpus becomes enormous [62]. which has been supported by our informal listening test. In this algorithm.

On the other hand. a mapping of phonetic information into perceptual measures can be determined from the results of perceptual experiments [64]. evaluation experiments are described. into a perceptual measure. However. A mapping of acoustic measures into a perceptual measure is generally not practical except when the acoustic measures have simple structure. both TTS systems take into account only the concatenation at the vowel centers in vowel sequences. 3. as in the case of 0 or phoneme duration.2 Cost Function for Segment Selection The cost function for segment selection is viewed as a mapping.g. e. The chapter is organized as follows. In Section 3. ThereF 28 . the novel segment selection algorithm is described. Therefore. cost functions for segment selection are described. the advantage of performing concatenation at the vowel centers is discussed. both types of concatenation should be considered in vowel sequences. A cost is considered the perceptual measure capturing the degradation of naturalness of synthetic speech. and the other contextual information is converted into acoustic measures by the prosody generation. In this thesis.2. In Section 3. Acoustic measures with complex structure. acoustic measures and contextual information. we propose a novel segment selection algorithm incorporating not only phoneme units but also diphone units. of objective features.by TOSHIBA [55] and Final Fluet has been constructed by NTT [105]. Finally. as shown in Figure 3. Thus.1.4. In Section 3. have not been found so far [31][66][101][115]. concatenation at the vowel boundaries is not always inferior to that at the vowel centers. In Section 3. diphone units are used if the desirable CV* units are not stored in the corpus. such as spectral features that are accurate enough to capture perceptual characteristics. only phonetic information is used as contextual information. The former TTS is based on diphone units.6.5. In the latter TTS. The results of evaluation experiments clarify that the proposed algorithm outperforms the conventional algorithms. In this chapter. The proposed algorithm permits the concatenation of synthesis units not only at the phoneme boundaries but also at the vowel centers. The components of the cost function should be determined based on results of perceptual experiments. we summarize this chapter in Section 3.3.

fore. Therefore.1 Local cost The local cost shows the degradation of naturalness caused by utilizing an individual candidate segment. it is possible to capture the perceptual characteristics by utilizing such a mapping.Observable features Acoustic measure Spectrum Duration F0 Contextual information Phonetic context Accent type Perceptual measure Cost function Cost Mapping observable features into a perceptual measure Figure 3.1. 3. acoustic measures that can represent the characteristic of each segment are still necessary.2. we utilize both acoustic measures and perceptual measures determined from the results of perceptual experiments. The cost function is comprised of . However. Schematic diagram of cost function. since phonetic information can only evaluate the di erence between phonetic categories.

Each sub-cost re ects either source information or vocal tract information. The local cost is calculated as the weighted sum of the .ve sub-cost functions shown in Table 3.1.

ti ) wF0 1 CF0 (ui . The local cost ( i i) at a candidate segment i is given by LC u . ui 0 1 ) 29 . t u LC ui .ve sub-costs. ti ( ) = wpro + 1 Cpro (ui .

In each phoneme.2 Sub-cost on prosody: F Cpro F This sub-cost captures the degradation of naturalness caused by the di erence in prosody ( 0 contour and duration) between a candidate segment and the target.2 shows targets and segments used to calculate each sub-cost in the calculation of the cost of a candidate segment i for a target i. t w . Figure 3. w .1) (3. These sub-cost functions are described in the following subsections.Table 3. ti ). a phoneme is divided into several parts.w . All sub-costs are normalized so that they have positive values with the same mean. In order to calculate the di erence in the 0 contour. the prosodic cost is represented as an average of the F 30 .2) where i denotes a target phoneme. When the candidate segments i 1 and i are connected in the corpus.2. Sub-cost functions Source information Prosody ( 0 .1.w . + wapp =1 (3. w u 0 i t 0 u 0 u u t 3. concatenation between the two segments is not performed. i. pro F0 env spec and app denote the weights for individual sub-costs. ui 1 ) 1 0 ) . duration) 0 discontinuity Vocal tract information Phonetic environment Spectral discontinuity Phonetic appropriateness F F Cpro CF 0 Cenv Cspec Capp wpro + wF0 + + + + wenv wspec wapp 1 Cenv (ui . and the di erence in an averaged log-scaled 0 is calculated in each part.e. wspec 1 Cspec (ui .2. 0. these weights are equal. ui 0 wenv + 1 Capp (ui . The preceding segment i 1 shows a candidate segment for the ( 0 1)-th target phoneme i 1 . In this thesis.

which is calculated for each phoneme and used in the calculation of the cost in each part. u i . denotes the nonlinear function and is described in Appendix B. M m=1 P DF0 ui . u i . The function was determined from the results of perceptual experiments on the degradation of naturalness caused by prosody modi. denotes the number of divisions. u i .Targets t i-1 t i t C pro ( u i . respectively.1 ) C F0 ( u i . t C app ( u i . i and i show phonemes considered target and candidate segments.2. d denotes the di erence in the duration. ti . ti ( ( (3.3) F D where F0 ( i i ) denotes the di erence in the averaged log-scaled 0 in the -th divided part. The sub-cost Cpro ui . m . u t t u costs calculated in these parts. Dd ui . F0 is set to 0.1 ) C spec ( u i .1 ) Segments ui-1 ui ) ) ui+1 Figure 3. Targets and segments used to calculate each sub-cost in calculation of the cost of a candidate segment i for a target i. ti ( )= 1 X M Cpro is given by ) ( )) . t i i i+1 C env ( u i . In the unvoiced phoneme.

cation. assuming that the output speech was synthesized with prosody modi.

cation. When prosody modi.

the function should be determined based on other experiments on the degradation of naturalness caused by using a di erent prosody from that of the target. D u .4) ( )= ( P DF0 ui . 31 .3 Sub-cost on F0 discontinuity: C CF ) 0) 0 F0 This sub-cost captures the degradation of naturalness caused by an nuity at a segment boundary.m m D M P P 3.cation is not performed. ui01 disconti(3. The sub-cost F0 is given by CF0 ui . ( .2. ui01 .t .

5 Sub-cost on spectral discontinuity: Cspec This sub-cost captures the degradation of naturalness caused by the spectral discontinuity at a segment boundary. while p denotes the preceding phoneme in the corpus. The sub-cost functions s and p are determined from the results of perceptual experiments described in Appendix C. this sub-cost is set to 0. s( i 1 s( i 1) i) denotes the degradation caused by the mismatch with the succeeding environment in the phoneme for i 1.4 Sub-cost on phonetic environment: Cenv This sub-cost captures the degradation of naturalness caused by the mismatch of phonetic environments between a candidate segment and the target.e.t u 0 E u 0 t S u . we utilize the function in Equation (3.6) by considering that a phoneme for i is equal to a phoneme for i and a phoneme for i 1 is equal to a phoneme for i 1 . i. Even if a mismatch of phonetic environments does not occur. Therefore. The sub-cost env is given by C Cenv ui .5) into Equation (3. replacing s ( i 1 ) with the phoneme for i .6) 1 )g 2 = . and p ( i p ( i ) i 1 ) denotes the degradation caused by the mismatch with the preceding environment in the phoneme i.5) (3. s denotes the succeeding phoneme in the corpus.E u S u .t 0 u E u t 0 S S u 0 u 3.where F0 denotes the di erence in log-scaled 0 at the boundary.E u u 0 .t u i0 ) ) + p( i p( i) 1) i) + p( i p( i) S u .2. E ( s( ui01 . replacing p ( i) with the phoneme for i 1. Es 0 1. the sub-cost becomes 0. u t t 0 S S E E S u 0 . ti0 )g 2 (3.E u 0 .3). In order to normalize a dynamic range of the sub-cost. F0 is set to 0 at the unvoiced phoneme boundary. When the segments i 1 and i are connected in the corpus. s denotes the sub-cost function that captures the degradation of naturalness caused by the mismatch with the succeeding environment. where we turn Equation (3. ui . i.E u . When the segments i 1 and i are connected in the corpus.e.2. and p denotes that caused by the mismatch with the preceding environment. ui01 . the sub-cost does not necessarily become 0 because this sub-cost re ects the diculty of concatenation caused by the uncertainty of segmentation. ui01 ( ) = = fSs (ui fSs (ui 0 1 . = . D F D P u 0 u 3. This sub-cost is calculated as the weighted 32 .

Concatenation is performed between the 01-th frame of i 1 and the 0-th frame of i . (i i1 ) denotes the mel-cepstral distortion between the -th frame from the concatenation frame ( = 0) of the preceding segment i 1 and the -th frame from the concatenation frame ( = 0) of the succeeding segment i in the corpus. f . s is a coecient to normalize the dynamic range of the sub-cost.f f f u 0 f f u u 0 u c 20 1 2 1 40 ( ln 10 d=1 mc mc d . The sub-cost spec is given by Cspec ui . ui01 ( )= cs 1 w=201 f =0w=2 X C h f M C D ui . The mel-cepstral distortion calculated in each frame-pair is given by h w .sum of mel-cepstral distortion between frames of a segment and those of the preceding segment around the boundary. ( ) ( ) (3.7) M CD u . u 0 where denotes the triangular weighting function of length . ui01 .

v u u t X mc (d) 0 mc.

) (3.8) where ( d) and (. (d) 2 .

When the segments i 1 and i are connected in the corpus. C E N (ti )).6 Sub-cost on phonetic appropriateness: Capp This sub-cost denotes the phonetic appropriateness and captures the degradation of naturalness caused by the di erence in mean spectra between a candidate segment and the target. the conversion algorithm proposed by Oppenheim et al. u 0 u 3. respectively. (3.d) show the -th order mel-cepstral coecient of a frame and that of a frame . Mel-cepstral coecients are calculated from the smoothed spectrum analyzed by the STRAIGHT analysis-synthesis method [58][59]. ti ( )= ct 1 M C D (C E N (ui ). denotes the mel-cepstral distortion between the mean cepstrum of the segment i and that of the target i. The mel-cepstral distortion is given CEN M CD u t c 33 . t is a coecient to normalize the dynamic range of the sub-cost. The sub-cost app is given by C Capp ui .2.9) where denotes a mean cepstrum calculated at the frames around the phoneme center. Then. is used to convert cepstrum into mel-cepstrum [85]. this sub-cost becomes 0.

we integrate local costs for individual segments into a cost for a segment sequence as shown in Figure 3. tN ) Integration-function Integrated cost Mapping local costs into an integrated cost Figure 3. We utilize the mel-cepstrum sequence output from contextdependent HMMs in the HMM synthesis method [117] in calculating the mean cepstrum of the target ( i). the optimum set of segments is selected from a speech corpus.3. In this thesis. by Equation (3. CEN t 3. t 2 ) uN LC(uN . This cost is de. t 1 ) u2 LC(u2 . Schematic diagram of function to integrate local costs LC .8). this sub-cost is set to 0 in the unvoiced phoneme. Therefore.7 Integrated cost In segment selection.2.3.Targets Segments t 0 t 1 t 2 t N u0 u1 LC(u1 .

10) t u where denotes the number of targets in the utterance. and it is given by AC AC =1 u N 1 X N i=1 LC ui . ( ) (3. ti .ned as an integrated cost. 0 ( 0 ) shows the pause before the utterance and N ( N ) shows the pause after the utterance. The subN t 34 . The average cost is often used as the integrated cost [11][22][25][47][102]. The optimum segment sequence is selected by minimizing the integrated cost.

C C 3.costs pro and app are set to 0 in the pause. At vowel boundaries. discontinuities can be observed at the concatenation points.3 Concatenation at Vowel Center boundary and at a vowel center. Minimizing the average cost is equivalent to minimizing the sum of the local costs in the selection. This is because it is not easy to .

.nd a synthesis unit satisfying continuity requirements for both static and dynamic characteristics of spectral features at once in a restricted-sized speech corpus. in contrast. At vowel centers.

it is dicult to directly show the e ectiveness achieved by using the concatenation at vowel centers since various factors are considered in segment selection. In order to investigate the instability of spectral characteristics in the vowel. it is assumed that the discontinuities caused by concatenating vowels can be reduced if the vowels are concatenated at their centers.2. The results are shown in Figure 3. As the spectral feature. In order to clarify this assumption. the distances of static and dynamic spectral features were calculated between centroids of individual vowels and all segments of each vowel in a corpus described in the following subsection. Therefore. we need to investigate the e ectiveness of concatenation at vowel centers in segment selection. However. it is expected that more synthesis units reducing the spectral discontinuities can be found.5. From these results. the formant trajectories are continuous at the concatenation points. Therefore.5. As a result. It is obvious that the spectral characteristics are stabler around the vowel center than those around the boundary. we used the mel-cepstrum described in Section 3.nding a synthesis unit involves only static characteristics. because the spectral characteristics are nearly stable. and their transition characteristics are well preserved. we .

we compare concatenation at vowel boundaries with that at vowel centers by the mel-cepstral distortion. When a vowel sequence is generated by concatenating one vowel segment and another vowel segment.rst investigate this e ectiveness in terms of spectral discontinuity. In this section. the mel-cepstral 35 Figure 3.4 compares spectrograms of vowel sequences concatenated at a vowel . which is one of the factors considered in segment selection.

05 0.15 0.2 0.4. The utterances had a duration of about 30 minutes in total (450 sentences).25 3000 (b) Concatenation at vowel center Frequency [Hz] 2500 2000 1500 1000 500 0 0 0.3. We used a speech corpus comprising Japanese utterances of a male speaker.6.25 Phoneme units Diphone units Figure 3.15 0. where segmentation was performed by experts and 0 was revised by hand. Spectrograms of vowel sequences concatenated at (a) a vowel boundary and (b) a vowel center. 3.2 0.1 Time [s] 0.05 0. The vowel center shows a point of a half duration of each vowel segment. distortion caused by the concatenation at vowel boundaries and that at vowel centers are calculated. F 36 .Input phonemes 4000 3500 a phoneme 2 s o i 3000 Frequency [Hz] (a) Concatenation at boundary 2500 2000 1500 1000 500 Concatenation points 0 4000 3500 0 0.1 Experimental conditions The concatenation methods at a vowel boundary and at a vowel center are shown in Figure 3.1 Time [s] diphone 2 s 0.

7) was calculated. Statistical characteristics of static feature and dynamic feature of spectrum in vowels.8 0.3. The sampling frequency was 16.000 Hz. substitution of phonetic environments was not prohibited. The concatenation at vowel boundaries and that at vowel centers were performed by using all of the vowel sequences in the corpus.e.6 Normalized time 0. i. Concatenation at vowel centers (\Vowel center") can generally reduce the discontinuity caused by the concatenation compared with that at vowel bound37 .6 shows all vowels.2 0.8 1 0.5 Distance of ∆ mel-cepstrum 0. \V*" in Figure 3.4 0.4 0.7 shows frequency distribution of distortion caused by concatenation. \Normalized time" shows the time normalized from 0 (preceding phoneme boundary) to 1 (succeeding phoneme boundary) in each vowel segment.0.2 Experiment allowing substitution of phonetic environment In this experiment. All segments of \V1" having a vowel as the succeeding phoneme in the corpus were used.5. the weighted sum of the mel-cepstral distortion given by Equation (3. In each segment-pair.9 Static Distance of mel-cepstrum Dynamic 0.6 0 0. and then the coecient s was set to 1.2 Figure 3.3 0.7 0. Figure 3. c 3.

\Vfh" and \Vlh " show the .V2 Concatenation at vowel boundary Segment 1 V1 V* V1 Concatenation at vowel center Segment 1 V1 fh V1 lh V* Segment 2 V1 V2 V2 Segment 2 V1 fh V1 lh V2 V1 fh V1 lh V2 Figure 3. Concatenation methods at a vowel boundary and a vowel center.6. \V*" shows all vowels.Target vowel sequence: V1 .

as the frequency distribution shifts to the left side. e. aries (\Vowel boundary"). All segments of \V1" having \V2" as the succeeding phoneme in the corpus were used.e. In the segment selection. From this point of view." 38 . it was found that segment selection was better when using concatenation at vowel centers along with substitution of phonetic environments. 3.3.3 Experiment prohibiting substitution of phonetic environment In this experiment. respectively. substitution of phonetic environments was prohibited.g. prosodic distance. i. the number of segments that can reduce distortion increases. \V*" = \V2.rst half-vowel and the last half-vowel. Therefore. it is important to select the segments that can reduce not only the spectral discontinuity but also the distortion caused by various factors.

we selected the best type of concatenation in each segment-pair by allowing both vowel center and vowel boundary concatenations.4.8 shows the frequency distribution of distortion caused by the con- Motivated by the considerations in Section 3.10). This approach (\Vowel boundary & Vowel center") can reduce the discontinuity in concatenation compared with concatenation performed only at vowel boundaries or only at vowel centers. a target phoneme sequence. Therefore.e. The distortion caused by the concatenation at vowel centers is almost equal to that at vowel boundaries.catenation between vowels that have the same phonetic environment. 3. we also describe the conventional segment selection algorithm based on phoneme unit for comparison with the proposed algorithm.4 Segment Selection Algorithm Based on Both Phoneme and Diphone Units An input sentence. The local costs of candidate segments for each target phoneme are calculated by Equation (3.8.e. Here. As a result.1). we propose a novel segment selection algorithm based on both phoneme and diphone units. The optimum set of segments is selected from a speech corpus by minimizing the average cost given by Equation (3.3. is divided into phonemes. the performance of concatenation at vowel centers is the same as that at vowel boundaries when substitution of phonetic environments is not performed. Next. i. the sum of the local costs. Figure 3. 39 . This shows that the number of better segments increases by considering both types of concatenation. i.1 Conventional algorithm 3. Frequency distribution of distortion in this case is shown in Figure 3. These results clarify that better segment selection can be achieved by considering not only the concatenation at vowel centers but also that at vowel boundaries in vowel sequences. non-uniform units based on phoneme units can be used as synthesis units.

08 0.35.02 0 Vowel boundary Mean 7. Frequency distribution of distortion caused by concatenation between vowels that have the same phonetic environment.06 0. 1. 40 . S.D.D.23.06 0. S.64 Vowel boundary & Vowel center Mean 7. S. 0. S.D.7. 1.D.8. \S.39 Normalized frequency 0 2 4 6 8 10 12 14 16 Weighted sum of mel-cepstral distortion [dB] 18 Figure 3.87.02 0 Vowel boundary Mean 9. 1.74 Normalized frequency 0 2 4 6 8 10 12 14 16 18 Weighted sum of mel-cepstral distortion [dB] 20 Figure 3.12 0. " shows standard deviation.D.04 0.D. 1.78.04 0. 2. Frequency distribution of distortion caused by concatenation between vowels in the case of allowing substitution of phonetic environment.10 0.60 Vowel center Mean 7.0.07 Vowel center Mean 8.08 0.17. S.12 0.10 0.

Example of segment selection based on phoneme units.9. Therefore.9 shows an example of the conventional segment selection based on When concatenation is allowed at the vowel centers. phoneme units. we do not allow C-V concatenation since the transition from C to V is very important in auditory perception to the intelligibility in Japanese [89][105]. l We assumed that candidate segments f i and i are the . Concatenation at C-V boundaries is prohibited.2 Proposed algorithm Figure 3. In this thesis.Input sentence = sil ts ts u u i i / ts u (C V y y i V a a y a C V s s s/ C) sil Figure 3. The input sentence is \tsuiyas" (\spend" in English).4. Each half-vowel segment has a half duration of the original vowel segment. 3. the half-vowel segments dividing the vowel segments are utilized to take account of diphone units. the segment sequences comprised of non-uniform units based on the syllable were selected.

which is l divided into the .rst half-vowel segment of the original vowel segment 1i and the last half-vowel segment of the original vowel segment 2i . respectively. Sub-costs for a target phoneme i.

dur u dur u dur u dur u 41 .t w C u .12) l f = f f ( i ) + ( li ) ( i ) + ( li) u u u u t t t C w C u .rst half-phoneme f i and the last half-phoneme i .11) f 1 pro ( i i ) + l 1 pro ( i i ) f l ( i) ( i) = (3. w dur u . w dur u .t . are calculated as follows:  The pro sub-cost is calculated as the weighted sum of the sub-costs calculated at the half-vowel segments and is given by f f l l (3.

M= where the weights f and l are de.

( f ( li) denote duration of the .ned according to durations of the segments.

the sub-cost functions.13)  The env sub-cost is calculated as the sum of the sub-costs at a phoneme boundary and a vowel center and is given by C Cenv ui . ti+1 ) = .t 0 . ui C ( env ( d ( + p S l f ) u1i . ui01 ( f ) + = Cenv ui . sd and p . E ) + f sd( 1i s( 1i ) p ( 2i ) i 1 )g 2 S u . S S S S  The spec sub-cost is calculated as the sum of the sub-costs at a phoneme boundary and a vowel center and is given by C Cspec ui . d On the other hand.rst half-vowel segment i ) and f l i and that of the last half-vowel segment i .14) where the phonetic environments of the half-vowel segment are equal to those of the original vowel segment divided into the half-vowel segments. In calculating the pro for the half-vowel segments. for the concatenation at phoneme boundaries. (3. 1i and 2i divided into the half-vowel segments. s and p . ui ( l f ) : (3. ui01 ( f )+ CF0 ui .15)  The app sub-cost is calculated as the weighted sum of the sub-costs calculated at the original vowel segments. for the concatenation at vowel centers are not equal to the sub-cost functions. w w dur u dur u u u C C  The F0 sub-cost is calculated as the sum of the sub-costs at a phoneme boundary and a vowel center and is given by CF0 ui . which are given by Equation (3. ti ) + wl 1 Capp (u2i . (3. 42 .E u u .16) where the weights wf and wl are given by Equation (3. respectively. each half-vowel segment is divided into 2 parts.6). li and f i . ui ( l f ) : (3.12). ti ). ui01 ( f )+ Cspec ui . and is given by C u u u u wf 1 Capp (u1i . ui01 u 2i .

if a phoneme unit l f l f is used. u i .1 ) C F0 ( u fi . become i f 0.1 ) C spec ( u fi .1 ) Figure 3. u fi ) C F0 ( u li . and F0 ( i i ). since i 1 and i are connected in the corpus.10. the sub-costs. the sub-costs. env ( li f i ). and F0 ( f i 1 ).t Targets t i-1 f i Division of a vowel into half vowels t l i t ) f ) i i f i t i+1 w f .u t l calculation of the cost of candidate segments f i i for a target i . l since f i and i are connected in the corpus. env ( i i 1 ). Moreover. it is not easy to . The advantage of using phoneme units is the ability to preserve the characteristics of phonemes. it might be assumed that slight spectral discontinuities in the transitions in which spectral features change rapidly are hard to perceive. Targets and segments used to calculate each sub-cost in calculation l of the cost of candidate segments f i i for a target i . Furthermore. spec ( i i ). non-uniform units based on both phoneme units and diphone units can be used as synthesis units. u i . t Segments u i-1 w l . u i . C pro ( u li . In the other phonemes where concatenations are not allowed at their centers. u fi ) C spec ( u li . If a diphone unit f f is used. Phoneme units and diphone units have their own advantages and disadvantages. u fi ) u l l ) i) i i+1 C env ( u fi . t u fi u li C env ( u li . also become 0. As a result. the costs are calculated in the same way as the conventional algorithm. C app ( u fi . The optimum segment sequence is selected from a speech corpus by minimizing the average cost. t w f . However. C pro ( u fi . C app ( u li . spec( i i 1 ). u . t w l .

u 0 C u .u C u .u 0 u 0 u C u .nd synthesis units that can reduce the spectral discontinuities since static and dynamic characteristics of spectral feau .u 0 Figure 3.u u u 43 .10 shows targets and segments used to calculate each sub-cost in the C u .u t C u .u C u .

In the transitions from V to a semivowel or a nasal. the half-vowel segments are not used except for the segments having silences as phonetic environments. the cost is used to determine which of the two units is better. the advantage of using diphone units is the ability to preserve transitions between phonemes and to concatenate at steady parts in phonemes. a a s s sil sil ts ts ts u u u u u i i i y a Figure 3. 44 . more synthesis units reducing the spectral discontinuities can be found. it might be assumed that spectral discontinuities in steady parts are easy to perceive. tures should be considered in the transitions. Concatenation at C-V boundaries and selection of isolated half-vowels are prohibited. In the proposed algorithm. An example of the proposed segment selection algorithm is shown in Figure 3. On the other hand. However. Therefore. minimum units preserve either the important transitions between phonemes or the characteristics of Japanese syllables. We allow concatenations at vowel centers not only in transitions from V to V but also in transitions from V to a semivowel or a nasal. and /i-y/ as well as phoneme units are considered in the segment selection.11. Example of segment selection based on phoneme units and diphone units. In this thesis. Diphone units such as /ts-u/.11. Therefore.Input sentence = / ts u (C V i i i V y y y a s/ C V C) /y/ is semivowel. /u-i/. diphone units that start from the center of a vowel in front of consonants are used.

the weight env in Equation (3. Since the number of Japanese syllables is much smaller than that of English syllables. we don't use half-phoneme units as used in the AT&T NextGen TTS system [28].4.12. In order to make a fair comparison. An example of a segment selection algorithm based on half-phoneme units is shown in Figure 3.12. we compared the proposed algorithm with the conventional algorithm based on half-phoneme units.sil ts ts Input sentence = ts ts u u u u i i / ts u i i i y y y a y y a a s/ a a s s s s sil Figure 3. Therefore. As a result of a preference test on the naturalness of synthetic speech. The proposed algorithm can be considered an algorithm based on half-phoneme units adapted for Japanese speech synthesis by restricting some types of concatenation. However. Example of segment selection based on half-phoneme units. Namely. we can construct a speech corpus containing all of the syllables. 3.3 Comparison with segment selection based on halfphoneme units Our TTS under development is mainly for Japanese speech synthesis.1) was set to 0 since we have not yet determined s and p in sub-cost env for almost all of the halfphoneme units. it might be assumed that half-phoneme units would also work well for Japanese speech synthesis. the 95% con. we restrict minimum synthesis units to syllables or diphones in order to preserve important transitions. Therefore.

especially w S S C 45 . Therefore. This result shows that as the length of units used in segment selection becomes short. it is important to use a cost that is accurate enough to capture the audible discontinuities.98%. although more combinations of units can be considered to reduce the prosodic di erence.dence interval of the proposed algorithm's preference score was 66.67 6 3. the risk of causing more audible discontinuities by excessive concatenations becomes high.

The speech was synthesized with prosody ( 0 contour.5 Experimental Evaluation In order to evaluate the performance of the proposed algorithm. and the number of concatenations in each algorithm is shown in Table 3.2.3. A preference test was performed with synthesized speech of 10 Japanese sentences. we also compared the proposed algorithm with another conventional algorithm. duration. and the number of concatenations in each algorithm is shown in Table 3.in segment selection based on short units. In Experiment A. and V-N sequences. V-S. all of the synthesized speech were comprised of 453 phonemes. These sentences were not from the speech corpus used in the segment selection. The sentences used in Experiment A were di erent form those in Experiment B. In the other experiment.3.5. Experiment B.1. which is described in Section 3. which allows concatenation only at vowel centers in V-V.1 Experimental conditions We used the speech corpus comprising Japanese utterances of a male speaker. 3. We call the former comparison Experiment A and the latter comparison Experiment B. all of the synthesized speech were comprised of 366 phonemes. 3. The speech was synthesized by the proposed segment selection algorithm or the conventional algorithm in each experiment. and power) modi. we compared the proposed algorithm with the conventional algorithm. The natural prosody and the mel-cepstrum sequences extracted from the original utterances were used to investigate the performance of the segment selection algorithms. Moreover. which allows concatenation only at phoneme boundaries.

In each trial. a pair of utterances synthesized with the proposed algorithm and the conventional algorithm was presented in random order. Ten listeners participated in the experiment. F 46 . and the listeners were asked to choose either of the two types of synthetic utterances as sounding more natural.cation by using STRAIGHT.

In both experiments. \S" and \N" show semivowel and nasal. and that in Experiment B was 64. 3. and V-N sequences Concatenation V-C V-V V-S V-N Center Proposed algorithm 137 10 6 23 22 Conventional algorithm 143 62 3.13. Number of concatenations in experiment comparing proposed algorithm with segment selection allowing only concatenation at vowel center in V-V. The preference score of the proposed algorithm in Experiment A was 69. we proposed a novel segment selection algorithm for Japanese speech synthesis with both phoneme and diphone units to avoid the degradation of naturalness caused by concatenation at perceptually important transi47 .2 Experimental results Results of Experiment A and those of Experiment B are shown in Figure 3. V-S. Number of concatenations in experiment comparing proposed algorithm with segment selection based on phoneme units.6 Summary In this chapter.3. the preference scores of the proposed algorithm exceeded 50% by a large margin.Table 3.25%.2.25%.5. These results demonstrate that the proposed algorithm can synthesize speech more naturally than the conventional algorithms. \Center" shows concatenation at vowel center Concatenation V-C V-V V-S V-N Center Proposed algorithm 125 6 3 11 25 Conventional algorithm 124 16 3 20 Table 3.

Proposed algorithm Conventional algorithm 95% confidence interval

Exp. A

Exp. B

0

20

60 40 Preference score [%]

80

100

Figure 3.13. Results of comparison with the segment selection based on phoneme units (\Exp. A") and those of comparison with the segment selection allowing only concatenation at vowel center in V-V, V-S, and V-N sequences (\Exp. B"). tions between phonemes. In the proposed algorithm, non-uniform units allowing concatenation not only at phoneme boundaries but also at vowel centers can be selected from a speech corpus. The experimental results of concatenation of vowel sequences clari

ed that better segments reducing the spectral discontinuities increases by considering both types of concatenation. We also performed perceptual experiments. The results showed that speech synthesized with the proposed algorithm has better naturalness than that of the conventional algorithms. Although we restrict minimum synthesis units to syllables or diphones, these units are not always the best units. The best unit de

nition is expected to be determined according to various factors, e.g. corpus size, correspondence of the cost to perceptual characteristics, synthesis methods, and the kinds of languages. Consequently, we need to investigate various units based on this view. Moreover, we need to clarify the e ectiveness of searching optimal concatenation frames [27] after segment selection, although we have no acoustic measure that is accurate enough to capture perceptual characteristics. 48

Then. which is a ected by both the average cost and the maximum cost. From the results of experiments comparing this approach with segment selection based on the average cost. Furthermore. we investigate the e ect of using the RMS cost for segment selection. it is found that (1) in segment selection based on the RMS cost a larger number of concatenations causing slight local degradation are performed to avoid concatenations causing greater local degradation. The results show that the conventional average cost. which shows the degradation of naturalness over the entire synthetic speech. we evaluate various costs for a segment sequence in terms of correspondence of the cost to perceptual scores determined from results of perceptual experiments on the naturalness of synthetic speech. has better correspondence to the perceptual scores than the maximum cost.Chapter 4 An Evaluation of Cost Capturing Both Total and Local Degradation of Naturalness for Segment Selection In this chapter. We also clarify that the naturalness of synthetic speech can be slightly improved by utilizing the RMS cost. 49 . which shows the local degradation of naturalness. has the best correspondence. it is shown that RMS (Root Mean Square) cost. and (2) the e ect of the RMS cost has little dependence on the size of the corpus.

it is doubtful whether this correspondence is preserved. In segment selection. Therefore. we compare segment selection based on the RMS cost with that based on the average cost. Therefore. Moreover. it might be assumed that local degradation of naturalness would have a great effect on the naturalness of synthetic speech. the optimum set of segments is selected from a speech corpus by minimizing the integrated cost for a segment sequence. e. As a result. Then various functions to integrate the costs are evaluated in terms of correspondence to the MOS (Mean Opinion Score) determined from the results of perceptual experiments on the naturalness of synthetic speech. In the design process of the cost function.g. However. Here. which shows the degradation of naturalness over the entire synthetic speech. it is important to utilize a cost that corresponds to the perceptual characteristics to synthesize speech naturally [70][113]. which is described in Section 3. speech synthesis based on segment selection has recently become the focus of much work on synthesis [24][102].2. which is a ected by both the average cost and the maximum cost showing the local degradation of naturalness. although the average cost.4. utilization of acoustic measures that are not accurate enough to capture perceptual characteristics [64][101]. To realize TTS with high quality and robustness. and independence among various factors. is often used [11][22][47].2 to the perceptual scores. Selected segment sequences are analyzed from various points of view to clarify how the local degradation of naturalness can be alleviated by utilizing the RMS cost. such a cost has not been found so far [31][66][103][115]. We also clarify that the naturalness of synthetic speech can be slightly improved by using the RMS cost in segment selection. a direct investigation of the relationship between a perceptual measure and the cost is worthwhile. however. since there are some approximations. We also clarify the relationship between the e ectiveness 50 . In order to investigate the e ect of considering not only the total degradation of naturalness of synthetic speech but also the local degradation in segment selection.7. we show that the RMS (Root Mean Square) cost.1 Introduction In corpus-based concatenative TTS. we clarify the correspondence of our cost described in Section 3. has the best correspondence to the perceptual scores. it is necessary to improve this cost function [25][86].

This chapter is organized as follows.5. the local degradation of naturalness. It might be assumed that the largest cost in the sequence. In Section 4.of the RMS cost and the size of the corpus. Therefore.3.10).2 Various Integrated Costs In the conventional segment selection [11][22][25][47][102]. the e ectiveness of utilizing the RMS cost in segment selection is discussed. various integrated costs for the segment selection are described. the optimum set of segments is selected from a speech corpus by minimizing the average cost given by Equation (3. Finally.2.4.e. a segment with a large cost can be included in the output sequence of segments even if it is optimal in view of the average cost. we summarize this chapter in Section 4. we present perceptual evaluations of the costs. In Section 4. The average cost shows the degradation of naturalness over the entire synthetic utterance. would have much e ect on the degradation of naturalness in synthetic speech. 4. In Section 4. let us de. i. To investigate this issue.

p . ti ( )gp # 1 p . When is set to 1.1) where denotes the number of targets in the utterance. (4. we utilize the norm cost. the norm cost is equal to the average cost. When is set to in. ti ( )g 1  .2) where denotes a power coecient. In order to evaluate the various integrated costs.ne the maximum cost as the integrated cost given by AC MC MC = max f i W C ui . given by N NC N Cp "1 X = 1 f N N i=1 W C ui . (4. i  N.

we . this norm cost takes into account both the mean value and the maximum value by varying the power coecient.nity. the norm cost is equal to the maximum cost. In this chapter. Thus.

nd an optimum value of the power coecient from the results of perceptual experiments. p p p 51 .

4. 22. 2. We synthesized 14.6. 0. 11. from which we selected a set of 140 stimuli so that the set covers a wide . In order to select a proper set of test stimuli. 8. 5.7.926 utterances that were not included in the corpus. 32).3.7. Each utterance consisted of a part of a sentence that was divided by a pause.1 Correspondence of cost to perceptual score We performed an opinion test on the naturalness of the synthetic speech. 1. a large number of utterances were synthesized by varying the corpus size from 0.5.3 Perceptual Evaluation of Cost 4.3. 16.4. 1.5 to 32 hours (0.8.4. 2.

Natural prosody and the mel-cepstrum sequences extracted from the original utterances were used as input information for segment selection.2. respectively. and the number of concatenations are roughly equal among the selected stimuli. This selection was performed under the restriction that the number of phonemes in an utterance. In order to alleviate audible discontinuity at the boundary between vowel and voiced phoneme.eld in terms of both average cost and maximum cost. concatenation at the preceding vowel center is also allowed in the segment selection.1 and Figure 4. The distribution of the average cost and the maximum cost for all synthetic utterances and selected test stimuli are shown in Figure 4. In the waveform synthesis. signal processing for prosody modi. the duration of an utterance.

cation was not performed except for power control. Therefore.2. we should use a di erent subcost on prosody pro from that described in Section 3. where the function has been determined in the case of performing prosody modi.2.

cation. In the case of not performing prosody modi.

variance = 1) for each listener in order to equalize the score range among listeners. These levels were determined by each listener. They evaluated the naturalness on a scale of seven levels.3 shows the correlation coecient between the norm cost and C P P C P 52 . we approximate pro by the function described in Appendix C. namely 1 (very bad) to 7 (very good). so the scores were distributed widely among the stimuli. Eight listeners participated in the experiment. The perceptual score. Therefore. however. was calculated as an average of the normalized score calculated as a Z-score (mean = 0. we have not determined any function .cation. here the MOS. Figure 4.

4 Average cost 0.4 Average cost 0.6 0. 53 .5 0.1 0.6 0.2 0. Distribution of average cost and maximum cost for all synthetic utterances.2 Maximum cost 1.5 1 0.5 1 0. 2 Maximum cost 1.7 Figure 4.2 0.2.3 0.3 0. Scatter chart of selected test stimuli.7 Figure 4.5 0.5 0 0 0.1 0.5 0 0 0.1.

p < : 54 . the norm cost. called the Root Mean Square (RMS) cost. . The absolute value of the correlation coecient in the case of the RMS cost is statistically larger than those in the cases of the average cost ( = 2 4696 = 137 0 05).75 -0.d f . the naturalness of synthetic speech is better estimated by considering both the degradation of naturalness over the entire synthetic speech and the local degradation of natup : : : t : . Therefore.-0.7 -0.65 -0.6 Norm cost Maximum cost 1 2 3 4 5 6 7 Power coefficient.8 -0.85 -0.4 shows the correlation between the average cost and the perceptual score. and Figure 4. p 8 9 10 Figure 4. when the power coecient is set to 2. Correlation coecient between norm cost and perceptual score as a function of power coecient. the naturalness of synthetic speech is better estimated by the degradation of naturalness over the entire synthetic utterance than by using only the local degradation of naturalness.3. Therefore. Moreover.9 Correlation coefficient -0. Figure 4. p the perceptual score as a function of the power coecient. The average cost ( = 1) has better correspondence to the perceptual scores (correlation coecient = 00 808) than does the maximum cost (correlation coecient = 00 685). has the best correspondence to the perceptual scores (correlation coecient = 00 840).5 shows the correlation between the maximum cost and the perceptual score.

2 Figure 4.5 Correlation coefficient = -0.5 0 -0. Perceptual score (Normalized MOS) 2 1.5 0 -0.5 Correlation coefficient = -0.6 1.5.2 0.Perceptual score (Normalized MOS) 2 1.4 0.1 Figure 4.5 1 0.685 Residual standard error = 0.4 0. 55 .6 0.808 Residual standard error = 0.3 0.8 1 1.4 Maximum cost 1.8 2 Regression line -2 0.5 1 0. Correlation between maximum cost and perceptual score.502 0.5 -1 -1.621 0.7 Regression line -2 0.5 Average cost 0.4. Correlation between average cost and perceptual score.2 1.5 -1 -1.6 0.

Correlation between RMS cost and perceptual score.3 0. Figure 4.8 0.6.7 0. It is worth noting that the naturalness of synthetic speech is better estimated by using both the total degradation of : 56 . we also performed multiple linear regression analysis by utilizing the norm costs while varying the power coecient from 1 to 10 and the maximum cost as predictor variables.6 RMS cost 0.2 0. since the RMS cost is a ected by both types of degradation.5 1 0.5 0.840 Residual standard error = 0. the correspondence to the perceptual scores was not improved statistically (multiple correlation coecient = 0 846) compared with the correspondence of the RMS cost.6 shows the correlation between the RMS cost and the perceptual score.4 0. Figure 4. In order to estimate the perceptual scores more accurately. Although the RMS cost is the best integrated cost in this experiment.8 shows the best correlation and the worst correlation in the results of all listeners. As a result.9 Regression line Figure 4.Perceptual score (Normalized MOS) 2 1.5 -2 0. Figure 4.1 Correlation coefficient = -0.5 0 -0.7 shows the correlation coecient between the RMS cost and the normalized opinion score for each listener.5 -1 -1. Moreover.462 0. The RMS cost can be converted into a perceptual score by utilizing the regression line. it is expected that the best power coecient depends on the correspondence of the local cost to perceptual characteristics. ralness.

4.3.2 Preference test on naturalness of synthetic speech

In order to clarify which of the two costs could select the better segment sequence, we performed a preference test on the naturalness of synthetic speech, comparing the naturalness of a segment sequence selected by the conventional average cost with that of another segment sequence selected by the novel RMS cost. The corpus size was 32 hours, and utterances used as test stimuli were not included in the corpus. Natural prosody and the mel-cepstrum sequences extracted from the original utterances were used as input information for segment selection. Signal processing for prosody modification was not performed except for power control.

Figure 4.9 shows an example of the local costs of a segment sequence selected by minimizing the average cost and of one selected by minimizing the RMS cost. Some large local costs, surrounded by circles, appear in the case of the average cost. On the other hand, such large local costs are alleviated in the case of the RMS cost.

Each stimulus pair is comprised of the segment sequence selected by the RMS cost and that selected by the average cost. The naturalness of synthetic speech was expected to be nearly equal between segment sequences having similar costs. Therefore, in order to fairly compare the performances of the two costs, we used pairs of segment sequences that had greater cost differences in the test. The difference in the RMS cost and that in the average cost were calculated as follows:

  Delta_RMSC = RMSC_ACsel - RMSC_RMSCsel,   (4.3)
  Delta_AC   = AC_ACsel - AC_RMSCsel,       (4.4)

where RMSC_ACsel and AC_ACsel denote the RMS cost and the average cost of the segment sequence selected by minimizing the average cost, respectively, and RMSC_RMSCsel and AC_RMSCsel denote the RMS cost and the average cost of the segment sequence selected by minimizing the RMS cost, respectively. We selected the pairs with the larger differences in the average cost as well as those with the larger differences in the RMS cost. "Sub-set A" includes stimulus pairs with larger differences in the RMS cost, while "sub-set B" includes stimulus pairs with larger differences in the average cost. A scatter chart of the test stimuli is shown in Figure 4.10.
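The following sketch shows how the cost differences of Equations (4.3) and (4.4) and the two stimulus sub-sets could be computed. It is only an illustration under the assumption that the local costs of both selected sequences are available; the exact pair-selection rule used in the thesis may differ.

```python
import numpy as np

def average_cost(local_costs):
    return float(np.mean(local_costs))

def rms_cost(local_costs):
    lc = np.asarray(local_costs, dtype=float)
    return float(np.sqrt(np.mean(lc ** 2)))

def cost_differences(lc_ac_selected, lc_rms_selected):
    """Equations (4.3) and (4.4): cost of the sequence chosen by the average
    cost minus cost of the sequence chosen by the RMS cost."""
    d_rms = rms_cost(lc_ac_selected) - rms_cost(lc_rms_selected)
    d_avg = average_cost(lc_ac_selected) - average_cost(lc_rms_selected)
    return d_rms, d_avg

def pick_subsets(pairs, n=20):
    """pairs: list of (local_costs_of_AC_selection, local_costs_of_RMS_selection).
    Sub-set A: pairs with the largest (positive) differences in RMS cost.
    Sub-set B: pairs with the largest-magnitude (negative) differences in average cost."""
    diffs = [cost_differences(ac, rms) for ac, rms in pairs]
    subset_a = sorted(range(len(pairs)), key=lambda i: diffs[i][0], reverse=True)[:n]
    subset_b = sorted(range(len(pairs)), key=lambda i: diffs[i][1])[:n]
    return set(subset_a), set(subset_b)
```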

Figure 4.7. Correlation coefficient between RMS cost and normalized opinion score for each listener.
Figure 4.8. Best correlation (left figure) and worst correlation (right figure) between RMS cost and normalized opinion score in the results of all listeners (correlation coefficients = -0.778 and -0.572, respectively).

Figure 4.9. Examples of local costs of segment sequences selected by the average cost and by the RMS cost. "Av." and "RMS" show the average and the root mean square of the local costs, respectively.

There were 20 stimulus pairs in each sub-set. Since 5 pairs were included in both sub-sets, the total number of stimulus pairs was 35. Eight Japanese listeners participated in the experiment. In each trial, synthetic speech produced by segment selection based on the average cost and that produced by segment selection based on the RMS cost were presented in random order, and listeners were asked to choose whichever of the two sounded more natural. The results in Figure 4.11 show that segment selection based on the RMS cost can synthesize speech more naturally than that based on the average cost in all cases: utilizing all stimuli, stimuli in sub-set A only, and stimuli in sub-set B only. However, this improvement is only slight.

Figure 4.10. Scatter chart of selected test stimuli.

4.3.3 Correspondence of RMS cost to perceptual score in lower range of RMS cost

We clarified the correspondence of the RMS cost to the perceptual scores in Section 4.3.1.

However, our TTS system utilizes a large-sized corpus that includes many segments with high coverage of both phonetic environment and prosody in order to synthesize speech more naturally and consistently. In the case of a large-sized corpus, the RMS costs are expected to be distributed not over a wide range but in a lower range, since segments causing only a slight degradation of naturalness can usually be selected. Thus, it is worthwhile to investigate the correspondence to the perceptual scores in a range of lower RMS costs. We performed an opinion test on the naturalness of the synthetic speech to clarify the correspondence of the RMS cost to the perceptual scores in this lower range. Test stimuli were included in the region covered when utilizing the 32-hour corpus, in which the RMS costs were less than 0.4.

Figure 4.11. Preference score.

The test stimuli were selected from a large number of utterances synthesized by varying the corpus size. This selection was performed under the restriction that the number of phonemes in an utterance, the duration of an utterance, and the number of concatenations were roughly equal among the selected stimuli. The number of selected stimuli was 160. Eight Japanese listeners participated in the experiment. They evaluated the naturalness on a scale of seven levels; these levels were determined by each listener, so the scores were distributed widely among the stimuli. The perceptual score was calculated as an average of the normalized score calculated as a Z-score (mean = 0, variance = 1) for each listener in order to equalize the score range among listeners. The correspondence of the RMS cost to the perceptual scores is shown in Figure 4.12. The correspondence is much worse (correlation coefficient = -0.400) than that in the case of utilizing stimuli that cover a wide range of the cost (correlation coefficient = -0.840). Therefore, it is obvious that the correspondence of the RMS cost is inadequate in this case and that we should improve the cost function.
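The per-listener Z-score normalization described above can be sketched as follows; this is a generic illustration, not the exact script used in the thesis.

```python
import numpy as np

def normalize_per_listener(raw_scores):
    """raw_scores: array of shape (num_listeners, num_stimuli) with the raw
    7-level ratings.  Each listener's scores are converted to Z-scores
    (mean 0, variance 1) so that the score range is equalized among listeners;
    the perceptual score of a stimulus is the average of the normalized
    scores over listeners."""
    raw = np.asarray(raw_scores, dtype=float)
    mean = raw.mean(axis=1, keepdims=True)
    std = raw.std(axis=1, keepdims=True)
    z = (raw - mean) / std
    return z.mean(axis=0)
```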

Figure 4.12. Correlation between RMS cost and perceptual score in lower range of RMS cost (correlation coefficient = -0.400, residual standard error = 0.723).

4.4 Segment Selection Considering Both Total Degradation of Naturalness and Local Degradation

In the conventional segment selection based on the average cost, the optimum segment sequence is selected while taking into account only the total degradation. In order to consider not only the total degradation but also the local degradation, we incorporated the RMS cost into segment selection. In this selection, the RMS cost is minimized, and it is given by

  RMSC = sqrt( (1/N) * sum_{i=1}^{N} { LC(u_i, t_i) }^2 ).   (4.5)

Actually, only the sum of the squared local costs is calculated for the selection.
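Since the number of target units N is fixed for a given utterance, minimizing Equation (4.5) is equivalent to minimizing the sum of squared local costs, which is additive and can therefore be searched with standard dynamic programming. The following is a minimal sketch of such a search; the candidate pruning, half-phoneme units, and concatenation rules of the actual system are omitted, and the cost callables are assumed to be supplied by the caller.

```python
def select_segments(candidates, target_cost, concat_cost, w_t=0.4, w_c=0.6):
    """candidates[i] is the list of candidate segments for the i-th target unit;
    target_cost(u, i) and concat_cost(prev, u) are assumed to return values in
    [0, 1].  The search minimizes the sum of SQUARED local costs, which (for a
    fixed N) also minimizes the RMS cost of Equation (4.5)."""
    n = len(candidates)
    # acc[k]: accumulated sum of squared local costs ending in candidate k.
    acc = [(w_t * target_cost(u, 0)) ** 2 for u in candidates[0]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        new_acc, new_back = [], []
        for u in candidates[i]:
            tc = w_t * target_cost(u, i)
            scores = [acc[j] + (tc + w_c * concat_cost(v, u)) ** 2
                      for j, v in enumerate(candidates[i - 1])]
            j_best = min(range(len(scores)), key=scores.__getitem__)
            new_acc.append(scores[j_best])
            new_back.append(j_best)
        acc, back = new_acc, back + [new_back]
    # Trace back the optimum segment sequence.
    k = min(range(len(acc)), key=acc.__getitem__)
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    return list(reversed(path))
```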

In this section, in order to represent the costs more easily, we divide the five sub-costs described in Section 3.2 into two commonly used costs, i.e., a target cost C_t and a concatenation cost C_c [11][22][47][102]. These costs are given by

  C_t(u_i, t_i) = (w_pro / w_t) * C_pro(u_i, t_i) + (w_app / w_t) * C_app(u_i, t_i),   (4.6)

  C_c(u_i, u_{i-1}) = (w_env / w_c) * C_env(u_i, u_{i-1}) + (w_spec / w_c) * C_spec(u_i, u_{i-1}) + (w_F0 / w_c) * C_F0(u_i, u_{i-1}),   (4.7)

  w_t = w_pro + w_app,   (4.8)

  w_c = w_env + w_spec + w_F0,   (4.9)

and then the local cost is written as

  LC(u_i, t_i) = w_t * C_t(u_i, t_i) + w_c * C_c(u_i, u_{i-1}),   (4.10)

  w_t + w_c = 1.   (4.11)
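As a small illustration of this weighting scheme, the following sketch combines placeholder sub-cost values into a target cost, a concatenation cost, and a local cost according to Equations (4.6)-(4.11). The sub-cost weight values used here are arbitrary placeholders, not the values used in the thesis.

```python
def local_cost(sub_costs, w_pro=0.5, w_app=0.5, w_env=0.4, w_spec=0.3, w_f0=0.3,
               w_t=0.4, w_c=0.6):
    """sub_costs: dict with keys 'pro', 'app', 'env', 'spec', 'f0', each in [0, 1].
    The sub-cost weights are illustrative placeholders."""
    c_t = (w_pro * sub_costs['pro'] + w_app * sub_costs['app']) / (w_pro + w_app)
    c_c = (w_env * sub_costs['env'] + w_spec * sub_costs['spec']
           + w_f0 * sub_costs['f0']) / (w_env + w_spec + w_f0)
    assert abs(w_t + w_c - 1.0) < 1e-9  # the two cost weights sum to one
    return w_t * c_t + w_c * c_c

print(local_cost({'pro': 0.2, 'app': 0.1, 'env': 0.0, 'spec': 0.3, 'f0': 0.1}))
```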

4.4.1 Effect of RMS cost on various costs

We investigated the effect of the RMS cost on the local cost, the target cost, and the concatenation cost. We compared the segment selection based on the RMS cost with that based on the average cost. We utilized 1,131 utterances as an evaluation set; these utterances were not included in the corpus used for segment selection. In the segment selection, concatenations at certain phoneme centers, i.e., preceding vowel centers of voiced phonemes and unvoiced fricative centers, were also allowed in order to alleviate audible discontinuity.

Figure 4.13 shows the local costs as a function of corpus size. In the segment selection based on the RMS cost ("RMS cost"), the standard deviation of the local cost is smaller than that of the segment selection based on the average cost ("Average cost"), although the mean of this cost is slightly worse. This is a consequence of the large penalty imposed on a segment with a large local cost in the case of the RMS cost. In order to clarify what causes the decrease in the standard deviation, we investigated the effects of the RMS cost on both the target cost and the concatenation cost. The target cost is shown in Figure 4.14 as a function of corpus size.

Figure 4.13. Local costs as a function of corpus size. Mean and standard deviation are shown.

The mean of the target cost is degraded, and its standard deviation increases slightly by utilizing the RMS cost. Figure 4.15 shows the concatenation cost as a function of corpus size. Although the means of the concatenation costs are equal between the average cost and the RMS cost, the standard deviation becomes smaller by utilizing the RMS cost. The increase in the standard deviation of the target cost is much smaller than the decrease in the standard deviation of the concatenation cost. On the other hand, the mean of the local cost is slightly worse as a consequence of the degradation of the target cost. These results show that the effectiveness of decreasing the standard deviation of the local cost is dependent on the concatenation cost. However, since we utilized a weight set in which the weight for the target cost was smaller than that for the concatenation cost, i.e., w_t = 0.4 and w_c = 0.6 in Equation (4.10), it might be assumed that these results would be influenced by the weights for the sub-costs rather than by the local degradation. Therefore, we tried to analyze the effects of utilizing other weight sets. The same results were obtained for all weight sets, in which the ratio of the weight for the target cost to that for the concatenation cost was varied.

Figure 4.14. Target cost as a function of corpus size.
Figure 4.15. Concatenation cost as a function of corpus size.

Therefore, the effectiveness mentioned above depends not on the weights for the sub-costs but on the function used to integrate the local costs, which can take account of the local degradation as well as the total degradation. The important point is to select the best segment sequence by considering not only the degradation caused by concatenation but also that caused by various other factors, e.g., the prosodic distance.

4.4.2 Effect of RMS cost on selected segments

In order to clarify what causes the decrease in the standard deviation of the concatenation cost, we investigated the characteristics of the segments selected by minimizing the RMS cost. The segment length in the number of phonemes is shown in Figure 4.16 as a function of corpus size, and Figure 4.17 shows the segment length in the number of syllables as a function of corpus size for reference. In calculating the segment length, the number of syllables for a half-phoneme in a diphone unit is set to 0.25 when the syllable is comprised of CV, or to 0.5 when the syllable is V. In the segment selection, concatenations at C-V boundaries (C: Consonant, V: Vowel) are prohibited. Moreover, we do not consider whether selected segments are connected in the corpus.

It can be seen that the segment length is unexpectedly short, only less than 1.4 syllables, although the standard deviation of the segment length is large, since our segment selection is essentially based on syllable units. It is also shown that the segment length becomes shorter as the corpus size increases to a level over 20 hours. This result is caused by pruning candidate segments to reduce the computational complexity of segment selection. We perform this pruning process, called pre-selection [29], by considering the target cost and the mismatch of phonetic environments. When we use a large-sized corpus that includes many candidate segments having the target phonetic environments, the remaining candidate segments are not always connected in the corpus even if these segments have the target phonetic environments. Therefore, it is not necessarily best to select the longest segments, since such segments often cause a decrease in the number of candidate segments and thus cannot always synthesize speech naturally. The segment length is shorter in the selection with the RMS cost compared with that in the case of the average cost, while the standard deviation is nearly equal. Namely, by utilizing the RMS cost, a larger number of segments with shorter lengths, which only cause slight local degradation, are selected.

Although the number of concatenations itself is small, it increases rather than decreases by utilizing the RMS cost. The rate of increase in the number of concatenations is shown in Figure 4.18 as a function of corpus size. This rate is calculated by dividing the number of concatenations in the case of utilizing the RMS cost by that in the case of utilizing the average cost. By utilizing the RMS cost, the concatenations at both a phoneme center ("Phoneme center") and a boundary between any phoneme and an unvoiced consonant ("* - Unvoiced consonant") increase, while the concatenation at a boundary between any phoneme and a voiced consonant ("* - Voiced consonant") decreases for any corpus size. Figure 4.19 shows the concatenation cost in each type of concatenation when the corpus size is 32 hours. The concatenation between any phoneme and an unvoiced consonant can often reduce the concatenation cost compared with that between any phoneme and a voiced consonant. It was also found that the concatenation at the phoneme center tends to reduce the concatenation cost compared with the other types of concatenation, since this type of concatenation has no discontinuity caused by concatenating F0s at a segment boundary. As a whole, these results show a tendency to avoid performing concatenations that cause much local degradation of naturalness by instead performing more concatenations that cause only slight audible discontinuity.

4.4.3 Relationship between effectiveness of RMS cost and corpus size

In order to clarify the relationship between the effectiveness of using the RMS cost and the corpus size, we investigated the differences in the average costs, RMS costs, and maximum costs between the segment sequences selected by utilizing the average cost and those selected by utilizing the RMS cost. The cost differences are calculated by subtracting the costs of the segment sequences selected by utilizing the RMS cost from those of the segment sequences selected by utilizing the average cost.

Figure 4.16. Segment length in number of phonemes as a function of corpus size.
Figure 4.17. Segment length in number of syllables as a function of corpus size.

Figure 4.18. Increase rate in the number of concatenations as a function of corpus size.
Figure 4.19. Concatenation cost in each type of concatenation. The corpus size is 32 hours. "*" denotes any phoneme.

Figure 4.20. Differences in costs as a function of corpus size.

The results are shown in Figure 4.20 as a function of corpus size. The differences in the maximum cost are positive, i.e., the maximum cost becomes small by utilizing the RMS cost. Therefore, the RMS cost works well for alleviating the local degradation of naturalness. Moreover, the differences in all costs have little dependence on the corpus size. From these results, the effectiveness of utilizing the RMS cost can be found in a corpus of any size.

4.4.4 Evaluation of segment selection by estimated perceptual score

The performance of segment selection is indicated by the cost. However, it is difficult to estimate the naturalness of synthetic speech from the cost value directly. In order to indicate the performance of segment selection in a more intuitive quantity than the cost, we converted the cost value into a perceptual score by using the regression line on the RMS cost shown in Figure 4.6, as described in Section 4.3. Chu and Peng have also estimated the MOS from their cost [25][86]. The estimated perceptual score is shown in Figure 4.21 as a function of corpus size. Small-sized corpora have been constructed so that various phonetic contexts are included.

Figure 4.21. Estimated perceptual score as a function of corpus size.

As the corpus size becomes larger, the estimated perceptual score becomes higher and its standard deviation becomes smaller. This result means that the quality of the segment selection becomes higher and more consistent by utilizing the larger corpus.

4.5 Summary

In segment selection for concatenative TTS, it is important to utilize a cost that corresponds to the perceptual characteristics. In this chapter, we evaluated a cost for segment selection based on a comparison with perceptual scores determined from the results of perceptual experiments on the naturalness of synthetic speech.
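The conversion from RMS cost to an estimated perceptual score used above relies on the regression line of Figure 4.6. As a hedged illustration, the following sketch fits such a line from (cost, score) data and applies it; the numeric coefficients are not given in the text, so they must be estimated from listening-test data, and the example below uses dummy data only.

```python
import numpy as np

def fit_regression(rms_costs, perceptual_scores):
    """Least-squares line mapping RMS cost to normalized perceptual score."""
    slope, intercept = np.polyfit(rms_costs, perceptual_scores, deg=1)
    return slope, intercept

def estimate_score(rms_cost, slope, intercept):
    return slope * rms_cost + intercept

# Example with dummy data.
rng = np.random.default_rng(1)
costs = rng.uniform(0.2, 0.8, 140)
scores = -3.0 * costs + 1.5 + rng.normal(scale=0.4, size=140)
slope, intercept = fit_regression(costs, scores)
print(estimate_score(0.35, slope, intercept))
```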

As a result, we clarified that the average cost, which captures the degradation of naturalness over the entire synthetic speech, has a better correspondence to the perceptual scores than does the maximum cost, which captures the local degradation of naturalness. Furthermore, we found that the RMS (Root Mean Square) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also clarified that the naturalness of synthetic speech could be slightly improved by utilizing the RMS cost.

We then investigated the effect of considering not only the degradation of naturalness over the entire synthetic speech but also the local degradation in segment selection. In this selection, the optimum segment sequences are selected by minimizing the RMS cost instead of the conventional average cost. From the results of experiments comparing this approach with segment selection based on the average cost, it was found that segment selection based on the RMS cost performed a larger number of concatenations that caused slight local degradation in order to avoid concatenations causing greater local degradation. Namely, a larger number of segments with shorter units were selected. Moreover, the effectiveness of this selection was found for any size of corpus. Furthermore, we evaluated the performance of segment selection by the estimated perceptual score while varying the corpus size; the quality of the segment selection was higher and more consistent by utilizing the larger corpus.

We also performed a perceptual evaluation of the RMS cost in a lower range of the RMS cost. When the RMS costs were distributed widely, it was possible to accurately estimate the perceptual score from the RMS cost for synthetic speech of various qualities, and the correspondence of the RMS cost seemed to be good. However, the results clarified that the correspondence of the RMS cost to the perceptual scores is inadequate in the lower range; it is obvious that the RMS cost is not accurate enough for making comparisons between similar segments. Therefore, we should further improve the cost function based on perceptual characteristics, since our TTS does not yet consistently synthesize sufficiently natural speech. In particular, it is necessary to determine the optimum weight set for the sub-costs, which is naturally a difficult problem. We will determine this weight set from the results of perceptual experiments on the naturalness of synthetic speech with a set of stimuli covering a wide range in terms of the individual sub-costs.

Chapter 5

A Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping

Speech of various speakers can be synthesized by utilizing a voice conversion technique that can control speaker individuality. The voice conversion algorithm based on the Gaussian Mixture Model (GMM), which is a conventional statistical voice conversion algorithm, can convert speech features continuously by using the correlation between a source feature and a target feature. However, the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by the statistical averaging operation. In this chapter, we propose a GMM-based algorithm with Dynamic Frequency Warping (DFW) to avoid such over-smoothing. In the proposed algorithm, the converted spectrum is calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum to avoid deterioration of spectral conversion-accuracy. Results of evaluation experiments clarify that the converted speech produced by the proposed algorithm has a higher quality than that produced by the GMM-based algorithm, and that the conversion-accuracy for speaker individuality is the same as that of the GMM-based algorithm.

5.1 Introduction

The speech of various speakers can be synthesized by preparing speech corpora of the various speakers. However, in order to synthesize natural speech consistently, each corpus must include a sufficient amount of waveform segments having various phonetic environments and prosodic variety. Therefore, it is necessary to record a large number of speech utterances for each speaker. Furthermore, recording speech is very difficult and expensive work, and it needs an enormous amount of time [63]. Another approach for synthesizing the speech of various speakers is speech modification by a voice conversion technique used to convert one speaker's voice into another speaker's voice [68]. In voice conversion, conversion rules are extracted in advance from training data comprised of utterance pairs of a source speaker and a target speaker.

By utilizing the conversion rules, any utterance of the source speaker can be converted into that of the target speaker. A large amount of speech data is not required in the training. Therefore, speech of any speaker who is regarded as the target speaker can be synthesized from the speech of a specific speaker who is regarded as the source speaker by only preparing a small number of the target speaker's utterances to extract the conversion rules.

Therefore, by applying voice conversion to corpus-based TTS, it is possible to easily synthesize any utterance of any speaker. In voice conversion, it is important to realize both high speech quality and high conversion-accuracy for speaker individuality. Although we can synthesize high-quality speech using a simple conversion rule, e.g., converting only F0, such converted speech cannot sound like that of the target speaker. Moreover, even if the converted speech sounds like that of the target speaker, the speech quality is not always high; e.g., in telephone speech, speaker individuality is kept although the speech quality is degraded considerably. Therefore, we need to evaluate the performance of voice conversion in terms of both speech quality and conversion-accuracy for speaker individuality. In this thesis, we aim to develop a high-quality voice conversion system. To continuously represent the acoustic space of a speaker, a voice conversion algorithm based on the Gaussian Mixture Model (GMM) has been proposed by Stylianou et al. [98][99].

In this GMM-based algorithm, the acoustic space is modeled by the GMM without the use of vector quantization, and acoustic features are converted from a source speaker to a target speaker by a mapping function based on the feature-parameter correlation between the two speakers. However, the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by statistical processing. In this chapter, we propose a GMM-based algorithm with Dynamic Frequency Warping (DFW) [114] to avoid over-smoothing. Moreover, the converted spectrum in the proposed algorithm is calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum in order to avoid the deterioration of spectral conversion-accuracy. We clarify through experimental evaluation that the proposed algorithm can improve the naturalness of synthetic speech.

The quality of an analysis-synthesis method is also important for achieving high-quality voice conversion, since voice conversion is usually performed with an analysis-synthesis method. As a high-quality analysis-synthesis method, STRAIGHT, which provides a high-quality vocoder-type algorithm, has been proposed by Kawahara et al. [58][59].

This chapter is organized as follows. In Section 5.2, the GMM-based voice conversion algorithm applied to STRAIGHT is described. In Section 5.3, the proposed voice conversion algorithm is described. In Section 5.4, the effect of mixing converted spectra is described. In Section 5.5, experimental evaluation is described. Finally, we summarize this chapter in Section 5.6.

5.2 GMM-Based Conversion Algorithm Applied to STRAIGHT

The mel-cepstrum of the smoothed spectrum analyzed by STRAIGHT [58] is used as an acoustic feature. In this chapter, the sampling frequency is 16000 Hz and the order of the mel-cepstrum is set to 40, i.e., the quefrency is 2.5 ms. In order to perform voice conversion, the 1st to 40th order cepstral coefficients are converted.

The 0th-order cepstral coefficient is converted so that the power of the converted speech becomes equal to that of the original speech. In the training of the mapping function, we use 50 utterance-pairs comprised of the source speaker's speech and the target speaker's speech. The GMM parameters for spectral conversion are trained as follows:

Step 1: All utterances are converted into mel-cepstrum sequences, and silence frames are removed by performing silence detection based on power information. In each pair, the correspondence between the mel-cepstrum sequences extracted from the two speakers' utterances is determined by using Dynamic Time Warping (DTW).

Step 2: The GMM parameters for spectral conversion are trained with the time-aligned mel-cepstrum sequences.

Step 3: Differences between the source speaker's speech and the target speaker's speech have a bad influence on the accuracy of the time-alignment. Therefore, the source speaker's utterances are converted toward the target speaker's in order to perform more accurate time-alignment between the mel-cepstrum sequences. Then, the mel-cepstral distortion between the two speakers is calculated.

Step 4: Steps 2 and 3 are repeated to refine the GMM parameters until the mel-cepstral distortion calculated in Step 3 is saturated.

As a feature of source information, the log-scaled fundamental frequency (F0) extracted by STRAIGHT [59] is used.
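The iterative structure of Steps 1-4 can be outlined as follows. This is only a structural sketch under the assumption that helper functions for DTW alignment, GMM training, and GMM-based conversion are available; it is not the thesis implementation.

```python
import numpy as np

def train_spectral_conversion(src_utts, tgt_utts, train_gmm, convert, dtw_align,
                              max_iters=10, tol=1e-3):
    """Assumed helpers: train_gmm(list_of_frame_pairs) fits the GMM,
    convert(model, mcep_seq) maps a source mel-cepstrum sequence toward the
    target speaker, and dtw_align(a, b) returns (aligned_frame_pairs, distortion)."""
    # Step 1: initial alignment on the raw mel-cepstrum sequences.
    aligned = [dtw_align(s, t) for s, t in zip(src_utts, tgt_utts)]
    prev_dist = np.mean([d for _, d in aligned])
    model = None
    for _ in range(max_iters):
        # Step 2: train the GMM on the time-aligned frame pairs.
        model = train_gmm([pairs for pairs, _ in aligned])
        # Step 3: re-align using the converted source utterances.
        aligned = [dtw_align(convert(model, s), t) for s, t in zip(src_utts, tgt_utts)]
        dist = np.mean([d for _, d in aligned])
        # Step 4: stop once the mel-cepstral distortion has saturated.
        if prev_dist - dist < tol:
            break
        prev_dist = dist
    return model
```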

The GMM parameters for F0 conversion are trained as follows:

Step 1: F0s are extracted from all utterances. In each pair, the correspondence of the F0 sequences from the two speakers is determined by using the time-alignment of the mel-cepstrum sequences determined in the training of the spectral conversion. Then, frame-pairs including unvoiced frames are removed from the time-aligned F0 sequences.

Step 2: The GMM parameters for F0 conversion are trained with the time-aligned F0 sequences.

5.2.1 Evaluation of spectral conversion-accuracy of GMM-based conversion algorithm

We evaluated the spectral conversion-accuracy of the GMM-based algorithm by the mel-cepstral distortion given by Equation (3.8), that is, the mel-cepstral distortion between the source speaker's speech and the target speaker's speech. We performed voice conversions between male A and male B, between female A and female B, between male A and female A, and between male B and female B. In each case, we performed two kinds of conversion by alternating the speakers as the source speaker and the target speaker. The number of Gaussian mixtures is set to 64, and the frame shift is set to 5 ms. The evaluation set was comprised of 50 utterances that were not included in the training data. The mel-cepstral distortion was calculated for every frame-pair after the silence frames were removed.

The results (Figure 5.1) clarify that the GMM-based voice conversion can decrease the acoustic distance. The distances between speakers of opposite sex are larger than those between speakers of the same sex in the case of original speech, shown as "Before conversion." However, it can be seen that the distances become almost equal in all cases of voice conversion, shown as "After conversion."

5.2.2 Shortcomings of GMM-based conversion algorithm

In the GMM-based algorithm applied to STRAIGHT, the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by the statistical averaging operation. Figure 5.2 shows an example of the GMM-based converted spectrum and the target speaker's spectrum.
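The exact distortion measure used above is Equation (3.8) of the thesis, which is not reproduced here. As an assumption, a commonly used form of the mel-cepstral distortion in dB is shown below; it is intended only to make the evaluation measure concrete.

```python
import numpy as np

def mel_cepstral_distortion(mc_x, mc_y):
    """Common mel-cepstral distortion (dB) between two time-aligned mel-cepstrum
    sequences of shape (frames, order), excluding the 0th coefficient.
    This standard formula is an assumption; the thesis uses Equation (3.8)."""
    diff = np.asarray(mc_x)[:, 1:] - np.asarray(mc_y)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```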

As shown in this figure, there is over-smoothing in the GMM-based converted spectrum. As mentioned in Section 2.3, most statistical approaches to speaker adaptation can be applied to voice conversion. In speech recognition, it is not necessary to preserve the information removed by statistical processing. However, in speech synthesis, such information is important for synthesizing natural speech. Therefore, it is necessary to improve the adaptation algorithms in order to achieve high-quality voice conversion.

Figure 5.1. Mel-cepstral distortion. Mean and standard deviation are shown.
Figure 5.2. Example of spectrum converted by GMM-based voice conversion algorithm ("GMM-converted spectrum") and target speaker's spectrum ("Target spectrum").

Figure 5.3. GMM-based voice conversion algorithm with Dynamic Frequency Warping.

5.3 Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping

To avoid over-smoothing, we propose a GMM-based algorithm with DFW. An overview of the proposed algorithm is shown in Figure 5.3. This procedure is performed in each frame.

5.3.1 Dynamic Frequency Warping

Spectral conversion is performed with DFW [72][114] to avoid over-smoothing of the converted spectrum. In this technique, the correspondence between an original frequency axis and a converted frequency axis is represented by a warping function. This function is calculated as a path minimizing a normalized mel-spectral distance between the STRAIGHT logarithmic mel-spectrum of the source speaker and the GMM-based converted logarithmic mel-spectrum.

The accumulated distance g(i, j) in the grid (i, j) is computed by dynamic programming as

  g(i, j) = min[ g(i-1, j-2) + 2 d(i, j-1) + d(i, j),
                 g(i-1, j-1) + 2 d(i, j),
                 g(i-2, j-1) + 2 d(i-1, j) + d(i, j) ],   (5.1)

  d(i, j) = ( 20 log10 |S_x(i)| - 20 log10 |S_y(j)| )^2,   (5.2)

and the normalized mel-spectral distance D(i, j) is given by

  D(i, j) = g(i, j) / (i + j),   (5.3)

where d(i, j) denotes the mel-spectral distance in the grid (i, j), S_x(i) denotes the mel-spectrum of the source speaker at the i-th bin, and S_y(j) denotes the mel-spectrum converted by the GMM-based algorithm at the j-th bin. In this chapter, the number of FFT points is set to 1024, and the number of bins used to calculate the mel-spectral distance is 513. Figure 5.4 shows an example of the frequency warping function for a certain frame. Unnatural sound is often caused by synthesizing with DFW-based converted spectra, since such spectra have unnatural changing parts. These parts can be removed by performing smoothing or liftering; therefore, we perform the smoothing processing used in STRAIGHT [58].

5.3.2 Mixing of converted spectra

The spectral conversion-accuracy of DFW is slightly worse than that of the GMM-based algorithm because spectral intensity cannot be converted. In the proposed algorithm, we therefore propose that the converted spectrum be calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum in order to avoid the deterioration of the spectral conversion-accuracy. The converted spectrum S_c(f) is written as

  |S_c(f)| = exp[ w(f) ln|S_g(f)| + {1 - w(f)} ln|S_d(f)| ],  subject to 0 <= w(f) <= 1,   (5.4)
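As a hedged illustration of the warping computation of Equations (5.1)-(5.2), the sketch below runs the slope-constrained dynamic programming for one frame and backtracks the warping path. Boundary handling is simplified and the path-length normalization of Equation (5.3) is omitted; the inputs are assumed to be log10 mel-spectrum magnitudes.

```python
import numpy as np

def dfw_path(log_spec_src, log_spec_gmm):
    """Dynamic frequency warping between the source log mel-spectrum and the
    GMM-converted log mel-spectrum of one frame.  Returns the warping path as
    (source_bin, converted_bin) pairs."""
    x = 20.0 * np.asarray(log_spec_src)   # 20 log10|S_x(i)|
    y = 20.0 * np.asarray(log_spec_gmm)   # 20 log10|S_y(j)|
    n, m = len(x), len(y)
    d = (x[:, None] - y[None, :]) ** 2    # local distance d(i, j)
    g = np.full((n, m), np.inf)           # accumulated distance g(i, j)
    back = np.zeros((n, m, 2), dtype=int)
    g[0, 0] = d[0, 0]
    for i in range(1, n):
        for j in range(1, m):
            moves = [
                (g[i - 1, j - 2] + 2 * d[i, j - 1] + d[i, j], (i - 1, j - 2)) if j >= 2 else (np.inf, None),
                (g[i - 1, j - 1] + 2 * d[i, j],               (i - 1, j - 1)),
                (g[i - 2, j - 1] + 2 * d[i - 1, j] + d[i, j], (i - 2, j - 1)) if i >= 2 else (np.inf, None),
            ]
            best, prev = min(moves, key=lambda t: t[0])
            g[i, j] = best
            back[i, j] = prev if prev is not None else (0, 0)
    # Backtrack from the top-right corner to the origin.
    path, i, j = [], n - 1, m - 1
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = back[i, j]
    path.append((0, 0))
    return path[::-1]
```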

where S_d(f) and S_g(f) denote the DFW-based converted spectrum and the GMM-based converted spectrum, respectively, and w(f) denotes the weight for mixing the spectra. As the mixing-weight becomes closer to 1, the converted spectrum becomes closer to the GMM-based converted spectrum. The results of preliminary experiments clarified that the quality of the converted speech is degraded considerably when the spectrum is excessively smoothed in the low-frequency regions [110].

Figure 5.4. Example of frequency warping function.

Therefore, we use the following mixing-weight:

  w(f) = 2f/f_s + (2/pi) * arctan[ a sin(2 pi f / f_s) / (1 - a cos(2 pi f / f_s)) ],
  subject to 0 <= w(f) <= 1,  0 <= f <= f_s/2,  -1 < a < 1,   (5.5)

where f_s denotes the sampling frequency and a denotes the parameter that changes the mixing-weight. Figure 5.5 shows the variations of the mixing-weights that correspond to different values of the parameter a.

Figure 5.5. Variations of mixing-weights that correspond to the different parameters a.

Figure 5.6 shows an example of the spectra converted by the GMM-based algorithm ("GMM"), by the GMM-based algorithm with DFW ("GMM & DFW"), and by the GMM-based algorithm with DFW and the mixing of converted spectra when the parameter a is set to 0 ("GMM & DFW & Mix of spectra"). The converted spectrum produced by the GMM-based algorithm with DFW has clearer peaks than that produced by the GMM-based algorithm. The mixed converted spectrum is close to the converted spectrum of the GMM-based algorithm with DFW in the low-frequency regions but close to that of the GMM-based algorithm in the high-frequency regions.

5.4 Effectiveness of Mixing Converted Spectra

Evaluation experiments were performed to investigate the effects of the mixing-weight. Male-to-male voice conversion and female-to-female voice conversion were performed. The amount of training data was set to 58 sentences; the total duration of this data is about 4 or 5 minutes. We performed objective experiments on spectral conversion-accuracy and subjective experiments on speech quality and speaker individuality, and 10 listeners participated in each subjective experiment.

Figure 5.6. Example of converted spectra by the GMM-based algorithm ("GMM"), the proposed algorithm without the mixing of converted spectra ("GMM & DFW"), and the proposed algorithm with the mixing of converted spectra ("GMM & DFW & Mix of spectra", a = 0).

5.4.1 Effect of mixing-weight on spectral conversion-accuracy

In order to investigate the effects of the mixing-weight on spectral conversion-accuracy, we performed an experimental evaluation of the mel-cepstral distortion given by Equation (3.8) between the converted speech and the target speech. Ten sentences that were not included in the training data were used in the evaluation. The experimental results shown in Figure 5.7 clarify that the conversion-accuracy of the GMM-based algorithm with DFW ("GMM & DFW") is worse than that of the GMM-based algorithm ("GMM"). It is also shown that the deterioration of the conversion-accuracy can be alleviated by mixing the converted spectra. As the parameter a of the mixing-weight becomes closer to 1, the mel-cepstral distortion of the GMM-based algorithm with DFW and the mixing of converted spectra ("GMM & DFW & Mix of spectra") becomes smaller.
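The mixing operation evaluated above is given by Equations (5.4) and (5.5). The sketch below implements it in the log-amplitude domain under the assumption that the reconstructed form of Equation (5.5) shown earlier is correct; the spectra in the example are dummy data.

```python
import numpy as np

def mixing_weight(freqs, fs, a):
    """Frequency-dependent mixing weight following the reconstructed Equation (5.5),
    controlled by a (-1 < a < 1).  a = 0 gives a weight rising linearly from 0 at
    0 Hz to 1 at fs/2."""
    w = 2.0 * freqs / fs + (2.0 / np.pi) * np.arctan(
        a * np.sin(2.0 * np.pi * freqs / fs) / (1.0 - a * np.cos(2.0 * np.pi * freqs / fs)))
    return np.clip(w, 0.0, 1.0)

def mix_spectra(spec_dfw, spec_gmm, freqs, fs, a=0.0):
    """Converted amplitude spectrum of Equation (5.4): log-domain interpolation
    between the DFW-based and GMM-based converted spectra."""
    w = mixing_weight(freqs, fs, a)
    return np.exp(w * np.log(spec_gmm) + (1.0 - w) * np.log(spec_dfw))

fs = 16000
freqs = np.linspace(0, fs / 2, 513)
spec_gmm = np.ones(513) * 2.0   # dummy amplitude spectra
spec_dfw = np.ones(513) * 4.0
print(mix_spectra(spec_dfw, spec_gmm, freqs, fs, a=0.0)[:3])
```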

i. the preference (XAB) test was performed.e.7. \Original speech of source" shows the mel-cepstral distortion before conversion.2 4. a distortion in the GMM-based algorithm with DFW and the mixing of converted spectra (\GMM & DFW & Mix of spectra") becomes small. X was the synthesized speech by converting 0 and replacing the source speaker's spectra with those of the target speaker. Mel-cepstral distortion as a function of parameter of mixing-weight.6 5. Two sentences that were not included in the training data were used in the evaluation.4 6 5.2 Mel-cepstral distortion [dB] 6. In each trial.8 6. X was presented .5 0 Parameter.8 -1 Original speech of source ( Male Female) GMM ( Male-to-male Female-to-female) GMM & DFW ( Male-to-male Female-to-female) GMM & DFW & Mix of spectra ( Male-to-male Female-to-female) -0. A and B were the converted speech by the voice conversion algorithms. a 0. the perfect spectral conversion.5 1 Figure 5.4. a 5. In the XAB test.2 Preference tests on speaker individuality In order to evaluate the relationship between the conversion-accuracy for speaker individuality and the parameter of the mixing-weight.7. This is because the converted spectra by the proposed algorithm includes more components of the GMM-based converted spectra than those of the DFW-based converted spectra when the parameter is near 1.

rst. and A a F 84 .

9. the conversion-accuracy becomes equal to that of the GMM-based algorithm by mixing the converted spectra (\GMM & DFW & Mix of spectra").4. Four sentences that were not included in the training data were used in the evaluation.and B were presented in random order. The converted speech quality of the GMM-based algorithm with DFW is signi.8. This demonstrates that the mel-cepstral distortion cannot capture the conversionaccuracy for speaker individuality with sucient accuracy. In each trial. the preference test was performed. The experimental results are shown in Figure 5. However. Listeners were asked to choose the converted speech having better speech quality. The preference score of the proposed voice conversion algorithm compared with the GMM-based algorithm is shown in Figure 5. The tendency of these results is similar to that of the experimental results for the mel-cepstral distortion. The conversion-accuracy for speaker individuality of the GMM-based algorithm with DFW (\GMM & DFW") tends to be slightly worse than that of the GMM-based algorithm. However. 5. the conversion-accuracy for speaker individuality is not always degraded. stimuli synthesized by the two voice conversion algorithms were presented in random order. Listeners were asked to choose either A or B as being the most similar to X.3 Preference tests on speech quality In order to evaluate the relationship between the quality of the converted speech and the parameter of the mixing-weight. even if the mel-cepstral distortion is degraded.

These results show the opposite tendency to that of the conversion-accuracy for speaker individuality: as the mixing-weight parameter a becomes closer to 1, the converted speech quality of the GMM-based algorithm with DFW and the mixing of the converted spectra ("GMM & DFW & Mix of spectra") becomes worse compared with the GMM-based algorithm with DFW. The results of the two experiments clarify that both the converted speech quality and the conversion-accuracy for speaker individuality show good performance when the parameter a of the mixing-weight is set to 0 or -0.5.

Figure 5.8. Relationship between conversion-accuracy for speaker individuality and parameter of mixing-weight. A preference score of 50% shows that the conversion-accuracy is equal to that of the GMM-based algorithm.

5.5 Experimental Evaluation

In order to evaluate the performance of the proposed algorithm, we performed subjective evaluation experiments on speaker individuality and speech quality. The parameter a was set to 0 and -0.5 in the male-to-male and the female-to-female voice conversions, respectively, which provides good performance in terms of both speaker individuality and speech quality.

5.5.1 Subjective evaluation of speaker individuality

In order to evaluate the conversion-accuracy for speaker individuality of the proposed algorithm, a preference (ABX) test was performed. The experimental conditions were the same as those described in the previous section. In the ABX test, A and B were the source and target speaker's speech, respectively, and X was one of the following types of converted speech:

- converted speech by the proposed algorithm ("Proposed algorithm"),
- converted speech by the GMM-based algorithm ("GMM"),
- synthesized speech produced by performing only the F0 conversion ("F0 only"),
- synthesized speech produced by performing the F0 conversion and replacing the source speaker's spectrum with that of the target speaker ("F0 & Spectrum"),
- the target speaker's speech synthesized by STRAIGHT ("STRAIGHT").

"F0 & Spectrum" was used to evaluate the conversion-accuracy for speaker individuality when the spectral conversion was performed perfectly.

Figure 5.9. Relationship between converted speech quality and parameter of mixing-weight. A preference score of 50% shows that the converted speech quality is equal to that of the GMM-based algorithm with DFW.

"STRAIGHT" was used to evaluate the conversion-accuracy when both the spectral conversion and the conversion of source information were performed perfectly. Two sentences that were not included in the training data were used in the evaluation. In each trial, A and B were presented in random order, and then X was presented. Listeners were asked to choose either A or B as being the most similar to X.

The experimental results are shown in Figure 5.10. The conversion-accuracy for speaker individuality of the proposed algorithm ("Proposed algorithm") is the same as that of the GMM-based algorithm ("GMM"). The results clarify that the conversion-accuracy of only the F0 conversion ("F0 only") is insufficient, and that it can be improved by converting the spectra. The difference in the conversion-accuracy between "F0 & Spectrum" and the proposed algorithm shows that the spectral conversion is not yet sufficient. Moreover, the difference between "STRAIGHT" and "F0 & Spectrum" shows that the conversion of prosody should be improved in order to achieve better performance.

Figure 5.10. Correct response for speaker individuality.

5. 3: fair. 5.95% confidence interval 5 Standard deviation Male-to-male Female-to-female Total 30 4 MOS 3 20 2 10 1 Original Speech STRAIGHT GMM Proposed algorithm Figure 5.11. 2: poor. 4: good.11. 1: bad). an opinion test was performed.2 Subjective evaluation of speech quality In order to evaluate the quality of the converted speech by the proposed algorithm. Four sentences that were not included in the training data were used in the evaluation. The experimental results are shown in Figure 5. The converted speech quality by the proposed algorithm (\Proposed algorithm") is signi. Mean Opinion Score (\MOS") for speech quality. An opinion score for evaluation was set to a 5point scale (5: excellent.

Therefore.cantly better than that of the GMM-based algorithm (\GMM") because over-smoothing of the converted spectrum can be avoided in the proposed algorithm. it is necessary to improve both the quality of the voice conversion algorithm and that of the analysis-synthesis method in order to achieve higher-quality voice conversion. the converted speech quality is degraded compared with that of the synthetic speech by STRAIGHT analysis-synthesis. it is shown that the quality of STRAIGHT cannot reach that of the original speech. The results of the two subjective evaluations clarify that the proposed algo89 Equivalent Q value [dB] . However. Moreover.

90 . which causes the degradation of speech quality. which is a statistical algorithm. the converted spectrum is excessively smoothed by statistical processing. However. it is necessary to improve not only the conversion algorithm but also the speech analysis-synthesis method. we proposed a voice conversion algorithm based on the Gaussian Mixture Model (GMM) with Dynamic Frequency Warping (DFW) of the STRAIGHT spectrum. the performance of the proposed voice conversion is not high enough.6 Summary In this chapter. prosody conversion. i. However. the conversion of source information. We also proposed a converted spectrum calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum. The GMM-based voice conversion algorithm. The proposed algorithm can avoid the typical over-smoothing of a converted spectrum. is needed since prosody also contains important information on speaker individuality. In order to reach a practical level of performance. we performed experimental evaluations of speech quality and speaker individuality by comparison with the conventional GMM-based algorithm. 5. Moreover.rithm can improve the naturalness of synthetic speech while maintaining equal conversion-accuracy on speaker individuality compared with the GMM-based algorithm. In order to evaluate the proposed algorithm. The experimental results revealed that the proposed algorithm can improve converted speech quality while maintaining equal conversion-accuracy for speaker individuality compared with the GMM-based algorithm.e. can convert speech features continuously with the correlation between a source feature and a target feature.

We . However. Furthermore. The other is how to improve the control of speaker individuality in order to realize a more exible speech synthesis. we addressed two problems in speech synthesis in this thesis.Chapter 6 Conclusions 6. One is how to improve the naturalness of synthetic speech in corpus-based TTS. the exibility of TTS is still inadequate. In order to improve the performance of corpus-based TTS.1 Summary of the Thesis Corpus-based Text-to-Speech (TTS) enables us to dramatically improve the naturalness of synthetic speech over that of rule-based TTS. so far no general-purpose TTS has been developed that can consistently synthesize suciently natural speech.

speech synthesized with CV units sometimes have auditory discontinuity due to V-V and V-semivowel concatenations. CV units are often used in concatenative TTS systems for Japanese. Since various vowel sequences appear frequently in Japanese. The various techniques in each module were reviewed. Since Japanese syllables consist of CV (C: Consonant or consonant cluster. V: Vowel or syllabic nasal /N/) or V. In Chapter 3. We also described some conventional statistical voice conversion algorithms and compared conversion functions of the algorithms. Almost all corpus-based TTS systems have been developed on the basis of this structure. we proposed a novel segment selection algorithm for Japanese speech synthesis in order to improve the naturalness of synthetic speech. it 91 . However.rst described the structure of a corpus-based TTS system in Chapter 2. except when a vowel is devoiced.

In order to address this problem.is not realistic to prepare long units that include all possible vowel sequences to avoid V-V concatenation. we proposed a novel segment selection algorithm that does not avoid the concatenation of the vowel sequences but alleviates the discontinuity by utilizing both phoneme and diphone units. The experiments on concatenation of vowel sequences clari. non-uniform units allowing concatenation not only at phoneme boundaries but also at vowel centers can be selected from a speech corpus. In the proposed algorithm.

Moreover. From the results of perceptual experiments. we clari. In order to achieve high-quality segment selection for concatenative TTS. We also performed perceptual experiments.ed that the number of better candidate segments increases by considering concatenations both at phoneme boundaries and at vowel centers. The results showed that speech synthesized with the proposed algorithm has better naturalness than that of the conventional algorithms. we performed a perceptual evaluation of costs for segment selection. it is important to utilize a cost that corresponds to the perceptual characteristics. We also compared the proposed algorithm with the algorithm based on half-phoneme units. In Chapter 4. a cost function for selecting the optimum waveform segments in our TTS system was described.

it was clari.ed the correspondence of the cost to the perceptual scores and then evaluated various functions to integrate local costs capturing degradation in individual segments. As a result.

has better correspondence to the perceptual scores than the maximum cost. We also showed that the naturalness of synthetic speech can be slightly improved by utilizing the RMS cost. a larger number of concatenations causing slight local degradation were performed in order to avoid concatenations causing greater local degradation. which captures the degradation of naturalness over the entire synthetic speech. the RMS (Root Mean Square) cost taking account of both average cost and maximum cost has the best correspondence. Then. it was found that (1) in segment selection based on the RMS cost. From the results of experiments comparing this approach with segment selection based on the conventional average cost. we investigated the e ect of using the RMS cost for segment selection. which captures the local degradation of naturalness. and (2) the e ect of the RMS cost has little dependence on the size of the corpus.ed that the average cost. Furthermore. We also clari.

ed that utilizing a larger corpus improves the quality and 92 .

which is one of the conventional algorithms. However. In Chapter 5. In order to evaluate the proposed algorithm. the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by the statistical averaging operation. we proposed the GMM-based algorithm with Dynamic Frequency Warping (DFW). In summary. To overcome this problem. The voice conversion algorithm based on the Gaussian Mixture Model (GMM). we con. we performed experimental evaluation of speech quality and speaker individuality by comparison with the conventional GMM-based algorithm. The experimental results revealed that the proposed algorithm can improve converted speech quality while maintaining equal conversion-accuracy for speaker individuality compared with the GMMbased algorithm.consistency of the segment selection. The converted spectrum in the proposed algorithm is calculated by mixing the GMM-based converted spectrum and the DFWbased converted spectrum in order to avoid deterioration of spectral conversionaccuracy. can convert speech features continuously with the correlation between a source feature and a target feature. control of speaker individuality by voice conversion was described.

E ective search algorithm: The computational complexity of segment selec- tion needs to be reduced while maintaining the naturalness of synthetic speech.rmed that the proposed segment selection algorithm and the proposed cost function based on perceptual evaluation are e ective for improving the naturalness of synthetic speech. the proposed voice conversion algorithm can help improve the control of speaker individuality. As an approach to this problem. some clustering algorithms have been proposed to decrease the number of candidate segments [14][35]. In addition. Decision trees are constructed by utilizing target features in segment selection in advance in order to cluster similar candidate segments into the same classes.2 Future Work Although we have improved the corpus-based TTS system. a number of problems still remain to be solved. 93 . 6.

Bulyko et al.2. However. Measures against voice-quality variation: Variation in voice quality is caused by recording the speech of a speaker for a long time. based on perceptual characteristics [25][86]. It is expected that a larger-sized corpus will be used in the future to synthesize high-quality speech more consistently. target information predicted from contextual information. Moreover. we did not determine the optimum weight set for sub-costs. Although a determination algorithm based on linear regression has been proposed. proposed a selection algorithm that considers multiple targets [15]. it is important to avoid selecting segment sequences with unacceptable accent information. and construction of a practical and ecient cache of sub-costs with high computational cost [9]. much more research is needed to solve this problem. which is applied in our TTS. cepstral distortion. Moreover. the predicted target information is not always the best in cases where only a small number of candidate segments having the predicted target exist in the corpus. it might be promising to utilize not only accent information but also other contextual information. since accent information is crucial. it uses an acoustic measure. However. a pruning algorithm that considers concatenation between candidate segments has been proposed [18]. Some approaches as described in Section 2. syntactic structure in segment selection. proposed a selection algorithm based on acceptable prosodic targets in Japanese speech synthesis [42]. More approaches from various viewpoints are needed to address this problem. e.g.Other approaches have proposed pre-selection by sub-costs with small computational cost [29]. Especially in Japanese.g. However. the correspondence to perceptual characteristics of acoustic 94 Utilization of multiple targets: Almost all algorithms use the most suitable Improvement of cost: Cost functions for segment selection should be improved . e. as an objective measure in the regression [47]. Hirai et al. As an interesting approach to this problem. Moreover. In this thesis. It is assumed that segment sequences with smooth concatenations cannot be selected under such conditions.5 have been proposed [63][94].

it is necessary to explore a measure having better correspondence to perceptual characteristics. Therefore. spectral modi. Various applications of TTS: Not only general-purpose TTS but also limited domain TTS have been studied [15][26].measures is not sucient. As an e ective approach in the case of a small-sized corpus.

95 . However. a prosodic conversion algorithm is also needed since prosody is an important factor for representing speaker individuality. cross-language voice conversion has been proposed by Abe et al. Moreover. [3]. Further improvement of voice conversion: Voice conversion is an attractive Cross-language voice conversion: As an interesting application of voice con- version. it is necessary to synthesize not only speech in a reading style but also speech in various speaking styles [54][65] as well as expressive speech [16][48] in order to realize rich communication between man and machine. We applied our voice conversion algorithm to this type of cross-language voice conversion and evaluated the performance of the conversion [73].cation has been applied to concatenation speech synthesis [116]. This technique can synthesize other-language speech uttered by a speaker who cannot naturally speak the language. More research is needed to advance cross-language voice conversion. no practical voice conversion systems has been developed yet. technique. Further improvement in performance is indispensable. A number of these problems still remain to be solved. Moreover.

Appendix A

Frequency of Vowel Sequences

Table A.1 shows the frequency of vowel sequences in newspaper articles comprised of 571,283 sentences. Phoneme sequences were divided into CV* units [60], and then we calculated the frequency of vowel sequences while ignoring consonants. Semivowels were considered vowels, e.g. both /kai/ (CVV) and /ai/ (VV) are considered /ai/ (length = 2). The length of long vowels is set to 1, and that of devoiced vowels is set to 0.

Table A.1. Frequency of vowel sequences

Length   Frequency    Normalized frequency [%]   Number of different sequences
0        518553       2.7590                     1
1        16069643     85.4999                    11
2        1677866      8.9272                     99
3        459538       2.4450                     575
4        57090        0.3038                     1154
5        10306        0.0548                     1155
6        1582         0.0084                     473
7        306          0.0016                     146
8        28           0.0002                     30
9        4            0.0000                     4
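The counting procedure can be illustrated with a small, informal script. It is an assumption-laden sketch, not the tool actually used to build Table A.1: the phoneme notation (single-character vowels, a trailing "_" marking devoiced vowels) is invented for the example, long vowels are assumed to be already collapsed to a single vowel symbol, and semivowels are not treated specially.

```python
from collections import Counter

# Hypothetical phoneme notation: vowels are single characters; a trailing "_"
# marks a devoiced vowel (counted with length 0); long vowels are assumed to be
# collapsed to one vowel symbol beforehand. Illustration only, not the thesis tool.
VOWELS = {"a", "i", "u", "e", "o"}

def vowel_run_lengths(phonemes):
    """Lengths of maximal vowel sequences, ignoring consonants."""
    lengths, run = [], 0
    for ph in phonemes:
        if ph.rstrip("_") in VOWELS:
            if not ph.endswith("_"):      # devoiced vowel stays in the run with length 0
                run += 1
        else:                             # consonant closes the current run
            lengths.append(run)
            run = 0
    lengths.append(run)
    return lengths

# /kaiteki/ -> runs of length 0 (before /k/), 2 (/ai/), 1 (/e/), 1 (final /i/)
print(Counter(vowel_run_lengths(list("kaiteki"))))
```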

Appendix B

Definition of the Nonlinear Function P

We performed perceptual experiments on the degradation of naturalness caused by prosody modification with STRAIGHT [58][59] in order to define a sub-cost function on prosodic difference.

Listeners evaluated the degradation on a scale of seven levels, namely 1 (very bad) to 7 (very good). The perceptual score was calculated as an average of the normalized score calculated as a Z-score (mean = 0, variance = 1) for each listener in order to equalize the score range among listeners. The perceptual scores were modeled by a nonlinear function as follows:

    z(x, y) = min{ g(x, y), s },                                                          (B.1)

    g(x, y) = a [ (x/S_x)^2 + 2r (x/S_x)(y/S_y) + (y/S_y)^2 ]^(1/2) + b,                  (B.2)

    S_x = { S_xp (x > 0); S_xm (x <= 0) },   S_y = { S_yp (y > 0); S_ym (y <= 0) },       (B.3)

where x and y denote the F0 modification ratio and the duration modification ratio by octave, respectively. The parameters a, b, r, s, S_xp, S_xm, S_yp, and S_ym are given by

    a = -1.574549,    b = 2.649368,    r = 0.062317,    s = 0.942505,
    S_xp = 0.568179,  S_xm = 0.400522, S_yp = 0.913813, S_ym = 0.924295.                  (B.4)

These parameters were estimated by minimizing the error between the perceptual score and that estimated by the function z(x, y).

The nonlinear function P in Equation (3.3) is defined as follows:

    P(x, y) = -z(x, y) + s.                                                               (B.5)

The function is shown as the contour lines in Figure B.1. In the cases of 1) the sub-cost on F0 discontinuity as described in Equation (3.4) and 2) no signal processing for prosody modification in waveform synthesis, the parameters are given by

    a = -1.574549,    b = 2.649368,    r = 0.0,    s = 1.074819,
    S_xp = S_xm = 0.400522,    S_yp = S_ym = 0.913813.                                    (B.6)

[Figure B.1. Nonlinear function for the sub-cost on prosody: contour plot of P over the F0 modification ratio [octave] and the duration modification ratio [octave].]
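For concreteness, the sub-cost can be evaluated with the following minimal sketch. It assumes the functional form reconstructed above (Equations (B.1)-(B.5)) and the parameter values of (B.4); it illustrates that assumed form only and is not the thesis implementation.

```python
import math

# Parameter names follow (B.4); the functional form mirrors Equations (B.1)-(B.5)
# as reconstructed above (an assumption), not a verified implementation.
PARAMS = dict(a=-1.574549, b=2.649368, r=0.062317, s=0.942505,
              Sxp=0.568179, Sxm=0.400522, Syp=0.913813, Sym=0.924295)

def prosody_subcost(x, y, p=PARAMS):
    """x: F0 modification ratio [octave], y: duration modification ratio [octave]."""
    Sx = p["Sxp"] if x > 0 else p["Sxm"]                    # (B.3)
    Sy = p["Syp"] if y > 0 else p["Sym"]
    d = math.sqrt((x / Sx) ** 2
                  + 2 * p["r"] * (x / Sx) * (y / Sy)
                  + (y / Sy) ** 2)
    g = p["a"] * d + p["b"]                                  # predicted perceptual score, (B.2)
    z = min(g, p["s"])                                       # (B.1)
    return -z + p["s"]                                       # (B.5): non-negative cost

print(prosody_subcost(0.8, 0.3))   # cost for +0.8 octave F0 and +0.3 octave duration change
```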

e. on Mismatch of Phonetic Environment s p The sub-cost functions were determined from results of perceptual experiments. P hu ( ) = 0z (P h.C Sub-Cost Functions. which is utilized in order to convert the perceptual score into a positive value. P h Ph Ph b : S 99 . (C. S and S . namely 1 (very bad) to 7 (very good). is set to 1 5. P hu ) + b. replacing a phoneme e with a phoneme u. The cost p of capturing the degradation caused by a mismatch with the preceding environment is determined in a similar way. Listeners evaluated the degradation on a scale of seven levels. P he . variance = 1) for each listener in order to equalize the score range among listeners. in which listeners evaluated the degradation of naturalness by listening to the speech stimuli synthesized by concatenating phonemes extracted from various phonetic environments. z P h. P he . P h . The cost s of capturing the degradation caused by a mismatch with the succeeding environment is given by S Ss P h. Perceptual score was calculated as an average of the normalized score calculated as a Z-score (mean = 0. The experimental method is described in [64]. i.1) Ph where ( e u ) denotes the perceptual score in the case of a mismatch of the succeeding environment in a phoneme .

Heinz. Voice conversion through vector quantization. T. ATR Technical Report. Conkie. pp. (E). No. 1990. Reduction of speech spectra by analysis-by-synthesis techniques. 1725{1736. 100 . A voice conversion based on phoneme segment mapping. Kuwabara. J. K. 1998. Jpn. TR-I-0166. A. Am. J. and A. [8] M.M. Syrdal. [5] M. Acoust. 1927{1930. Kagoshima. 90. Joint Meeting of ASA. J.K. EAA. J. Kuwabara. Vol. Speech database user's manual. Abe. H. 3. [7] M. Jpn. Nakamura. ICSLP. and A. K. Vol. Vol. pp. and H. K. Stevens. Nov. House. Am. and A. Australia. pp. Y. Statistical analysis of bilingual speaker's speech for cross-language voice conversion. 2. Vol. Soc. [3] M. pp. Proc. 1961. pp. Fujisaki. [2] M. Conkie.. (E). Shikano. S. Abe and S. Soc.References [1] M. [4] M. Australia. Stylianou.N. 3rd ESCA/COCOSDA International Workshop on Speech Synthesis. Kuwabara. 12. 11. Acoust. Akamine and T. A. 1990. and DAGA.G. Shikano. Umeda. J. Syrdal. Soc. 1991. Abe. Sagayama. 1998. No.K. Schroeter.S. No. 13. Beutnagel. and H. J. 185{190. Y. 1. Dec. 131{139. Abe. The AT&T Next-Gen TTS system. No. pp. 71{76. Soc. [6] C. Analytic generation of synthesis units by closed loop training for totally speaker driven text to speech system (TOS Drive TTS). 1992. Beutnagel. Acoust. Diphone synthesis using unit selection. Sagisaka. 76{82. Sydney. Jenolan Caves. 33. Bell. Proc. Acoust.. H.

601{604.com/projects/tts/pubs. 2002. Proc. Rapid unit selection from a large speech corpus for concatenative speech synthesis.W.W. Budapest.research. Proc. Proc. Riley. Limited domain synthesis. Taylor. A. 607{610.S. pp. Oct.att. Mohri. Black and K.html M. 1265{1268. Bulut.[9] [10] [11] [12] [13] [14] [15] [16] [17] [18] Berlin. http://www. Vol. 1995. Sep. 411{414. 983{986. ICSLP.W. CHATR: a generic speech synthesis system. Narayanan. Ostendorf. A. pp. Beijing. U. 1385{1388. Lenzo. Black and A. 1996.W.A. A. Philadelphia. Sep. Ecient integrated response generation from multiple targets using weighted . Kyoto. Sep. 1996.W. 2000.J. S. EUROSPEECH. Hungary. Germany. Greece. ICSLP. Syrdal. 1997. pp. Taylor. Generating 0 contours from ToBI labels using linear regression. Optimizing selection of units from speech database for concatenative synthesis. China.S. TR-IT-0163. Aug. Japan. 2. Black and N. Beutnagel. Automatically clustering similar units for unit selection in speech synthesis. Bulyko and M. Madrid. 1994. A. Mar. and M. Hunt. ATR Technical Report. Denver. I. Building practical speech synthesis systems. 581{584. Proc. Sep.K. M. ICSLP. pp. Rhodes.W. 1999. Proc. EUROSPEECH. 1999.. COLING. A. Proc.S. Proc. Expressive speech synthesis using a concatenative synthesizer. pp. Black and P. Campbell. Sep. EUROSPEECH. pp.. A. M. A. U. Black and P.A. Black. Spain. pp.

Robust splicing costs and ecient search with BMM models for concatenative speech synthesis. U. Vol. M. 533{550..A. 16.S. 3{4.nite state transducers. May 2002. Proc. Bilmes. pp. pp. F 101 . Bulyko. Orlando. Computer Speech and Language. No. and J. ICASSP. Ostendorf. 2002. I. 461{464.

Canada. 1992.A.N. Proc. 61{64. New York. Prosodic encoding of syntactic structure for speech synthesis. ICSLP. Campbell.W. Campbell and C. Campbell. pp. U. pp. Ban . 1994. [20] W.N. 369{372. Sep. Oct.[19] W. CHATR: a high-de. Prosody and the selection of units for concatenation synthesis. [21] W.N.. Wightman.S. 2nd ESCA/IEEE Workshop on Speech Synthesis. Proc.

Dec. Proc. http://www. [22] W.S. Chu and H.N. 1997. Progress in Speech Synthesis. Olive. and E.N.A. ICASSP. Yang. 293{304. Proc. EAA. Joint Meeting of ASA. Chang.P. and J. Sproat. J.W. ICSP. and E. R. [24] M. Hirschberg. J. Isard.com/projects/tts/pubs. Proc. Conkie. Campbell and A. 453{456. [27] A. H. Optimal coupling of diphones.H.A. pp. Campbell. U. U. Peng. pp. J. pp.att. Selecting non-uniform units from a very large corpus for concatenative speech synthesizer. Aalborg.Y. Chu. Germany.research.. Peng. Peng.H. Van Santen. Conkie and S. 2001. and J. May 2002. C.A.W. Domain adaptation for TTS systems. Proc. H. Chu. ICASSP..S. [23] W. 1999. Seoul. [26] M. Orlando. Van Santen.nition speech re-sequencing system. pp. N.. 1996. 2087{2090. Sep. 279{292. Springer. Hirschberg. 1223{1228. EUROSPEECH. pp. Li. An objective measure for estimating MOS of synthesized speech. Robust unit selection system for speech synthesis.P. May 2001. U. Olive. Chang. R. Mar. Black. Aug.S. and DAGA. Springer. N. [28] A. Berlin. [25] M. J.html 102 . Hawaii. Joint Meeting of ASA and ASJ. pp. 785{788. Progress in Speech Synthesis. Denmark.W.P. 1997. Korea.Y... Sproat. Processing a speech corpus for CHATR synthesis. Prosody and the selection of source units for concatenative synthesis. Proc. H. 1997. pp. Salt Lake City. 183{186.P.

[29] A. [30] A. Proc. M.P. Brown.E. Royal Statist. and D. China. A. 3. Campbell. pp. Conkie. Beutnagel.B. and N. ICSLP. Improving speech synthesis of CHATR using a perceptual discontinuity function and constraints of prosodic modi. Laird. Vol. Preselection of candidate units in a unit selection-based text-to-speech synthesis system. pp. and P. No. [31] W.K.M. N. Beijing. Maximum likelihood from incomplete data via the EM algorithm. 39. 279{282. Ding. Oct. 1{38. 2000. K. 1. Syrdal.. Rubin. J. Soc. Fujisawa. 1977. Dempster. Vol.

pp.M. Remaking speech. Dusterho and A. A. The IBM trainable speech synthesis system. Jenolan Caves. 3rd ESCA/COCOSDA International Workshop on Speech Synthesis. pp. Computer Speech and Language. Sep.cation. Sydney. Greece. Woodland. 1998. pp. pp.E. pp. Proc. Hungary. Generating 0 contours for speech synthesis using the Tilt intonation theory. Vol.D. 1. pp. No. Donovan and E. Athens. Proc. Soc. 1998. Proc. 223-241. Acoust.E. June 2000. [33] R. Budapest. ESCA Workshop on Intonation. F 103 .C. Sep. 1939.R. Vol. Donovan and P. A hidden Markov-model-based trainable speech synthesizer. 1703{1706. 13. 40{50. Proc. pp. 191{194. Audio Electroacoust. Australia.E. ICASSP. Australia. Istanbul. Maxey. Am. 937{940. Using decision trees within the Tilt intonation model to predict 0 contours. Dixon and H. 1999. Eide. No. [35] R. 3. 169-177. ICSLP.E. Vol. pp. 1968. Taylor. 2.. Proc. Terminal analog synthesis of continuous speech using the diphone method of segment assembly. Dudley. No. 11. 1997. Black. Segment pre-selection in decision-tree based speech synthesis systems. 1999. IEEE Trans. [34] R. 16.W. J.. Donovan. Turkey. Dec.W. and P. Dusterho . [36] H. 107{110. Black. F [38] K.E. EUROSPEECH. Nov. [37] K. 1627{1630. [32] N.

Nov. IEEE 2002 Workshop on Speech Synthesis. S. Hakoda. 1984. 5. K. J. and K. 1997. N. Hirokawa. 1989. J. Hirai. and H. Speech synthesis using a waveform dictionary.P. 333{346. pp.P. Speech unit selection based on target values driven by speech data in concatenative speech synthesis.. 4.W. pp. Hakoda. T. [44] T. Hirai. Sproat. 809{812.S. Tenpaku. pp. Sep.H. pp. A new Japanese text-to-speech synthesizer based on COC synthesis method. Sep. N. Santa Monica. Olive. 1990. (E). No. 233{242. Hirokawa and K. 2002. Progress in Speech Synthesis. ICSLP.[39] H. Fujisaki. Soc. Paris. Hirose. Van Santen. Jpn. Proc. Analysis of voice fundamental frequency contours for declarative sentence of Japanese. Sagisaka. Segment selection and pitch modi. Hirschberg. Hirokawa. Shikano. France. N. Automatic extraction of 0 control rules using statistical analysis. Acoust. Mizuno. Vol. S.Y. EUROSPEECH. [43] T. Japan. Kobe. Proc. Proc.A. F [42] T. Nakajima. Higuchi. Springer. J. Iwahashi. 140{143. and Y. R. and J. [40] K. U.. [41] T.

Kobe. pp. [45] K. S. Japan. May 1996. ICSLP. Black. Iida. and M. Unit selection in a concatenative speech synthesis system using a large speech database. Improved corpus-based synthesis of fundamental frequency contours using generation process model. Eto. U. U.. [47] A. Apr. M. Minematsu. 2085{2088. Yasumura. ICASSP. 1990. ISCA 104 . pp. Proc. Generation of prosodic symbols for rule-synthesis of connected speech of Japanese. Kawai.S. 337{340.A.cation for high quality speech synthesis using waveform segments. and N. pp. N. [48] A. H. [46] K.S.W.A. Higuchi. ICASSP. ICSLP. Proc. Atlanta. 1986. 2002. A speech synthesis system with emotion for assisting communication. Proc. Proc. Nov. Japan. Fujisaki. Tokyo. Hirose. F. Iga.. pp.J. Proc. 373{376. Hirose. Hunt and A. and H. Campbell. Sep. Denver. 2415{ 2418.

Sumita. Mel log spectrum approximation (MLSA) . Northern Ireland. Imai. Sep. 2000. pp. Furuichi. 167{172. [49] S.Workshop on Speech and Emotion. and C. K. Belfast.

963{966. ICASSP. Iwahashi. No.A. S. No. 1997. and S. and T. Sydney. ICASSP. IEICE Trans. Sep. Isogai and H. and Y. Apr. 1995.. pp. Kain and M. Australia. 2. [53] N. Vol. [50] M. pp. 577{580. [54] K. F [57] A. A new 0 contour control method based on vector representation of 0 contour. pp. F F [51] K. Hirokawa. pp. 1994. Proc. 139{ 151. IEEE 2002 Workshop on Speech Synthesis. Sagisaka. [55] T. 105 .. Yamada. Kaiki. 2002. 1993. Macon.A. E76-A. Australia. Kagoshima and M. Sagisaka. pp. Speech Communication. Fundamentals. Nakajima. J66-A. Furui. [56] T. Vol. 1975{1978. Munich. Seattle. Iwahashi and Y. 1983 (in Japanese). Togawa. N. 285{288. Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Iwano. Automatic generation of speech synthesis units based on closed loop training. ICASSP.lter for speech synthesis. Akamine. Germany. Akamine. 2. pp. Santa Monica. Speech-rate-variable HMMbased Japanese TTS system. 1999. T. 727{730.S. Kagoshima and M. [52] N. Proc. Apr. 1942{1948.W. M. Adelaide. pp. IEICE Trans. Proc. Proc. Proc. Dec. Itoh. 16. 1998. No.S. Vol. U. Mizuno. 11. A new waveform speech synthesis approach based on the COC speech spectrum. 122{129. pp. Spectral voice conversion for text-to-speech synthesis. EUROSPEECH. Sep.. U. Budapest. Speech segment selection for concatenative synthesis based on spectral distortion minimization. ICSLP. Hungary. An 0 contour control model for totally speaker driven text to speech system. Proc. May 1998.

Adelaide. pp. Shimizu. pp. IEEE Trans.A. Kawahara. J. F [59] H. No. Masuda. 1994. 2002. Kawanami. Sep. 2781{2784.S.[58] H. Sep. Shikano. pp. ICSLP. T. T. Proc. Santa Monica. and T. S. 187{207. 1. Masuda-Katsuse. Higuchi. 2002. and S. Budapest. Veldhuis. pp. [61] H. H. 6. Proc. 3{4. Patterson.A. Apr. 420{425. ICASSP. Proc. 1999.S. [63] H. ICSLP. pp. Shimizu.D.A. A study on time-dependent voice quality variation in a large-scale single speaker speech corpus used for speech synthesis. Speech and Audio Processing. No. Hirose. Hungary..de Cheveign e. Vol. 2425{ 2428. U. Katayose. pp. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of 0 and periodicity.de Cheveign e. N. 106 . Development of a textto-speech system for Japanese based on waveform splicing. Designing Japanese speech database covering wide range in prosody. 2002. Klabbers and R. 2001. 2621{2624. Denver. pp. and R. Proc. U. Kawai. Oct. Kawai and M. [66] E. Sep. Toda.. N. Acoust. Reducing audible spectral discontinuities. 9. ICSLP. 39{51. phonetic features as predictors of audible discontinuity in concatenative speech synthesis. Vol. [64] H.S. Denver. A. Rules for generating prosodic features for text-to-speech synthesis of Japanese. Kawai. 3. [62] H. Vol. F [60] H. Jpn. 433{442. EUROSPEECH. K. Soc. Yamamoto. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based 0 extraction: possible role of a repetitive structure in sounds. Sep. Proc. [65] H. A design method of speech corpus for text-to-speech synthesis taking account of prosody. 50. Tsuzaki. pp. 2000. Kawai and M. Fujisaki. 27. and K.. Yamamoto. Proc. Higuchi. Australia. Kawai. IEEE 2002 Workshop on Speech Synthesis. 1994 (in Japanese). 569{572. U. Speech Communication. Acoustic measures vs. and A. and H. T. Tsuzaki. China. I. Vol. Beijing. 1999. (J). Kawahara. No.

[73] M. 2. and Satoshi Imai. 2002. Banno. 82. 1995. Proc. Speech Communication. Yamashita.H. Lin. and F. Itakura. Tokuda. No. Vol. Voice characteristics conversion for HMM-based speech synthesis system. pp. pp. Kajita. Japan. 1993. pp. [70] M. and N. Proc.. No. 827{830.C. T. Am. Vol. Acoust. 161{164. Inoue. pp. K. [71] C. IEEE Trans. 1991. 1990. Proc. Lee. Germany. Nov.H.J. Unsupervised speaker adaptation from short utterances based on a minimized fuzzy objective function. Proc. IPSJ Journal. J. Vol. ICSLP. 14. 2001. K. 16. 2227{2230. 5. 4. ICASSP. [75] H. 737{793. Kobe. Masuko. S. Kuwabara and Y. Denmark. 3. 2177{2185. Review of text-to-speech conversion for English. pp. (E). H. No. Lee. Vol. [68] H. K. 353{361. Perceptual cost functions for unit searching in large corpus-based concatenative text-to-speech. Acoustic characteristics of speaker individuality: control and conversion. Sep. Vol. A study on speaker adaptation of the parameters of continuous density hidden Markov models. [69] C. Signal Processing. Mashimo. Soc. Soc.H. Juang.[67] D. EUROSPEECH. [72] N. Vol. Budapest. T. and B. No. [74] T. 171{185. Matsumoto and Y. 43. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models.1611{ 1614. Leggetter and P. pp. Crosslanguage voice conversion evaluation using bilingual databases. Woodland. H. Kobayashi. C. EUROSPEECH. [76] H. Munich. 1999. 1995. 2. 39. Jpn. 806{814. Shikano. A minimum distortion spectral mapping applied to voice quality conversion. No. Campbell. Apr. Maeda. J. pp. Sep. Klatt. Kawanami. 165{173. pp.H. Speaker conversion through non-linear frequency warping of STRAIGHT spectrum. Takeda. pp. 9. Aalborg. Matsumoto and H. 107 . Acoust. 7. Toda. Hungary. pp. 1987. No. Sagisaka. 1997. Computer Speech and Language.

1988. Proc. ICASSP. Charpentier.. and K. Speaker adaptation applied to HMM and neural networks.A.S. [81] S. U. Santa Monica. Moulines and F. Apr. [80] S. Proc. pp. 453{467. Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt.S. Murthy. 16. Minematsu. Vol. [82] M. ICASSP. B. 9. Yegnanarayana.. New York.A. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Sep. H. 1995. Mizuno and M. S. No. No. Scotland. Nakamura and K. 5{6. Speech Communication. Abe.A. Vol. pp. pp. Proc. 1990. [79] E. 2. May 1989. 89{92. Automatic estimation of accentual attribute values of words to realize accent sandhi in Japanese text-to-speech conversion. Narendranath. Kita.[77] N. R. IEEE 2002 Workshop on Speech Synthesis. Glasgow. Automatic generation of synthesis units based on context oriented clustering. Transformation of formants for voice conversion using arti. [78] H. Rajendran. pp. 2002. Shikano. Speech Communication. Hirose. Hamada. 659{662. Nakajima and H. 153{164. U.

M. Vol. 1995. Sagayama. and S. [83] K. 207{216. Speech Communication. No. Sugiyama.cial neural networks. Speaker adaptation based on transfer vector . 16. Ohkura. 2. pp.

Ban . Perpetually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation.V. 1977. Proc. 1972.A.S. 1992.A. [84] J. Johnson.P. ICASSP. 108 .H. Zhao. [86] H. Hartford. Rule synthesis of speech from diadic units. [85] A. IEEE. Discrete representation of signals. Proc. Sep. 2002. Chu. 60. 2613{2616. ICSLP.eld smoothing with continuous mixture density HMMs. Denver. Y. Oppenheim and D.S. pp. Oct. Olive. pp.. Vol. ICSLP. 6. Proc. No. pp. U. Proc. 681{691. and M. Peng. U. pp. 369{372. Canada.. 568{570. May.

1958. Shi. Proc. 1988. Higuchi.. Japan. Vol. J. Sagisaka. 51. Proc. N. [97] K. 1978 (in Japanese). J. Peng. 8. 10.. 1984 (in Japanese). Jpn. 1986. 2002. Denver. (J). pp. 1992. pp. Sep. W. 739{742. ICSLP. Speech Res. Acoust. 11. ATR -talk speech synthesis system. Segmentation techniques in speech synthesis. Speaker adaptation through vector quantization. 1. Satoh. 1983 (in Japanese). J66-D.. Soc. 849{856. No. [96] T. Speech synthesis using CVC concatenation units and excitation waveform elements. IPSJ Journal. Soc. and E. M. Price. Vol. Ostendorf. 3{13. Vol. pp. Power spectral density based channel equalization of large speech database for concatenative TTS system. Sato. U. Acoust. ICSLP. and J. S83-69. A morphological analyzer for a Japanese text-to-speech system based on the strength of connection between words. No. 1993 (in Japanese). [94] Y. [95] K. H. Acoust. 867{870.  [91] Y. 1992. Beckman. Kawai. No. pp. Chang. 541{546. Mimura.. Speech synthesis by rule using an optimal selection of nonuniform synthesis units.A. 7. Hirschberg. ICSLP. pp. IEICE Trans. pp. [90] Y. Jpn. Pitrelli. Silvertsen. Sagisaka and H. C. Pierrehumbert. Silverman. ICASSP. 2369{2372. [89] Y. Peterson.S. Apr. Kaiki. 2643{2646.. P. Accentuation rules for Japanese word concatenation. Canada. U. ToBI: a standard for labeling English prosody. Vol. Iwahashi. J. N. and R. [92] H. Comm. and S. and K. Lee. 1281{1286. [88] Y. H. Chu.. ICASSP. pp. Ban . Canada. Trans. Speech synthesis on the basis of PARCOR-VCV concatenation units. Sagisaka. Wang.. 34. Proc. Oct. E. [93] H. Sagisaka. Soc. Tokyo. pp. and M. Sato. 1995 (in Japanese). Proc. 109 . Vol. Wightman. Proc. Ban . New York.F. Natural language processing in speech synthesis. N. 30. Shimizu. J61-D. No. Yamamoto. IEICE Trans. Shikano.A. 679{682. pp. pp. pp.[87] G. Am. 483{486. K. Oct. No. 858{865.S. J. Apr. Reddy. M.

Stylianou. Moulines. pp. 1995. Capp e. O. [99] Y. Stylianou. Capp e. Madrid. and E. EUROSPEECH. and O. Statistical methods for voice quality transformation. 447{450. Proc. A system voice conversion based on probabilistic classi. Spain. Sep.[98] Y.

Proc. [102] A. Syrdal. Speech and Audio Processing. Vector-. Proc. Wightman. Sep. No. 2001. pp.J. U.K. K-S.cation and a harmonic plus noise model. 979{982.K. Corpus-based techniques in the AT&T NextGen synthesis system. pp. Takahashi and S.W. IEEE Trans. Seattle. Salt Lake City. J. pp. Conkie.S. Schroeter. Vol. Strom. Applying the harmonic plus noise model in concatenative speech synthesis. and M. Lee. Vol. Syrdal. V. Oct. 9. Denmark.A. Stylianou. ICASSP. Phonetic e ects on listener detection of vowel concatenation. ICASSP. M. Proc. May 2001. Perceptual and objective detection of discontinuities in concatenative speech synthesis. Beutnagel. pp. Stylianou and A. 21{29. ICSLP. A. 2001. Y. [100] Y. 2000. Makashay.. pp. [104] J. 410{415. 3. C. Stylianou. 1. Aalborg.A. [101] Y.K. Beijing. 281{284. Sagayama. 837{840. Proc. China. Syrdal. U.S. May 1998. EUROSPEECH. [103] A..

eld-smoothed Bayesian learning for incremental speaker adaptation. A Japanese TTS system based on multiform units and a speech modi. Abe. Takano. pp. ICASSP. Detroit. 696{699. K. [105] S. May 1995. Nakajima. Tanaka. Mizuno. M. U. Proc. and S.S..A. H.

and designs. Elsevier Science. Bailly. On the basic scheme and algorithms in non-uniform unit speech synthesis.cation algorithm with harmonics reconstruction. models. Talking machines: theories. 9. No. 1. Sagisaka. C. 1992. Sawallis. Vol. and T. K. 3{10. G. 93{105. pp. pp. North-Holland. 110 . Speech and Audio Processing. Benoit. 2001. IEEE Trans. and Y. [106] K. Takeda. Abe.

2{3. 3. Tubach. Tamura. Proc. 175{178. U. Synthesizing conversational intonation from a linguistically rich input. Valbret. Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR. Tsuzaki and H.A. Istanbul. Salt Lake City. No. Moulines and J.P. T. Toda. pp. U.A.. [109] T. 1994. Dec. 1315{1318. Kobayashi. Tokuda. Beijing. ICSLP. Feature extraction for unit selection in concatenative speech synthesis: comparison between AIM. Kobayashi. Proc. Taylor and A. T. Sydney. ICASSP. 805{808. 111 . Sep. 2747{2750. Kitamura. Shikano. Proc. Voice transformation using PSOLA technique. 1998. Toda. ICASSP. pp. Black. and T. Masuko. Masuko. [115] J.S. Denver. Proc. and T.A. Macon. H.S. June 2000. [110] T. Australia. Yoshimura.[107] M. and K. Proc. Tokuda.S. ICASSP. Lu. 1992. pp. Phoenix. 137{140. LPC. May 2001. Speech Communication. 2000. pp. 2002. Oct. and K. H. 11. Shikano. Tokuda. N.. New York. Salt Lake City. ICASSP. Saruwatari. [114] H. Vol. Masuko. U. T. and T.S. 279{282. Sep.. May 1999. 841{844. pp. May 2001. E.A. K.S. Kawai. Hidden Markov models based on multi-space probability distribution for pitch pattern modeling. pp. STRAIGHT-based voice conversion algorithm based on Gaussian mixture model. Miyazaki.A. Proc. Proc. 229{232. 2nd ESCA/IEEE Workshop on Speech Synthesis. China. pp. and MFCC. T. Vol.W. pp. Proc. Saruwatari. T.W. J. U. Wouters and M. ICSLP. 175{187. [112] K. [108] P. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. [113] M. pp. ICSLP.. A perceptual evaluation of distance measures for concatenative speech synthesis. Kobayashi. Speech parameter generation algorithms for HMM-based speech synthesis.. [111] K. U. Turkey.

199{206. Proc. 1. Wouters and M. Speaker interpolation for HMM-based speech synthesis system. Soc. K. No. Masuko. Jpn. T. Kobayashi. Hungary. Tokuda. and T. 30{38. T. 4. 2000. T. Kobayashi. Sep. Yoshimura. [118] T. [117] T. pitch and duration in HMM-based speech synthesis. 2001. Yoshimura. IEEE Trans. and T. 112 . pp. EUROSPEECH. 2347{2350. Tokuda. Control of spectral dynamics in concatenative speech synthesis. J. Vol. (E). K. pp.[116] J. 21. Acoust. Kitamura. Simultaneous modeling of spectrum. Masuko. Kitamura. 9. Budapest. T.W. pp. No. Vol. 1999. Speech and Audio Processing. Macon.

Crosslanguage voice conversion evaluation using bilingual databases. No. 10. Kawanami. 1760{ 1770. 7. H. M. H.. No. Vol. 2002 (in Japanese). 3. 12. Saruwatari. Kajita. Saruwatari. F. No. Shikano. J83-D-II. H. Lu. M. Toda. 4. 3. K.. T. 43. Shikano. Vol. 2000. 113 . 2001 (in Japanese). pp. pp. A segment selection algorithm for Japanese concatenative speech synthesis based on both phoneme unit and diphone unit. IPSJ Journal. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping. and K. pp. Shikano. pp. Beijing. Proc. Shikano. Vol. 279{282. Dec. ICSLP. International Conferences 1. Shikano. Mashimo. Takeda. Toda. 2177{2185. Tsuzaki. Nov.List of Publications Journal Papers 1. IEICE Trans. IEICE Trans. 2000 (in Japanese). and N. J. T. No. pp. T. 2181{2189. Improvement of STRAIGHT method under noisy conditions based on lateral inhibitive weighting. 2. Kawai. Lu. S. IEICE Trans. Toda. Vol.. and K. Vol. Toda. 11. T. T. Toda. Oct. J. Campbell. China. Banno. Itakura. J85-D-II. and K. STRAIGHT-based voice conversion algorithm based on Gaussian mixture model. H. J84-D-II. Oct. 2180{2189. H. July 2002. and K. K.

pp. Evaluation of cross-language voice conversion using bilingual and non-bilingual databases. Mashimo. ICASSP. High quality voice conversion based on Gaussian mixture model with dynamic frequency warping. H. pp. EUROSPEECH. Toda. J. T. and K. Toda. Kawai. Toda. Toda. Proc. Proc. T. 293{296. T. Shikano. ICASSP. pp. U. Kawanami. T. Oct. 2003 (accepted). and K. M.A. LREC. M. 6. K. Campbell.. Segment Selection Considering Local Degradation of Naturalness in Concatenative Speech Synthesis. H. Spain. and K. 361{364. 2000. H. Proc. Toda.S. 169{172. M. Lu.. Las Palmas. Shikano. T. pp. Masuda. Shikano. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. and N. K. Kawai. and K. 2002. 9. Aalborg. Shikano. Perceptual evaluation of cost for segment selection in concatenative speech synthesis. 2001. Kashioka. Voice conversion algorithm based on Gaussian mixture model applied to STRAIGHT. Sep. Toda. China. pp. H. Shikano.S. and K. 8. T. S. Toda. Kawanami. U. 841{844. 7. 2039{2042. 3. Unit selection algorithm for Japanese speech synthesis based on both phoneme unit and diphone unit. T.S. U. Sep. Denmark. 349{352. M. H. 465{468. Shikano. Denver. IEEE 2002 Workshop on Speech Synthesis. Mashimo. Kumamoto. Designing speech database with prosodic variety for expressive TTS system. H. H. U. 10. Sep. Tsuzaki.. Saruwatari. WESTPRAC VII.A. Santa Monica. May 2002.A. May 2002. Orlando. ICASSP. Nakamura. Salt Lake City. Shikano. Proc. Kawai. Japan.. 4. EUROSPEECH. Toda. pp. Saruwatari. Evaluation of crosslanguage voice conversion based on GMM and STRAIGHT. H. Aalborg. Denmark.A. Shikano. and K. Proc. Proc. Sep. 5. and K. pp. Tsuzaki. Toda. T. and N. T. 2001. T. 2002.2. Proc.S. Hong Kong. 114 . May 2001. M. Proc. Apr. Proc. Tsuzaki. ICSLP. Campbell. Shikano.

23{29. and F. 25{32.. Shikano.. May 2000 (in Japanese). Shikano. T. 2425{ 2428. Takeda. T. H. An evaluation of automatic phoneme segmentation for concatenative speech synthesis. IEICE Tech. 4. Lu. and K. Toda. H. T. Rep. Toda. Improvement of STRAIGHT method under noisy conditions based on lateral inhibitive weighting. 49{54. Aichi. 7. IEICE Tech. pp. H. SP2001-122. 2002. Rep. Perceptual evaluation of cost for segment selection in concatenative text-to-speech synthesis. H. SP2002-69. T. ICSLP. T.. Toda. Toda. T. Toda. 45{52. IEICE Tech. Kyoto. Fukuoka. Rep. Kawai. A visual speech synthesis method using real image database.. H. Technical Reports 1. Toda. T. U. Gunma. pp. 115 . 2003 (in Japanese). 6.. Sep. pp. pp. Proc. Shiga. S. H. 19{24. Rep. Toda. 2002 (in Japanese). A study on the speech synthesis method by using database with variety of speech-rate.S. Toda. IEICE Tech. SP2002-170. A unit selection algorithm for Japanese speech synthesis based on both phoneme unit and diphone unit. pp. 2002 (in Japanese). 3. IEICE Tech. Kajita. Itakura. M. May 1999 (in Japanese). SP2000-7. Shiga. Saruwatari. pp. Rep. and K. Shiga. Shikano. 5{10. Aug. M.11. pp. Jan. T. Tsuzaki. Kawanami.. Masuda. pp. H. S. Banno. and K. Shikano. Jan. SP2001-120. EA99-7. Voice conversion algorithm based on Gaussian mixture model applied to STRAIGHT. Shiraishi. Saruwatari. Designing Japanese speech database covering wide range in prosody. T. J. Shikano. 2002 (in Japanese). and K. IEICE Tech. Kawai. 5. Kawai and T. Nakamura. Jan. H. Masuda. H. Kawanami. IPSJ SIG Notes. 2. and K. and K. 61{68. Kawanami. Rep.A. K. July 2002 (in Japanese). Tsuzaki. Shikano. Denver.. T. 2002-SLP-42.

Proc. pp. Kawanami. SP2002-171. Jpn.. IEICE Tech. M. Proc. 237{238. Soc. 2000 (in Japanese). H. pp. Proc. Tsuzaki. 1-6-5. Iwami. Emotional speech synthesis using GMM-based voice conversion technique. pp. Sep. A unit selection algorithm for Japanese speech synthesis allowing concatenation at vowel center. Voice conversion algorithm based on Gaussian Mixture Model with dynamic frequency warping. Kawai. Sep. Acoust. Toda. and K. Shikano. Soc. Ibaraki. Jpn. Toda.. Kawai. H. 1-6-13. T. Jpn. Spring Meeting. Akita. H. Shikano. 11{16. Toda. Chiba. H. and K. Proc. Proc. Toda. Tsuzaki. Mar. and K. H.. H. Shikano. Shikano. Soc. Acoust. T. S. Shikano. Spring Meeting. Mar. Shikano. Toda. T. Acoust. 2-1-6. Toda. Soc. Soc. Soc.. Acoust. and K. Acoust. T. Jan. Spring Meeting. Mar. Saruwatari. Rep. Lu. Saruwatari. Fall Meeting.. Acoust. 2. Jpn. and K. Kanagawa. 2000 (in Japanese). 185{186. Evaluation of segment selection considering local degradation of naturalness in concatenative speech synthesis. Iwate. M. 116 . Saruwatari. 1-7-6. 3. pp. Application of voice conversion algorithm based on Gaussian mixture model to STRAIGHT. 6.. Spring Meeting. J. 1-10-6. Shiga. 235{236. 2003 (in Japanese). Meetings 1. T.8. 5. Nakamura. Study on conversion-accuracy on speaker individuality of voice conversion algorithm with dynamic frequency warping. Toda. 2-10-2. Shikano. Jpn. 4. Kawai. Evaluation based on perceptual characteristic of cost for segment selection in concatenative speech synthesis. Proc. Y. Tsuzaki. 2001 (in Japanese). pp. Tokyo. J. Lu. H. Fall Meeting. T. 2003 (in Japanese). pp. T. 2002 (in Japanese).. 247{248. and K. Jpn. M. 2002 (in Japanese). 207{208. Mar. 267{268. and K. pp.

H. STRAIGHT-based prosody modi.7. T. Kawanami. T. and K. Masuda. Shikano. H. Saruwatari. Toda.

. pp. and K. Sep. Kawanami. Jpn. T. H. Shikano. Y. Toda. T. Spring Meeting. Soc. Acoust. Fall Meeting. Jpn. 315{316. Acoust. H.. 117 . Mar. T. 2003 (in Japanese). 10. 2001 (in Japanese). Mar. Mar. Akita. 267{268. Sep. T. 1-2-20. 2003 (in Japanese). 1-10-16. Mashimo. 2-10-13. Proc. Cross-language voice conversion and vowel pronunciation. Acoust. H. Kawanami. Kawanami. Proc. 289{290. Kawai and T. 14. Shikano. 1-P-17. Soc. 261{262. Proc. Acoust. Soc. Saruwatari. Saruwatari. Mashimo. Soc. Soc. Jpn. 2-10-15. An evaluation of a design method of speech corpus for concatenative speech synthesis. Soc. Jpn. and Nick Campbell. Jpn. Iwami. H. 2002 (in Japanese). K. pp. Soc. pp. Acoust. 1-6-23. pp. Oct. 1-10-24. pp. 12. Proc. Proc. and K. Fall Meeting.. 389{390. Designing and evaluation of speech database with prosodic variety. Jpn. H. H. H. Proc. A visual speech synthesis method using real image corpus. Spring Meeting.. 2002 (in Japanese). T. pp. Toda. and K. Jpn. Kanagawa. Akita. Fall Meeting. Shikano. M. 11. Kawanami. Ohita. and Nick Campbell. H. H. Toda. 261{262. Y. and K.. Shikano. Oct. Mar. Adaptation of voice conversion method based on GMM and STRAIGHT to cross-language speech synthesis. Masuda. Shikano.cation of CHATR output. pp. Saruwatari. Proc. H.. pp. Toda. 1-6-20. Toda. Acoust. Iwami. Emotion control of speech using GMM-based voice conversion.. T. 2001 (in Japanese). 245{246. 2002 (in Japanese). Acoust. M. Kawanami. 9. Fall Meeting. Toda. Saruwatari. Ohita. Soc. Spring Meeting. 13. T. Proc. 277{278. Spring Meeting. Evaluation of emotional control with GMM-based voice conversion. Toda. Acoust. H. Kawanami. Tokyo. 2002 (in Japanese). Tokyo. Kanagawa. Shikano. K. 8. T.. Shiraishi. Jpn.

Toda. Saruwatari. T. Shikano. Corpusbased visual speech synthesis using perceptual de. Shiraishi. H. T. H.15. and K. Kawanami.

Department of Information Processing. Jpn. 2001 (in Japanese). Mar. Nara Institute of Science and Technology. Acoust. Graduate School of Information Science.. Feb. 118 . 399{400. 2-Q-13. Proc.nition of viseme. Master's Thesis 1. Toda. T. Spring Meeting. pp. Soc. Tokyo. NAIST-IS-MT9951077. 2003 (in Japanese). High quality voice conversion based on STRAIGHT analysissynthesis method. Master's thesis.