# A STUDY ON THE SPEECH ACOUSTIC-TO-ARTICULATORY MAPPING USING MORPHOLOGICAL CONSTRAINTS

A dissertation submitted to the Graduate School of Engineering of Nagoya University in partial fulfillment of the requirements for the degree of Doctor (Engineering)

By Hani Camille Yehia

Abstract

The representation of speech based on articulatory parameters provides a fertile paradigm for better modeling of the speech process. This modeling is important, for example, for the development of applications such as speech synthesis, coding and even recognition, whose performance is directly related to the method used to represent speech. However, articulatory representation of speech is a goal that, to be achieved, still requires the solution of several problems. A key issue in this context is the inversion of the articulatory-to-acoustic mapping in speech. This study is focused on this point.

The articulatory-to-acoustic mapping in speech is defined here as the mapping that, to every possible articulatory configuration of the vocal-tract, associates a unique acoustic image. This image is defined as the set of acoustic properties inherent in the vocal apparatus. The spaces formed by all possible articulatory configurations and by their acoustic images are called articulatory space and acoustic space, respectively. The articulatory space maps onto the acoustic space, i.e. every point in the articulatory space has a unique image in the acoustic space, and every point in the acoustic space has an inverse image in the articulatory space. However, since the inverse image is not unique, the map is not a bijection (i.e. it is not a one-to-one mapping).

The objective of this study is to estimate a restricted case of the acoustic-to-articulatory mapping, using constraints imposed by the human morphology and by the dynamics of the vocal-tract to determine the point in the articulatory space most likely to map onto a given point contained in the acoustic space. In the restricted case under analysis, only oral vowels are considered. The vocal-tract is represented articulatorily by the log-area function (i.e. the logarithm of the cross-sectional area along the vocal-tract) and acoustically by the set formed by the first three formant frequencies (i.e. resonant frequencies of the vocal-tract). The area function was chosen to represent the vocal-tract articulatorily because it is the articulatory characteristic most directly related to the acoustic characteristics of the vocal-tract. In turn, the set formed by the first three formant frequencies was chosen to represent the vocal-tract acoustically because it is the acoustic characteristic of the vocal-tract least influenced by factors other than the area function.

A major difficulty in appropriately estimating the area function from formant frequencies is the one-to-many characteristic of the acoustic-to-articulatory mapping: there are many area functions that map onto the same set of formant frequencies. The main contribution of this study is the formulation of a framework to incorporate constraints imposed by the human morphology into the mapping of the articulatory space formed by all possible area functions onto the acoustic space formed by all possible sets of formant frequencies. (The morphological information is extracted from a corpus of midsagittal vocal-tract profiles obtained with cineradiography.) By doing so, the articulatory space is limited to contain only area functions compatible with the human morphology and, consequently, the ambiguity observed in the acoustic-to-articulatory mapping is reduced. Subsequently, positional and continuity constraints are added to complete the model. Tests carried out confirmed that, under the constraints of the proposed model, plausible sequences of area functions can be estimated from sequences of formant frequencies.

The procedure followed to formulate and evaluate the proposed method is described step by step along the chapters of the study. Chapter 1 is the introduction. Chapter 2 describes the corpus and the physical model used to represent the relation between the area function and formant frequencies. Chapter 3 explains the parametric models used to represent the area function. Chapter 4 shows the procedure used to estimate the area function from formant frequencies under morphological, continuity and positional constraints. Chapter 5 presents and interprets the results obtained in the tests carried out with the model. Finally, Chapter 6 concludes the study by summarizing the main points of the previous chapters, pointing out the strong and weak points of the method, and indicating directions in which research efforts should be carried out in the future. A brief description of each chapter is given in the following paragraphs.

Chapter 1 gives an overview of the context in which the problem analyzed is inserted, presents the problem and its importance, summarizes the history and the alternative methods available in the literature, explains in general lines the method proposed to solve the problem, and describes the general organization of the remaining chapters.

The first part of Chapter 2 describes the procedure used to obtain a corpus of 519 area functions from midsagittal tracings extracted from cineradiographic data and from labiograms synchronously acquired at a rate of 50 frames per second. The process to obtain the area functions from the raw data is as follows. First, each midsagittal profile is plotted on a semi-polar grid which follows approximately the orientation of the tube defined by the vocal-tract walls. The grid lines divide the vocal-tract into sections. After that, each section is represented by its mean length and by its mean midsagittal distance, i.e. the mean distance between the anterior and posterior walls of the tract. In the next step, the natural logarithm of the mean midsagittal distance of each section is converted into the natural logarithm of the cross-sectional area by means of a linear transformation. (This transformation is, however, only an approximation because, although strongly associated with it, the cross-sectional area is not completely determined by the midsagittal distance.) At this point, each vocal-tract shape is represented by a set of log-areas (i.e. logarithms of cross-sectional areas) and corresponding section lengths. The final step is to resample the log-areas so that they become evenly spaced along the tract. By doing so, each vocal-tract shape can be represented as a vector of log-areas plus the vocal-tract length. The corpus is then arranged in a matrix whose rows contain the log-area vectors. This matrix provides an efficient way to handle the corpus of area functions throughout the study.
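
The conversion from midsagittal distances to an evenly sampled log-area vector, as summarized above, might be sketched as follows. This is only an illustration under stated assumptions: the coefficients `alpha` and `beta` are placeholder values (the abstract gives none), linear interpolation is assumed for the resampling step, and the function name and the toy profile are hypothetical.

```python
import numpy as np


def log_area_vector(mid_dist_cm, sec_len_cm, alpha=1.5, beta=1.4, n_sections=32):
    """Turn per-section midsagittal distances into an evenly sampled log-area vector.

    alpha and beta are illustrative placeholders, not values taken from the thesis.
    """
    d = np.asarray(mid_dist_cm, dtype=float)
    sec_len = np.asarray(sec_len_cm, dtype=float)

    # Linear transformation in the log domain: ln A = ln(alpha) + beta * ln d.
    log_area = np.log(alpha) + beta * np.log(d)

    # Position of each section centre, measured from the glottis.
    edges = np.concatenate(([0.0], np.cumsum(sec_len)))
    centres = 0.5 * (edges[:-1] + edges[1:])
    tract_length = edges[-1]

    # Resample at the centres of n_sections sections of equal length
    # (linear interpolation is an assumption of this sketch).
    new_centres = (np.arange(n_sections) + 0.5) * tract_length / n_sections
    resampled = np.interp(new_centres, centres, log_area)
    return resampled, tract_length


# Toy profile with 5 sections of unequal length (hypothetical numbers).
y, length = log_area_vector([0.8, 1.5, 2.0, 1.2, 0.6], [3.0, 4.0, 4.5, 3.5, 2.0])
```

Stacking the returned vectors (with the tract length appended) row by row would give a corpus matrix of the kind described above.
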

The second part of Chapter 2 explains the physical model used to calculate formant frequencies from the area function. At first, the vocal-tract is modeled as a rigid, lossless tube. After that, viscous losses, glottal opening, lip radiation load and yielding walls are incorporated into the sound propagation model, and their effects on the formant frequencies are analyzed. Finally, a numerical procedure to compute formant frequencies from the area function, approximated by a concatenation of uniform tubes, is described. This procedure is the tool used in this study to express formant frequencies as a function of the area function (or, more specifically, of the log-area function).

The objective of Chapter 3 is to obtain an efficient parametric representation of the log-area function. The number of uniform sections necessary to obtain a good approximation of the vocal-tract log-area function (32 sections in this study) is considerably larger than the number of dimensions of the articulatory space. More efficient representations can be achieved if each log-area function is expressed as the sum of a few basis functions weighted by a set of coefficients. Since the basis functions (basis vectors, in the discretized case) are fixed, the coefficients alone are enough to represent a given log-area function (log-area vector, in the discretized case). Two parametric representations are used in this work: Fourier analysis and principal component analysis (PCA).

In the first part of Chapter 3, a reasonably good approximation of the log-area function is obtained when it is represented by the first eight terms of its Fourier cosine series expansion. In this case, the set of basis functions is formed by cosine functions. An important property of this representation is that there is a one-to-one relationship between the first three odd coefficients of the Fourier cosine series expansion of the log-area function and the first three formant frequencies determined by it. This property is used in the method proposed here to determine the area function most likely to have produced a given set of formant frequencies. Cosine functions form a set of "general purpose" basis functions which, in principle, do not contain any information about the morphology of the vocal-tract.

In the second part of Chapter 3, a more efficient representation of the log-area vector (from now on, the description is given for the discretized case), which makes use of such information, is obtained by means of principal component analysis (PCA).
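
The link between cosine coefficients of the log-area function and formant frequencies can be explored with a small numerical sketch. It assumes the simplest case mentioned above, a rigid lossless tube built from uniform sections, closed at the glottis and with an ideal open end at the lips. The sound speed, air density, frequency grid and bisection search are assumptions of this sketch, and the function names are hypothetical; the numerical procedure of the thesis, which also covers losses, may differ.

```python
import numpy as np

C_SOUND = 35000.0  # speed of sound in cm/s (assumed value)
RHO = 1.14e-3      # air density in g/cm^3 (assumed value)


def log_area_from_cosine(coeffs, n_sections=32):
    """Log-area vector ln A(x_k) = sum_n a_n cos(n pi x_k / L) at section centres."""
    k = np.arange(n_sections) + 0.5
    n = np.arange(len(coeffs))
    basis = np.cos(np.pi * np.outer(k / n_sections, n))
    return basis @ np.asarray(coeffs, dtype=float)


def k22(freq_hz, areas, length_cm):
    """Chain-matrix element whose zeros are the resonances of the lossless tube."""
    theta = 2.0 * np.pi * freq_hz * (length_cm / len(areas)) / C_SOUND
    chain = np.eye(2, dtype=complex)
    for area in areas:  # sections ordered from glottis to lips
        z = RHO * C_SOUND / area
        section = np.array([[np.cos(theta), 1j * z * np.sin(theta)],
                            [1j * np.sin(theta) / z, np.cos(theta)]])
        chain = chain @ section
    return chain[1, 1].real


def formants(log_areas, length_cm, n_formants=3, f_max=5000.0, step=10.0):
    """Locate the first resonances by a sign-change scan refined with bisection."""
    areas = np.exp(log_areas)
    freqs = np.arange(step / 2.0, f_max, step)
    vals = np.array([k22(f, areas, length_cm) for f in freqs])
    found = []
    for i in np.nonzero(vals[:-1] * vals[1:] < 0)[0]:
        lo, hi, f_lo = freqs[i], freqs[i + 1], vals[i]
        for _ in range(40):
            mid = 0.5 * (lo + hi)
            f_mid = k22(mid, areas, length_cm)
            if f_mid * f_lo > 0:
                lo, f_lo = mid, f_mid
            else:
                hi = mid
        found.append(0.5 * (lo + hi))
        if len(found) == n_formants:
            break
    return np.array(found)


# Uniform 17.5 cm tube: only the 0th-order coefficient is nonzero.
uniform = log_area_from_cosine([np.log(3.0)])
f_uniform = formants(uniform, 17.5)

# Adding a first-order (odd) cosine term moves the first formant.
perturbed = log_area_from_cosine([np.log(3.0), 0.5])
f_perturbed = formants(perturbed, 17.5)
```

For the uniform tube the resonances fall at odd multiples of c/(4L), which gives a simple check of the tube computation before coefficients are varied.
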
In the PCA representation, the set of basis vectors is formed by the eigenvectors associated with the five largest eigenvalues of the covariance matrix of log-area vectors, which is estimated using all vectors present in the corpus described in Chapter 2. After the PCA procedure is carried out, some transformations are still necessary to find a representation for the log-area vectors which exhibits a one-to-one mapping between a subset of its components and the corresponding sets of formant frequencies (as in the case of the Fourier representation). In this procedure, independent component analysis (ICA) is first used to find affine transformations which reduce the dependence among the principal components used to represent the articulatory space, and among the formant frequencies (represented in log-scale) used to represent the acoustic space. Next, singular value decomposition (SVD) is used to find rotations of both articulatory and acoustic spaces so that each component of the acoustic space is subject to the major influence of one and only one component of the articulatory space. In turn, each component of the articulatory space has major influence on at most one component of the acoustic space.

When either the PCA or the Fourier representation is used, it is observed that the mapping of the articulatory space onto the acoustic space, albeit nonlinear, has a strongly linear characteristic. The Fourier representation is useful to study analytical aspects of the articulatory-to-acoustic mapping, whereas the PCA representation is optimal from a statistical point of view.

The last part of Chapter 3 takes advantage of the fact that the vocal-tract moves continuously in time, and so do the articulatory parameters. Based on this fact, it is possible to parametrize the trajectories of articulatory components and to form the concept of an articulatory trajectory space. This procedure is successfully carried out using either the Fourier or the PCA approach. In this temporal parametrization, the basis functions (of time) obtained with PCA have basically the shape of cosine functions. In the tests performed, sequences of ten frames were well represented as linear combinations of four basis functions. At the end of Chapter 3, sequences of log-area vectors can be represented in a very compact parametric form. As a quantitative example, the log-area vectors contained in a sequence of 10 frames (200 ms of speech) can be represented by only 20 parameters.

Chapter 4 focuses on the main objective of this work: the inversion of the articulatory-to-acoustic mapping or, equivalently, the acoustic-to-articulatory mapping.
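
The PCA step of Chapter 3 can be illustrated with a short sketch. The dimensions follow the text (519 vectors, 32 log-areas plus the tract length, five components), but the data below are random numbers generated only for the illustration, not the corpus, and the function names are hypothetical.

```python
import numpy as np


def fit_pca(Y, n_components=5):
    """Mean, leading eigenvectors and eigenvalues of the covariance of the rows of Y."""
    mean = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]


def encode(Y, mean, basis):
    """Project log-area vectors onto the retained eigenvectors."""
    return (Y - mean) @ basis


def decode(coeffs, mean, basis):
    """Rebuild log-area vectors from their principal component coefficients."""
    return coeffs @ basis.T + mean


# Synthetic stand-in for the corpus: 519 vectors of 32 log-areas plus tract length,
# generated from 5 latent factors (random data, for illustration only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(519, 5))
mixing = rng.normal(size=(5, 33))
Y = latent @ mixing + rng.normal(size=33)

mean, basis, eigvals = fit_pca(Y, n_components=5)
Y_hat = decode(encode(Y, mean, basis), mean, basis)
```

Because the synthetic vectors are built from exactly five latent factors, five components rebuild them almost exactly; with real log-area vectors the reconstruction would only be approximate.
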
The acoustic-to-articulatory mapping is, as already stated, one-to-many, since the same set of formant frequencies can be produced by an infinite number of area functions. In other words, a given point in the acoustic space is associated with an entire subspace in the articulatory space. The number of dimensions of this subspace is equal to the number of dimensions of the articulatory space minus the number of dimensions of the acoustic space. Among all the points contained in the articulatory space mapping onto a given point in the acoustic space, it is possible to look for the point that can be reached with minimum effort by the vocal-tract. This procedure is carried out by representing the vocal-tract articulatory effort as a quadratic cost function and, subsequently, finding the point of minimum cost among all points in the articulatory space that map onto a given point in the acoustic space. The same procedure can be used in the case of articulatory trajectories. In this case, the problem is to find the trajectory of minimum cost among all points in the articulatory trajectory space that map onto a given point in the acoustic trajectory space. The mathematical formulation of this problem results in a nonlinear system which is numerically solved using a Newton-Raphson procedure. A stable solution was always achieved, taking three iterations on average to converge. At the end of Chapter 4 the method used to estimate area functions from formant frequencies is complete.

In Chapter 5 the method is tested and the results obtained are analyzed. The 258 area vectors of the corpus that correspond to oral vowels were used for this purpose. Isolated articulatory positions were analyzed first. In this case, continuity constraints are not imposed. Using the numerical procedure described in Chapter 2, the first three formant frequencies were extracted from the transfer function of each area vector under analysis. These sets of formant frequencies were then used to estimate the area vectors from which they had been extracted. A comparison of the original area vectors with those estimated from the formant frequencies shows that the inversion procedure works satisfactorily for most of the analyzed cases. Nevertheless, despite the good agreement observed between practically all the acoustic transfer functions derived from original and estimated vectors, large articulatory distortions were observed in some cases. Part of these distortions was considerably reduced when continuity constraints were imposed, i.e. when articulatory trajectories were estimated instead of isolated articulatory positions.
The distortions that remained can be mainly attributed to the quadratic cost function adopted to represent the vocal-tract articulatory effort. The quadratic cost allows an efficient mathematical solution, but does not have a physiological meaning. Still in the scope of Chapter 5, it is interesting to note that the obtained transfer functions were derived only from the formant frequencies, and that a good spectral matching was nevertheless obtained. It is therefore possible to say that, if morphological information is available, the vocal-tract transfer function can be estimated from the formant frequencies. It remains to be shown whether human beings make use of such redundancy and, if so, in what way.
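
The minimum-cost inversion summarized for Chapter 4 can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the forward map `forward` is a hypothetical, mildly nonlinear function from three articulatory components to two acoustic components, the weight matrix is the identity, and the loop is a simple Newton-type iteration that re-linearizes the map at each step. It is not the formulation used in the thesis.

```python
import numpy as np

# Linear part of the toy map (hypothetical numbers).
W = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4]])


def forward(x):
    """Toy articulatory-to-acoustic map: mostly linear, with a small nonlinear term."""
    return W @ x + 0.05 * np.array([x[0] * x[1], x[2] ** 2])


def jacobian(x):
    """Analytic Jacobian of the toy map."""
    return W + 0.05 * np.array([[x[1], x[0], 0.0],
                                [0.0, 0.0, 2.0 * x[2]]])


def invert(target, H, n_iter=50, tol=1e-12):
    """Find x minimizing 0.5 * x^T H x subject to forward(x) = target."""
    H_inv = np.linalg.inv(H)
    x = np.zeros(H.shape[0])
    for _ in range(n_iter):
        J = jacobian(x)
        # Linearized constraint: forward(x) + J (x_new - x) = target.
        b = target - forward(x) + J @ x
        # Closed-form minimizer of the quadratic cost under J x_new = b.
        x_new = H_inv @ J.T @ np.linalg.solve(J @ H_inv @ J.T, b)
        # Stop when the largest component of the update is negligible.
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return x


x_true = np.array([0.8, -0.4, 0.5])
target = forward(x_true)
x_est = invert(target, np.eye(3))
```

The estimate reproduces the acoustic target while having a lower cost than `x_true`, which mirrors the behaviour reported above: a good acoustic match does not guarantee that the original articulatory position is recovered.
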

Chapter 6 concludes the study. In summary, it is possible to say that morphological, positional, and continuity constraints can be efficiently combined in the analysis of the acoustic-to-articulatory mapping during speech. A method to combine these constraints was described and tested. Correlation coefficients of 0.83 were found in the articulatory domain. In the acoustic domain, the correlation coefficients found were above 0.999, confirming that the acoustic constraints were respected. As a final note about the future, there is still a lot to be done to improve the model. A physiologically meaningful function to measure the cost of vocal-tract positions must be obtained to reduce the discrepancies observed. Also important is the development of a better representation of the acoustic space, so that the model can be applied to sounds other than oral vowels.

Acknowledgements

This work could not have been completed without the support received from several people. First of all, I would like to thank Prof. Fumitada Itakura, my PhD advisor, for guiding me through the PhD course, for teaching me many important things, and for giving me the freedom necessary to learn several more at Nagoya University. I would also like to thank Prof. Noboru Ohnishi for analyzing and discussing the contents of this study. In the same way, I thank Prof. Kazuya Takeda for interesting comments and suggestions about this work, and for helping me with important problems that I would not have been able to solve alone.

At Nagoya University I very much appreciated studying together with Shoji Kajita, who was always by my side during the PhD program, helped me every time I needed it, and shared with me all the happy and difficult moments along these years. I would also like to thank Hong Wang, now at PictureTel, who was always a good friend during the years we were together in Nagoya; and Motohiko Yada, from whom I always received all possible cooperation and encouragement.

I cannot forget the guidance received previously from Osvaldo Catsumi Imamura and Prof. Fernando Toshinori Sakane, my MSc advisors, and from Prof. Marcos Botelho, my personal counselor, who taught me how to learn during the time I was at the Instituto Tecnologico de Aeronautica in Brazil.

Also fundamental for this work was the support received from Shinji Maeda and Rafael Laboissiere, from whom I received most of the data used as the basis for this research, and with whom I had several fruitful discussions.

I am grateful for having had the opportunity of carrying out experiments at NTT Basic Research Laboratories in Atsugi, Japan. There I had the chance of working with several good scientists (and persons). In particular, I would like to thank Masaaki Honda, Tokihiko Kaburagi and Takemi Mochida for their perfect collaboration during the time I was at NTT.

Currently, I am with ATR-HIP (in Kyoto, Japan), where I have found an incredibly fertile environment for scientific studies. In particular, I would like to thank Yoh'ichi Tohkura for his constant support and encouragement; Eric Bateson and Mark Tiede for the several discussions we had together and for the cooperation on the development of our research goals; Tatsuya Nomura and Shinobu Masaki for translating the abstract of the thesis into Japanese; and Erik McDermott for sweating over his dissertation at the same time I was sweating over mine.

I also cannot forget my friends in Brazil, Japan, and all over the world, who were always ready to collaborate when I asked, and even when I did not ask. Friendship is undoubtedly the most precious thing in the world. I hope we have the opportunity to help each other many more times in the future.

Finally, but most importantly, I am extremely grateful to all members of my family for all the guidance, encouragement and support that they have been giving me throughout my life. In particular, I thank my wife, Ana Helena da Costa Fragoso Yehia, for the years of my life that she made happier; my parents, Camille Hani Yehia and Tamira Hamdan Yehia, for their unconditional encouragement since I was born; and my sister Aline, my brother Salim, and my aunt Badiah, who were also by my side during all these years.

I apologize for not citing explicitly all the names I wanted to, including the scientists whose works formed the theoretical basis for this study. I hope I can return at least partially all the things I have received through these years.

List of Symbols

$x$: Distance from the glottis along the tract.
$A(x)$: Cross-sectional area function.
$d(x)$: Midsagittal distance.
$\alpha(x)$: Proportional coefficient used in the $\alpha\beta$ model.
$\beta(x)$: Exponential coefficient used in the $\alpha\beta$ model.
$y_i$: Log-area vector of frame $i$.
$A_{ki}$: Area of the $k$-th uniform section used to approximate the area function of frame $i$.
$L_i$: Vocal-tract length of frame $i$.
$y_{ki}$: Natural logarithm of $A_{ki}$, $k = 1, \ldots, K$; $y_{(K+1)i} = L_i$.
$P$: Number of vectors present in the corpus.
$K$: Number of uniform sections used to approximate the area function.
$P$: Amplitude of the sound pressure.
$U$: Amplitude of the volume velocity.
$\omega$: Angular frequency of the sound pressure and volume velocity.
$c$: Velocity of sound inside the vocal-tract.
$\rho$: Density of air inside the vocal-tract.
m-th eigenvalue of Webster's horn equation.
$F_m$: m-th formant frequency.
$a$: Effective lip radius.
Angular frequency of the m-th formant for a tract with yielding walls.
Angular frequency of the m-th formant for a tract with rigid walls.
$\omega_0$: Lowest angular resonance frequency when the tract is closed at both ends.
$L_w$: Mass of the tract walls per unit length.
$C_w$: Compliance of the tract walls per unit length.
$R_w$: Resistance of the tract walls per unit length.
$a$: Ratio of wall resistance to mass.
$b$: Squared angular frequency of the mechanical resonance.
$c_1$: Correction for thermal conductivity and viscosity.
$l$: Section length.
$M$: Number of formants used as acoustic parameters.
$N$: Number of articulatory components.
$a_n$: n-th coefficient of the Fourier cosine series expansion of the area function.
$a$: Vector of Fourier cosine coefficients.
$U_{ay}$: Matrix of cosines used to transform $a$ into $y$.
$C_y$: Covariance matrix of log-area vectors.
Mean log-area vector.
$S$: Diagonal matrix containing the eigenvalues of $C_y$ in decreasing order.
$U$: Unitary matrix whose columns contain the normalized eigenvectors of $C_y$.
Vector of principal component articulatory coefficients.
Mean of the Q vectors used to "fill" the articulatory space.
Matrix used to transform into y.
Vector of independent component articulatory coefficients.
Matrix used to transform into .
$f$: Vector formed by the first three formant log-frequencies associated with a given area function.
Mean of the Q vectors used to "fill" the acoustic space.
$g$: Vector of independent component acoustic coefficients.
$T_{fg}$: Matrix used to transform $f$ into $g$.
Matrix that approximates g as a linear transformation of .
$G$: Matrix containing an ensemble of vectors $g$.
$B$: Matrix containing an ensemble of vectors .
$Q$: Number of vectors contained in the ensembles $G$ and $B$.
"Pseudo diagonal" matrix containing the eigenvalues of T g T g^t.
$U_{hg}$: Unitary matrix containing the normalized eigenvectors of T g T g^t.
Unitary matrix containing the normalized eigenvectors of T g^t T g.
Vector of articulatory variables used in the principal component representation.
$h$: Vector of acoustic variables used in the principal component representation.
Matrix used to transform y into .
Mean of the Q log-area vectors used to "fill" the articulatory space.
$T_{fh}$: Matrix used to transform $f$ into $h$.
Matrix of correlation coefficients between h and .
Matrix of correlation coefficients between f and .
$H$: Matrix containing an ensemble of Q acoustic vectors $h$.
Matrix containing an ensemble of Q articulatory vectors .
Column vector containing the standard deviation of h.
Column vector containing the standard deviation of .
$p$: Number of frames contained in a given articulatory trajectory.
$q$: Number of coefficient vectors necessary to parametrize an articulatory trajectory of $p$ frames.
Matrix containing a sequence of p articulatory vectors, starting at frame i.
"Covariance" matrix of .
$P'$: Number of sequences of length $p$ contained in the corpus.
Diagonal matrix containing the eigenvalues of C .
$V$: Matrix whose columns are the normalized eigenvectors of C .
Principal component representation of .
Matrix containing the first q columns of V.
$Y_i$: Matrix containing a sequence of $p$ log-area vectors, starting at frame $i$.
Quadratic positional cost of a given articulatory vector .
$P_a$: Quadratic positional cost of a given articulatory vector $a$.
$P_y$: Quadratic positional cost of a given log-area vector $y$.
Weight matrix containing information about the morphology of the vocal-tract (principal component representation case).
$H_a$: Weight matrix containing information about the morphology of the vocal-tract (Fourier representation case).
$H_y$: Weight matrix containing information about the morphology of the vocal-tract (log-area representation case).
Covariance matrix of the articulatory vectors .
$C_P$: Average of the costs of all articulatory vectors in the corpus.
Variation in the acoustic vector h.
Variation in the articulatory vector .
Jacobian matrix defined by dh/d .
Vector containing the first $M$ components of .
Vector containing the last $N - M$ components of .
Jacobian matrix defined by ∂h/∂ 1.
Jacobian matrix defined by ∂h/∂ 2.
Articulatory vector corresponding to the neutral position of the vocal-tract.
d /d 2.
$I_{N-M}$: Identity matrix of order $N - M$.
0 t H .
Matrix formed by the combination of a and p .
Vector of variations containing acoustic and minimum effort targets.
$\|\cdot\|$: Norm function defined by the largest component (absolute value) of a vector.
"Matrix of matrices" containing the locally linear relation between a sequence of articulatory variations and their acoustic and cost counterparts.
Sequence of articulatory variations i vertically arranged as a column vector.
Sequence of vectors hi, vertically arranged as a column vector.
$Np \times Nq$ matrix whose nonzero elements $v_{ij}$ are the entries of V . Each row of V is repeated $N$ times.
Columns of a variation i rearranged in a column vector.
$E$: $Np \times 1$ column vector containing the approximation error between X and H.
The squared error E^t H E .
Weight matrix used in the computation of the squared error .
Neutral trajectory determined by the vocal-tract sustaining the neutral position along the analyzed interval.

Contents

Abstract
Acknowledgements
List of Symbols

1 Introduction

2 Area Function and Formant Frequencies
    2.1 Corpus
        2.1.1 Sampling vocal-tract profiles
        2.1.2 From profiles to midsagittal distances
        2.1.3 From midsagittal distances to area function
        2.1.4 Log-area function
    2.2 Computation of the formant frequencies
        2.2.1 Lossless tube
        2.2.2 Lossy tube
        2.2.3 Numerical determination of formant frequencies
        2.2.4 Comparison of lossless and lossy models

3 Parametric Models for the Area Function
    3.1 Fourier Analysis
        3.1.1 Truncation Effects
        3.1.2 Formant frequencies as functions of Fourier coefficients
    3.2 Statistical Analysis
        3.2.1 Principal Component Analysis
        3.2.2 Independent component analysis
        3.2.3 Singular Value Decomposition
    3.3 Temporal Analysis

4 The Inverse Problem
    4.1 Isolated frames
        4.1.1 Representing Morphological Constraints
        4.1.2 Solving the inverse problem
    4.2 Trajectories

5 Results and Discussion
    5.1 Isolated Frames
    5.2 Trajectories
    5.3 Quantitative Analysis

6 Conclusion

A Numeric Information
B Results for Isolated Frames
Bibliography
List of Publications

List of Tables

2.1 Sentences contained in the corpus
5.1 Numerical Results: Areas
5.2 Numerical Results: Length
5.3 Numerical Results: Formants

List of Figures

2.1 Spectral properties of the speech signal determined by di erent properties

of the vocal apparatus. Only the vocal-tract shape has major in uence on the formant frequencies. : : : : : : : : : : : : : : : : : : : : : : : : :

7

2.2 Left: midsagittal pro le tracing extracted from cineradiographic frame and

labial tracing extracted from video frame recorded synchronously. Center: points sampled from the grid and from the labial region to represent the pro le. Right: Pro le approximated by sampled points. : : : : : : : : : : given section. The section length is determined by the distance between the intersections of the bisection of the angle formed by the lines de ned by the section walls with the grid lines that delimit the section. The midsagittal distance is de ned as the distance between the intersections of the line passing through the midpoint of the segment that determines the section length, and orthogonal to it, with the section walls. Section walls and grid lines are represented by thick and thin solid lines, respectively. In the case shown in the left, the grid lines are part of a Cartesian region of the grid, whereas in the case shown in the right, the grid lines belong to the polar region of the grid. : : : : : : : : : : : : : : : : : : : : : : : : function of the distance from the glottis. Center: area function estimated with the model. Bottom: Log-area function sampled at uniformly spaced points along the vocal-tract. : : : : : : : : : : : : : : : : : : : :

9

2.3 Procedure used to determine the midsagittal distance and the length of a

10

2.4 Top: midsagittal distances of the pro le shown in Fig. 2.2 plotted as a

11

xviii

2.5 Comparison between vocal-tract transfer function computed from the area

function estimated from midsagittal distances, and spectral envelope of the speech recorded synchronously. For the example shown on the left, the spectral envelope obtained from the pre-emphasized speech signal (gray line) has its formants (peak frequencies) matching fairly well the formants of the transfer function derived from the area function (black line). However, in the case shown on the right, due to the relatively low energy of the speech signal, the colored noise generated by the experimental apparatus produces a spurious peak around 1.3 kHz, considerably affecting the estimated spectral envelope.

2.6 On each column, the second panel from the top shows a comparison of the transfer functions estimated from lossy (black solid line) and lossless (dashed line) models of the vocal-tract area function. The gray line represents the speech power spectrum envelope. The speech signal, area function, midsagittal distance and vocal-tract profile are also given as references. Note that there is little discrepancy between lossy and lossless formant frequencies.

2.7 Formant frequency variation due to yielding walls and radiation load. Each chart shows a histogram of the relative deviation of formants derived from a lossless vocal-tract model with respect to formants derived from a lossy model: $(F_{\text{lossless}} - F_{\text{lossy}})/F_{\text{lossy}}$. The ordinates show the percentage of points in the corpus whose relative deviation falls within the abscissa interval under a given bin.

3.1 Area function for the French /i/, and the same area approximated by truncated Fourier cosine series expansions (thick lines), compared with the original area (thin lines).

3.2 Formant frequencies as a function of the number of Fourier cosine coefficients used to represent the area function of the French /i/. (Note that N includes the 0th-order term.)

3.3 First three formants as functions of the first 6 Fourier cosine coefficients. In each graph, all other coefficients are kept equal to zero. A histogram of the coefficients of the log-area functions of the corpus analyzed is plotted above each chart.

3.4 First and second formant frequencies as functions of the first and third Fourier cosine coefficients $a_1$ and $a_3$, when all other coefficients are equal to zero.

3.5 Eigenvalues of the log-area covariance matrix.

3.6 Eigenvectors corresponding to the first 5 eigenvalues obtained from the decomposition of the log-area covariance matrix. All eigenvectors are normalized to have unit Euclidean norm. The first K = 32 components correspond to the log-area along the tract, and the last component corresponds to the tract length. The corresponding eigenvalue square root is given as a reference to the "importance" of each eigenvector.

3.7 Area function approximations by Fourier cosine series expansion (dashed line), and by statistically optimum eigenvalue expansion (solid line). The thick solid line shows the original area. Above: expansion with 3 components. Below: expansion with 5 components.

3.8 Vocal-tract length, lip area, and alveopalatal area trajectories along the sentence (in French) "Ma chemise est roussie." The dashed lines show the original measured trajectories, while the solid lines show the trajectories parametrized by the model proposed here. For each case, the mean and the standard deviation values of the relative difference (in percentage) between parametrized and original trajectories are also shown.

3.9 (a) Parametric subspace determined by the first two components of the parameter vector. (b) Points corresponding to realistic area functions (articulatory space). (c) The same points shown in a coordinate system with "less dependent" components.

3.10 (a) Normalized histograms of the first 3 formant log-frequencies corresponding to an articulatory space filled with approximately uniformly distributed points. (b) Histograms of the variables obtained after independent component analysis (ICA) of the formant log-frequencies. (c) and (d) Scatter plots of the first 2 variables shown in (a) and (b), respectively.

3.11 Basis vectors (rows of $T_y$) and mean vector used to represent log-area vectors ($y$). Units: the first K = 32 components are log-areas along the tract expressed in log(cm²). The last component is the tract length expressed in normalized units (1 unit = 0.53 cm for the basis vectors and 1 unit = 20 × 0.53 cm for the mean vector).

3.12 Scatter plots representing the joint distributions of the components of the acoustic variable $h$ and the components of the articulatory variable. Note the high correlation between the first articulatory component and $h_1$, and between the second articulatory component and $h_2$. See also the nonlinear relation between the first articulatory component and $h_3$.

3.13 First two acoustic components ($h_1$ and $h_2$) expressed as functions of the first two articulatory components, when all other components (the third, fourth, and fifth) are equal to zero. Note that $h_1$ is almost independent of the second articulatory component, and that there are one-to-one relationships between $h_1$ and the first articulatory component, and between $h_2$ and the second.

3.14 Articulatory and acoustic component trajectories along the sentence (in French) "Ma chemise est roussie." Note the similarity between the first two articulatory trajectories and the first two acoustic trajectories. (The dashed lines in the acoustic trajectories indicate the intervals where the formants cannot be reliably extracted from the speech signal due to very narrow constrictions in the area function.)

3.15 Eigenvalues of the "covariance" matrix of sequences of parametrized log-area vectors, and corresponding first four eigenvectors.

3.16 (a) Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the (French) sentence "Louis pense à ça." (b) Sequence of areas reconstructed from the parametric principal component representation of the original areas shown in (a). (c) Sequence of areas reconstructed from the parametric Fourier representation of the original areas shown in (a). (d), (e) and (f) show formant frequency trajectories corresponding to the sequences of areas shown in (a), (b) and (c), respectively. The dashed lines shown in (e) and (f) are the original formant trajectories shown in (d). For each pair of formant trajectories, the maximum relative difference (in percentage) is also shown.

4.1 Representation of the one-to-one relationship between the (N − M)-dimensional subspaces that form the N-dimensional articulatory space. Compare with Figures 4.2 and 4.3, where the level curves are (N − M) = 1 dimensional subspaces (contained in an N = 2 dimensional articulatory space) which map onto an M = 1 dimensional acoustic space.

4.2 Top left: the first formant $F_1$ as a function of the Fourier cosine coefficients $a_1$ and $a_2$, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function $P_a = a^t H_a a$ used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed circles represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix).

4.3 Top left: the first acoustic component $h_1$ as a function of the first two principal component coefficients, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the quadratic cost function used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed ellipses represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix).

5.1 Results obtained with the inversion technique for isolated frames. In each column, the central panel shows original (thin line) and estimated area (thick line). The estimated area is obtained from the formant frequencies determined by the original area. Vocal-tract profile, midsagittal distances, transfer functions and speech signal are also shown for reference purposes. From left to right the columns correspond to the neutral, and French /a/, /i/, and /u/ vowels.

5.2 Problems with the inversion procedure. The columns show the following cases. Left: French /a/ with excessively open lips. Center-left: French /u/ with excessively large front cavity. Center-right: French /i/ with excessively short length. Right: French /e/ with excessively closed lips compensating underestimated length. As in Fig. 5.1, in the central panel of each column, the thin line is the original area and the thick line is the area estimated from the formant frequencies determined by the original area.

5.3 Top: Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the French sentence "Louis pense à ça." Bottom: Formant frequency trajectories corresponding to the sequences of areas shown in the top panel.

5.4 Top: Sequence of areas estimated from the formant trajectories shown in the bottom panel of Fig. 5.3, under continuity and minimum effort constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

5.5 Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: the areas are represented by the first six components of their Fourier cosine series expansion with the even coefficients set to zero. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

5.6 Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: the areas are represented by the first nine components of their Fourier cosine series expansion determined under morphological and continuity constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequences of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

5.7 (a) Scatter plot of the cross-sectional areas obtained from the parametric principal component representation of the original areas plotted against their original counterparts. The scatter plot of the formant frequencies derived from the areas is shown in (e). (b) Cross-sectional areas estimated from formant vectors in the case of isolated frames plotted against original areas. The formant frequencies derived from the areas are plotted in (f). (c) Cross-sectional areas estimated from formant vector trajectories plotted against original areas. The formant frequencies derived from the areas are plotted in (g). (d) Cross-sectional areas estimated from isolated frames plotted against areas estimated from formant vector trajectories. The formant frequencies derived from the areas are plotted in (h). The correlation coefficients are given in the top right corner. The 258 oral vowel frames available in the corpus were used to generate the scatter plots.

Chapter 1 Introduction

"A voice cannot carry the tongue and the lips that gave it wings. Alone must it seek the ether." Gibran Kahlil Gibran (1883-1931), The Prophet

The speech production process is the result of the combination of articulatory movements acting on and interacting with the air flowing from the lungs to the mouth. The part of the acoustic effects of such actions and interactions that is radiated (mainly) through the lips and nostrils constitutes the speech signal (Flanagan, 1972). An efficient representation of speech is fundamental for obtaining good performance in applications such as speech synthesis, coding, and recognition (Rabiner and Schafer, 1978). While parametrization of the speech waveform itself is often sufficient, it does not explicitly give speech features, such as spectral envelope and pitch information, that are important in, for example, speech recognition systems (Rabiner and Juang, 1993). Moreover, waveform parametrization allows only a limited degree of compression of the speech signal (Jayant and Noll, 1984).


Higher levels of compression, as well as a clearer representation of meaningful speech acoustic parameters, can be achieved by modeling the physical process that generates the speech signal. Today, most of the methods following this line are based on linear prediction theory (Markel and Gray, 1976). Such methods are very efficient in parametrizing short intervals (frames) of speech. However, since speech acoustic parameters can vary abruptly, smoothness constraints cannot, in general, be imposed. If such constraints were possible, they could be invoked to improve the accuracy of parameter estimation (especially under adverse conditions), to attain even higher compression levels, and to simplify models for speech synthesis.

Although smoothness constraints cannot, in general, be imposed on acoustic parameters, they can be used successfully to characterize vocal-tract articulatory movements (Perkell, 1969). Therefore, assuming that articulatory synthesis (synthesis of speech based on vocal-tract configuration parameters) is a goal that can, in principle, be achieved (Maeda, 1982; Sondhi and Schroeter, 1987; Scully, 1990; Lin, 1990), articulatory parameters can be very useful for an efficient representation of the speech process (Flanagan, 1972).

If speech is to be represented by articulatory parameters, then, besides developing methods to generate speech from such parameters (the direct problem), it is necessary to be able to estimate the vocal-tract configuration from the speech signal (the inverse problem). This includes determination of subglottal and glottal conditions (voice source), vocal-tract shape and losses, and radiation load. This study focuses on the estimation of the vocal-tract shape, which is the primary determinant of the formant structure of the speech signal (Fant, 1980).
The extraction of geometrical characteristics of the vocal-tract from its acoustic features has been discussed in several previous studies. Schroeder (1967) analytically described the relationship between the singularities (poles and zeros) of the vocal-tract admittance measured at the lips and the vocal-tract cross-sectional log-area function (represented by its Fourier cosine series expansion). The analysis was performed for variations within the limit of applicability of first-order perturbation theory. For larger variations, Mermelstein (1967) developed a numerical procedure to estimate the area function (parametrized by the first 6 coefficients of its Fourier cosine series expansion) from the admittance singularities. He showed that the formant frequencies, which correspond to the admittance poles, are not sufficient to uniquely determine the log-area function. The remaining necessary information can be obtained from the admittance zeros but, unfortunately, these cannot be estimated from the speech signal. Schroeder (1967) then developed an experimental apparatus to measure the vocal-tract admittance at the lips and, using a frequency domain approach, was able to determine good approximations for the area function. However, the problem of estimating the area function from the speech signal still remained to be solved.

With the advent of linear prediction theory applied to speech (Itakura and Saito, 1968; Atal and Hanauer, 1971), Wakita (1973, 1979) developed an inverse filtering technique to estimate the vocal-tract area function from the speech waveform. However, that technique makes use of information about voice source, loss distribution, tract length, and lip radiation that cannot be assumed to be accurately known a priori. In fact, Sondhi (1979) showed that the speech signal alone does not contain enough information for a unique determination of the vocal-tract area function, confirming the conclusions of Mermelstein (1967) and Schroeder (1967).

Thus, on the one hand, in order to achieve a practical system of articulatory speech representation, it is necessary to obtain the vocal-tract shape from the speech signal. On the other hand, the speech signal itself does not contain enough information to uniquely determine such a shape. Therefore, it is necessary to constrain the universe of possible tract configurations, so that the problem can be efficiently solved. Since the speech signal is assumed to be produced by a human vocal-tract, the human physiology can be invoked as a natural possibility for a constraint formulation.
In other words, vocal-tract data obtained from acoustic (Sondhi and Resnick, 1983; Yehia et al., 1995a, 1995b; Yehia and Itakura, 1995a), X-ray (Bothorel et al., 1986), magnetic resonance imaging (MRI) (Baer et al., 1991; Tiede et al., 1996; Tiede and Yehia, 1996; Yehia and Tiede, 1997), electromagnetic midsagittal articulometer (EMMA) (Perkell et al., 1992; Yehia et al., 1996), or any other kind of tract measurement can be used (directly or indirectly) as prior information for the vocal-tract shape estimation from the speech signal.

Following this line, various frameworks have been formulated to combine the acoustic information contained in the speech signal with the constraints determined by the human physiology. A computer sorting technique followed by a fine optimization procedure was used by Atal et al. (1978) and, in a more elaborate model, by Schroeter and Sondhi (1991). Model matching techniques were used by Flanagan et al. (1980) and by Shirai and Kobayashi (1986). Shirai (1993) also proposed a neural network approach for the estimation of articulatory motion. Another connectionist approach, making use of a control theory framework (whose principle was first proposed by Jordan, 1990), was presented by Bailly et al. (1991). As a final example, McGowan (1994) considered the use of genetic algorithms, obtaining interesting results for the dynamic case of the inverse problem. (A much more complete description of the techniques developed to estimate vocal-tract shapes from the speech signal can be found in Schroeter and Sondhi, 1994.)

The approaches cited above have two points in common. The first is that, during the optimization procedure, acoustic and shape parameters are represented in distinct spaces. This fact, besides resulting in a large number of optimization parameters, often leads to the problem of a cost function with local minima (Schroeter and Sondhi, 1991). The second point is that an explicit articulatory model is always used. The problem here is that the design of articulatory models is normally oriented toward the direct problem of speech production (Coker and Fujimura, 1966; Mermelstein, 1973; Coker, 1976; Maeda, 1990). Although such models can be successfully used in the inverse problem, problems of redundancy and ambiguity may occur (Gupta and Schroeter, 1993).

Within the present study, these two problems are avoided: the first by mapping acoustic and shape parameters into a common space, and the second by using only the statistical behavior of the vocal-tract, rather than an explicit articulatory model, to formulate a cost function (Yehia and Itakura, 1996; Yehia, Takeda and Itakura, 1996).
In the case of the method proposed here, which can be included in the model matching category, the following series of steps must be carried out:

- Acquisition of vocal-tract morphological, dynamic and acoustic information.
- Parametrization of the articulatory and acoustic spaces.
- Representation of the mapping from the articulatory space onto the acoustic space.
- Formulation of a cost function which quantifies morphological and dynamic constraints.
- Combination of acoustic, morphological and dynamic information to solve the inverse problem.

These topics are addressed one by one in the following chapters. The main point is the mapping from the articulatory onto the acoustic space. These spaces are appropriately represented such that the resulting mapping is simple enough to support a one-to-one approximately linear relationship between the components of a subspace of the articulatory space and the corresponding components of the acoustic space. This fact is then exploited to find a plausible solution for the restricted case of the inverse problem considered here. The results obtained are then evaluated and interpreted.

**Chapter 2 Area Function and Formant Frequencies**

"Everything flows." Heraclitus (c. 535-475 BC), On Nature

The area function and formant frequencies play an important role in the study of speech production: they form a bridge between articulatory configurations of the vocal-tract and acoustic characteristics of speech. Formant frequencies are primarily determined by the vocal-tract shape, with little influence from other articulatory factors (Flanagan, 1972, pp. 58-69). This is in contrast with other spectral properties of the speech signal, over which factors other than the vocal-tract shape can also have considerable influence (see Fig. 2.1). For the case of plane wave sound propagation, the cross-sectional area along the vocal-tract (the area function) is the geometric property of the vocal-tract shape that determines the formant frequencies. But the converse does not hold; that is, the formant frequencies do not in turn uniquely determine the area function.

In this chapter, the relationship between area function and formant frequencies will be analyzed. The objective is to determine the amount of information about the area function contained in the formant frequencies, and to characterize the mapping between the spaces formed by the area function and by the formant frequencies. These pieces of information are important in obtaining a consistent base to solve the inverse problem.

[Figure 2.1 diagram. Properties of the vocal apparatus: vocal-tract shape, wall and viscous losses, radiation load, excitation pulse. Spectral properties of the speech signal: formant frequencies, formant bandwidths, spectral tilt, harmonic structure, energy.]

**Figure 2.1:** Spectral properties of the speech signal determined by different properties of the vocal apparatus. Only the vocal-tract shape has major influence on the formant frequencies.


2.1 Corpus

The corpus used in this study consists of cineradiographic data described in Bothorel et al. (1986). (More details about the capabilities and limitations inherent to X-ray measurements of the vocal-tract can be found in Fant (1970), and in Chiba and Kajiyama (1941).) The procedure used to estimate the area functions from the cineradiography is described here. The analysis starts from digital tracings of midsagittal profiles corresponding to ten French sentences (listed in Table 2.1) as spoken by a female subject (PB), acquired at a rate of 50 frames per second. The corresponding labiograms were acquired simultaneously. A sample is shown in Figure 2.2.

Table 2.1: Sentences contained in the corpus

- Ma chemise est roussie.
- Voilà des bougies.
- Donne un petit coup.
- Une réponse ambiguë.
- Louis pense à ça.
- Mets tes beaux habits.
- Une pâte à choux.
- Prête-lui seize écus.
- Chevalier du gué.
- Il fume son tabac.

**2.1.1 Sampling vocal-tract profiles**

In order to convert the original profiles into area functions, a sequence of transformations is necessary. First, each midsagittal profile is plotted on a semipolar grid, using the hard palate as reference (Heinz and Stevens, 1964; Maeda, 1990). The grid lines are spaced by 0.5 cm in the linear regions, and by 11 degrees in the polar region. Lip and larynx regions are specified in a special manner. The lips are modeled by a uniform elliptical tube whose shape is determined from the labiogram, and whose length (protrusion) is defined as the distance from the upper incisors to the point of minimum separation between upper and lower lips. The laryngeal cavity is modeled as a trapezoidal tube (with circular cross-section) defined by the two points where the tract profile intersects the sixth line of the semipolar grid, and by the lowest left and right points of the tract profile. This procedure, illustrated in the central panel of Fig. 2.2, allows the representation of all profiles by the same number of points. In the present case, there are 29 pairs of points, each pair containing a point on the anterior wall and a point on the posterior wall of the tract. The approximation of the profile by the segments joining these points is shown in the right panel of Fig. 2.2.

[Figure 2.2 panels, frame PB0146: Digitized Profile, Sampled Points, Regenerated Profile; axes in cm.]

**Figure 2.2:** Left: midsagittal profile tracing extracted from cineradiographic frame and labial tracing extracted from video frame recorded synchronously. Center: points sampled from the grid and from the labial region to represent the profile. Right: profile approximated by sampled points.

**2.1.2 From profiles to midsagittal distances**

The next step is to represent the set of points plotted on each midsagittal profile by the corresponding midsagittal distances, plotted as a function of the position along the vocal-tract. If the midsagittal distance is interpreted as the distance between the points where an ideal longitudinal sound wavefront propagating in the tract "touches" the anterior and posterior walls of the profile, then an appropriate geometric procedure to represent each section of the vocal-tract by a midsagittal distance and a section length is as follows. Each of the 28 sections defined by the 29 pairs of points sampled from a given profile is seen as part of an infinite conical horn (or a cylindrical horn, if the walls are parallel). The direction of propagation of the wavefront in this "horn" is determined by the line that bisects the angle formed by the lines containing the anterior and posterior wall profiles of the section (see Fig. 2.3). The intersection points of this line with the grid lines that define the section determine a segment whose length is taken as the section length. Finally, the intersection points of the line orthogonal to this segment passing through its midpoint with the anterior and posterior wall profiles determine a segment whose length is taken as the midsagittal distance of the section. The top panel of Figure 2.4 shows the midsagittal distance plotted as a function of the distance from the glottis. The vocal-tract length is taken as the sum of the section lengths of all sections. This procedure follows the same physical principle adopted in Maeda (1972), but with a different geometrical construction. Alternative procedures can be found, for example, in Fant (1960), Beautemps et al. (1995), and Maeda (1990).

[Figure 2.3 labels: grid lines, sagittal distances, section lengths.]

**Figure 2.3:** Procedure used to determine the midsagittal distance and the length of a given section. The section length is determined by the distance between the intersections of the bisector of the angle formed by the lines defined by the section walls with the grid lines that delimit the section. The midsagittal distance is defined as the distance between the intersections of the line passing through the midpoint of the segment that determines the section length, and orthogonal to it, with the section walls. Section walls and grid lines are represented by thick and thin solid lines, respectively. In the case shown on the left, the grid lines are part of a Cartesian region of the grid, whereas in the case shown on the right, the grid lines belong to the polar region of the grid.
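The geometric construction described in this section and illustrated in Fig. 2.3 can be sketched numerically as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`intersect`, `section_geometry`) and the representation of a section by four 2-D points (the anterior and posterior wall points on the two grid lines that delimit the section) are hypothetical.

```python
import numpy as np

def intersect(p, d, q, e):
    """Intersection of the 2-D lines p + t*d and q + s*e (assumed non-parallel)."""
    t, _ = np.linalg.solve(np.column_stack((d, -e)), q - p)
    return p + t * d

def section_geometry(a1, p1, a2, p2, tol=1e-9):
    """Return (section length, midsagittal distance) for one section.

    a1, p1: anterior and posterior wall points on the first grid line.
    a2, p2: anterior and posterior wall points on the second grid line.
    (Hypothetical interface chosen for this sketch.)
    """
    a1, p1, a2, p2 = (np.asarray(v, dtype=float) for v in (a1, p1, a2, p2))
    ua = (a2 - a1) / np.linalg.norm(a2 - a1)   # anterior wall direction
    up = (p2 - p1) / np.linalg.norm(p2 - p1)   # posterior wall direction
    cross = ua[0] * up[1] - ua[1] * up[0]
    if abs(cross) < tol:
        # Parallel walls (cylindrical horn): the axis is the midline.
        origin, axis = 0.5 * (a1 + p1), ua
    else:
        # Conical horn: the axis bisects the angle at the apex of the walls.
        origin = intersect(a1, ua, p1, up)
        axis = (ua + up) / np.linalg.norm(ua + up)
    # Section length: segment of the axis cut off by the two grid lines.
    c1 = intersect(origin, axis, a1, p1 - a1)
    c2 = intersect(origin, axis, a2, p2 - a2)
    length = np.linalg.norm(c2 - c1)
    # Midsagittal distance: perpendicular through the midpoint, wall to wall.
    mid = 0.5 * (c1 + c2)
    normal = np.array([-axis[1], axis[0]])
    qa = intersect(mid, normal, a1, ua)
    qp = intersect(mid, normal, p1, up)
    return length, np.linalg.norm(qa - qp)
```

For a section with parallel walls one unit apart and grid lines two units apart, `section_geometry((0, 1), (0, 0), (2, 1), (2, 0))` gives a section length of 2 and a midsagittal distance of 1. Summing the section lengths over all sections would then give the vocal-tract length, as stated in the text.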

[Figure 2.4 panels, from top to bottom: Midsagittal Distance along the Vocal-Tract (cm); Cross-Sectional Area along the Vocal-Tract (cm²); Log-Area Function along the Vocal-Tract (ln(cm²)). Abscissa: distance from glottis (cm).]

**Figure 2.4:** Top: midsagittal distances of the profile shown in Fig. 2.2 plotted as a function of the distance from the glottis. Center: area function estimated with the model. Bottom: log-area function sampled at uniformly spaced points along the vocal-tract.


**2.1.3 From midsagittal distances to area function**

The midsagittal distances are now transformed into the corresponding cross-sectional areas (central panel of Fig. 2.4), using the "αβ model" originally proposed by Heinz and Stevens (1964)

$$A(x) = \alpha(x)\, d(x)^{\beta(x)}, \qquad (2.1)$$

where $x$ is the distance from the glottis, $A(x)$ and $d(x)$ are respectively the cross-sectional area and the midsagittal distance at $x$, and $\alpha(x)$ and $\beta(x)$ are coefficients determined in an ad hoc manner. The values used here are taken from Maeda (1990).

The problem with such a transformation is that the two-dimensional information contained in the midsagittal profile is not enough to obtain an accurate estimation of the area function. Therefore, except for the lip region, where the labiogram provides the necessary information, there may exist non-negligible discrepancies between the real and the estimated area functions. Even when a more elaborate model, such as those proposed by Perrier et al. (1992) and by Beautemps et al. (1995), is used, it is impossible to eliminate the discrepancies. This is the main reason why, for a given frame, the formants extracted from the speech signal do not match exactly those numerically derived from the estimated area function (see Fig. 2.5). Other reasons for this mismatch are errors in formant measurement from speech, and inaccuracies in the physical model used to calculate the formants from the area function.

In order to avoid such discrepancies, the corpus of formant frequencies used in this and in the next chapters consists of the formants numerically derived from the corpus of estimated area functions (see Section 2.2.3), and not of those extracted from the corresponding speech signal. By doing so, it is assured that the inaccuracies that will appear in the results shown in the next chapter are inherent in the method proposed to solve the inverse problem, and do not depend on the factors above. Admittedly, in order to work with real speech, it is necessary to analyze such factors, but this task will not be carried out here.
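As a small illustration of Eq. (2.1), the sketch below converts midsagittal distances into cross-sectional areas. The function name `alpha_beta_area` is hypothetical, and the coefficient values are purely illustrative placeholders; they are not the values of Maeda (1990) used in this study.

```python
import numpy as np

def alpha_beta_area(d, alpha, beta):
    """Cross-sectional area A(x) = alpha(x) * d(x)**beta(x), as in Eq. (2.1)."""
    d = np.asarray(d, dtype=float)
    return np.asarray(alpha, dtype=float) * d ** np.asarray(beta, dtype=float)

# Hypothetical midsagittal distances (cm) and coefficients for three sections.
d = np.array([0.5, 1.0, 2.0])
alpha = np.array([1.5, 1.5, 2.0])   # illustrative values only
beta = np.array([1.5, 1.5, 1.2])    # illustrative values only
area = alpha_beta_area(d, alpha, beta)   # areas in cm^2
```

Because the coefficients vary along the tract, they are passed as arrays with one entry per section; scalar coefficients would also work through NumPy broadcasting.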

**2.1.4 Log-area function**

Instead of working directly with the cross-sectional area along the tract, each area is transformed into a log-area vector, as shown in the bottom panel of Fig. 2.4. This transformation not only assures that the area function will always be positive, but is also more meaningful from an acoustic point of view, since the area ratios, rather than their values, determine the formant frequencies (see, for example, Mermelstein (1967)). The transformation is simply the natural logarithm of the area, with areas smaller than a given threshold (5 mm² in this study) being clipped to the threshold value to avoid numerical problems with closures. Such clipping, however, does not lead to significant inaccuracy from either an articulatory or acoustic point of view. In the case of vowels, it was observed that the minimum area is not less than 25 mm², and that, even in the case of fricative sounds, the minimum area is not less than 15 mm². Basically, areas are smaller than the 5 mm² threshold only in the case of closures.

[Figure 2.5 panels, frames PB1549 (left) and PB1559 (right), from top to bottom: speech signal versus time (ms); power spectrum (dB) versus frequency (kHz); area function (cm²) versus distance from glottis (cm); midsagittal distances (cm) versus distance from glottis (cm); vocal-tract profile (cm).]

**Figure 2.5:** Comparison between vocal-tract transfer function computed from the area function estimated from midsagittal distances, and spectral envelope of the speech recorded synchronously. For the example shown on the left, the spectral envelope obtained from the pre-emphasized speech signal (gray line) has its formants (peak frequencies) matching fairly well the formants of the transfer function derived from the area function (black line). However, in the case shown on the right, due to the relatively low energy of the speech signal, the colored noise generated by the experimental apparatus produces a spurious peak around 1.3 kHz, considerably affecting the estimated spectral envelope.

Due to the procedure used to plot the midsagittal profiles on the semipolar grid, the 29 points shown in the top and central panels of Fig. 2.4 are not evenly spaced. Even spacing allows a simpler representation of the cross-sectional area along the tract, since it can then be described by a vector of log-areas plus the vocal-tract length. For this reason, using linear interpolation, the log-area along the vocal-tract is resampled so that it can be represented by K = 32 uniform sections, as illustrated by the stair-step graph shown in the bottom panel of Fig. 2.4. Each log-area function present in the corpus, when approximated by a concatenation of uniform tubes of equal length, can now be represented by a vector containing the natural logarithm of the section areas and the tract length. In this study, the following notation will be used

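$$
\mathbf{y}_i = \left[\, y_{1i} \;\cdots\; y_{Ki} \;\; y_{K+1,i} \,\right]^t, \qquad i = 1, \ldots, P,
$$

$$
y_{ki} = \ln A_i\!\left( \left(k - \tfrac{1}{2}\right) \frac{L_i}{K} \right), \quad k = 1, \ldots, K; \qquad y_{K+1,i} = L_i, \tag{2.2}
$$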

where $L_i$ is the tract length of frame $i$, expressed in units normalized so that the variance of $y_{K+1}$ is equal to the largest variance of the first $K$ components of $\mathbf{y}$; $A_i(x)$ is the cross-sectional area of frame $i$ at distance $x$ from the glottis; $K = 32$ is the number of uniform sections present in each area function; and $P = 519$ is the number of vectors present in the corpus.
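The construction of one log-area vector (clipping, logarithm, linear interpolation at the section midpoints) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `log_area_vector`, the input layout (measurement positions in cm, starting at the glottis) and the use of cm² for areas are assumptions, and the variance-based normalization of the tract length is left out.

```python
import math


def log_area_vector(positions, areas, num_sections=32, min_area=0.05):
    """Return [ln A at K section midpoints ..., tract length] as in Eq. (2.2).

    positions: increasing distances from the glottis (cm); the first point is
               taken as the glottis and the last as the lips (assumption).
    areas:     cross-sectional areas (cm^2) measured at those positions.
    min_area:  clipping threshold; 0.05 cm^2 corresponds to 5 mm^2.
    """
    length = positions[-1] - positions[0]
    # Clip small areas, then work in the log-area domain.
    log_areas = [math.log(max(a, min_area)) for a in areas]

    vector = []
    j = 0  # index of the measurement interval that contains the midpoint
    for k in range(1, num_sections + 1):
        x = positions[0] + (k - 0.5) * length / num_sections
        while j < len(positions) - 2 and positions[j + 1] < x:
            j += 1
        x0, x1 = positions[j], positions[j + 1]
        w = (x - x0) / (x1 - x0)
        # Linear interpolation of the log-area at the section midpoint.
        vector.append((1.0 - w) * log_areas[j] + w * log_areas[j + 1])

    vector.append(length)  # component K + 1 (not normalized here)
    return vector
```

For instance, `log_area_vector(positions, areas)` with the 29 unevenly spaced points of one frame would return a list of 33 values: 32 log-areas followed by the tract length.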


2.2 Computation of the formant frequencies

In order to study the relationship between speech acoustic and vocal-tract geometric (articulatory) parameters, the first step is to understand the physical basis of the process.

2.2.1 Lossless tube

A very simplified model of the vocal-tract is a rigid, lossless tube whose cross-sectional area varies along its length. If, moreover, sound is assumed to propagate in longitudinal plane waves, the acoustic pressure inside the tract is governed by Webster's horn equation (Webster, 1919; Eisner, 1966)

$$
\frac{d}{dx}\left[ A(x)\,\frac{dP}{dx} \right] + \lambda A(x) P = 0, \qquad \lambda = \frac{\omega^2}{c^2}, \tag{2.3}
$$

with boundary conditions

$$
\left. \frac{dP}{dx} \right|_{x=0} = 0 \tag{2.4}
$$

representing a closed glottis, and

$$
P(L) = 0 \tag{2.5}
$$

representing open lips without radiation load. Here, $x$ is the distance from the glottis along the tract, $P$ and $\omega$ are respectively the amplitude and the angular frequency of the sound pressure (for a sinusoidal time dependence), $A(x)$ is the cross-sectional area, $L$ is the tract length, and $c$ is the velocity of sound in the air (inside the vocal-tract). The formant frequencies are defined as the resonant frequencies (or eigenfrequencies) of the tract

$$
F_m = \frac{c \sqrt{\lambda_m}}{2\pi}, \tag{2.6}
$$

where $\lambda_m$ is the m-th eigenvalue of Eq. (2.3) under the boundary conditions defined by Eq. (2.4) and Eq. (2.5). Therefore, for a given vocal-tract length $L$ and a given set of boundary conditions, the formant frequencies are basically determined by the cross-sectional area function $A(x)$. In fact, rewriting Eq. (2.3) as

$$
\frac{d^2 P}{dx^2} + \frac{d[\ln A(x)]}{dx}\,\frac{dP}{dx} + \lambda P = 0, \tag{2.7}
$$

it is possible to see that the eigenvalues, and hence the formant frequencies, depend on the logarithm of the area rather than on the area function itself (as already stated).
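As a quick check of the lossless model, the uniform tube is a case that can be solved by hand. The derivation below is only an illustration: the tract length $L = 17.5$ cm is an assumed value, not one taken from the corpus, and $c = 350$ m/s is the approximate sound velocity quoted in Section 2.2.2.

```latex
% Uniform tube: A(x) = A_0, so d[ln A(x)]/dx = 0 and Webster's equation reduces to
\frac{d^2 P}{dx^2} + \lambda P = 0 .
% The closed-glottis condition dP/dx = 0 at x = 0 selects
P(x) = \cos\!\left(\sqrt{\lambda}\,x\right),
% and the open-lips condition P(L) = 0 then requires \cos(\sqrt{\lambda}\,L) = 0:
\sqrt{\lambda_m} = \frac{(2m-1)\,\pi}{2L}, \qquad m = 1, 2, \ldots
% Substituting in F_m = c\sqrt{\lambda_m}/(2\pi):
F_m = \frac{(2m-1)\,c}{4L}.
% With L = 17.5 cm and c = 350 m/s (assumed values):
F_1 = 500~\mathrm{Hz}, \qquad F_2 = 1500~\mathrm{Hz}, \qquad F_3 = 2500~\mathrm{Hz}.
```

These are the familiar quarter-wavelength resonances of a tube closed at one end and open at the other; note that the area $A_0$ drops out, in agreement with the statement that only area ratios matter.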


At this point, it is interesting to return to the works of Schroeder (1967) and Mermelstein (1967) mentioned in the introduction, and explain them in terms of Webster's equation. Schroeder (1967) showed, within the limits of first-order perturbation analysis, that the m-th formant frequency is directly related to, and only to, the m-th odd coefficient (the coefficient of order $2m-1$) of the Fourier cosine series of the log-area function. So, even if the whole (infinite) set of formant frequencies were known, only "half of the information" (odd terms) necessary to determine the area function would be available. The remaining information could be obtained from the complete set of eigenvalues obtained under the boundary conditions

$$
\left. \frac{dP}{dx} \right|_{x=0} = 0 \quad \text{and} \quad \left. \frac{dP}{dx} \right|_{x=L} = 0, \tag{2.8}
$$

representing a closed glottis and closed lips, respectively. Under these conditions, to first-order perturbation theory, the m-th eigenvalue is linearly related to, and only to, the m-th even coefficient of the Fourier cosine series of the log-area function. Mermelstein (1967) then verified experimentally that, even for larger variations, the (finite) set composed of the first 2M coefficients of the Fourier cosine series expansion of the log-area function has a one-to-one relationship with the set composed of the first M eigenvalues obtained under the closed-glottis and open-lips condition, together with the first M eigenvalues obtained under the closed-glottis and closed-lips condition. However, as already stated in the introduction, the latter set of eigenvalues, which corresponds to the admittance zeros at the lips, cannot be obtained from the speech signal.

2.2.2 Lossy tube

Webster's equation describes the relationship between formant frequencies and the log-area function for a highly simplified vocal-tract model. When a more realistic model is considered, factors like nonplanar wave propagation, viscous and thermal losses, glottal impedance, radiation load, and yielding walls must be taken into account. Nevertheless, it is interesting to note that, although all those factors do affect the spectrum of speech produced by the tract, most of them have almost no influence on the formant frequencies. Moreover, the factors that influence the formant frequencies can have their effects approximately compensated for by simple transformations.


Briefly examining these factors one by one, it is possible to say that:

(i) Considering that the cross-sectional dimensions of the tract normally do not exceed 4 or 5 cm, and since $c \simeq 350$ m/s inside the tract, there are no transversal resonance modes below about 3.5 to 4 kHz. Therefore, at least for the first 3 formant frequencies (which are normally below 3.5 kHz), the plane wave propagation assumption is valid (Sondhi, 1974).

(ii) Viscous and thermal losses do affect the formant bandwidths, but have little effect on the formant frequencies (Flanagan, 1972, pp. 58–61).

(iii) The glottal boundary condition has a strong influence on the spectral tilt and on the lower formant bandwidths, but its effect on the formant frequencies is of the order of 1% (for voiced speech). Thus, if only the formant frequencies are to be considered, the closed glottis is a good approximation for the glottal boundary condition, at least in the case of vowels (Flanagan, 1972, pp. 63–65).

(iv) The approximate effect of the radiation load at the lips on the formant frequencies is to lower them by a factor of

$$
\frac{3\pi L}{3\pi L + 8a}, \qquad a = \sqrt{\frac{A(L)}{\pi}},
$$

where $a$ is the effective lip radius and $L$ is the tract length (Flanagan, 1972, pp. 61–63).

(v) Finally, the effect of wall vibration on the formant frequencies is to increase them. Such an effect becomes weaker for higher frequencies, and can be approximately expressed by

$$
\hat{\omega}_m^2 \simeq (406\pi)^2 + \omega_m^2, \tag{2.9}
$$

where $\hat{\omega}_m$ and $\omega_m$ are the angular frequencies of the m-th formant of a tract with flexible and rigid walls, respectively (Sondhi, 1974).1

More elaborate models do exist (Maeda, 1982; Sondhi and Schroeter, 1987), as seen in the next section, but the main point here is that formant frequencies and the log-area function are basically related by Eq. (2.7). Such a relationship (well analyzed in Fant, 1980), relatively independent of other factors, justifies using the formant frequencies to parametrize the acoustics of the speech signal, and the log-area function to represent the geometry of the vocal-tract. The point here is that the formant frequencies can be determined from the log-area function, even when a lossy model is considered. This is in contrast with other acoustic parameters present in the speech signal, such as formant bandwidths and spectral tilt, which result from the combination of the tract shape with other factors, such as glottal excitation, yielding walls, and radiation load.2 However, as already stated, the formant frequencies do not contain enough information to uniquely determine the area function. In Chapter 3, information about the vocal-tract structure (morphology) will be used to reduce the ambiguity that arises from the lack of information in the formant frequencies about the area function.

1 In his work, Sondhi derived an equation in the same format as Webster's equation, but taking into account yielding walls and viscous and thermal losses. It was shown that the formant frequencies of a lossy model and those of a lossless model are approximately related by Eq. (2.9).
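The two approximate compensations listed in items (iv) and (v) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, lengths are in cm and areas in cm², and applying the radiation factor first and the wall correction second is an assumed order that the text does not prescribe.

```python
import math

WALL_ANGULAR_FREQ = 406.0 * math.pi  # rad/s, the constant appearing in Eq. (2.9)


def radiation_factor(tract_length, lip_area):
    """Factor 3*pi*L / (3*pi*L + 8a) by which the radiation load lowers formants."""
    lip_radius = math.sqrt(lip_area / math.pi)  # effective lip radius a
    return 3.0 * math.pi * tract_length / (3.0 * math.pi * tract_length + 8.0 * lip_radius)


def wall_corrected(freq_hz):
    """Apply Eq. (2.9): rigid-wall formant (Hz) -> flexible-wall formant (Hz)."""
    omega = 2.0 * math.pi * freq_hz
    return math.sqrt(WALL_ANGULAR_FREQ ** 2 + omega ** 2) / (2.0 * math.pi)


def compensate(lossless_formants, tract_length, lip_area):
    """Approximate lossy formants from lossless ones (assumed order: radiation, then walls)."""
    k = radiation_factor(tract_length, lip_area)
    return [wall_corrected(k * f) for f in lossless_formants]
```

Since $406\pi$ rad/s corresponds to 203 Hz, the wall correction matters mostly for the first formant: a 500 Hz rigid-wall resonance moves by roughly 40 Hz, whereas a 2500 Hz resonance moves by less than 10 Hz.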

2.2.3 Numerical determination of formant frequencies

Although solutions exist for particular cases (Salmon, 1946; Eisner, 1966), there is no analytical solution of Webster's equation (Eq. 2.3) when the area $A(x)$ along the vocal-tract is an arbitrary function. For this reason, numerical procedures are used to determine the formant frequencies (eigenfrequencies) associated with a given area function.3 The method adopted here approximates the vocal-tract by a concatenation of uniform lossy tubes (Kelly and Lochbaum, 1962; Sondhi and Schroeter, 1987; Schroeter and Sondhi, 1991). For the particular case of a uniform tube,4 the (one-dimensional) sound wave equation can be solved analytically. From the solution, it is possible to express, in the frequency domain, pressure and volume velocity at one end of the tube as the product of a matrix by pressure and volume velocity at the other end. For the $K$ uniform sections that approximate the vocal-tract, this relation can be written as

$$
\begin{bmatrix} P_{k-1} \\ U_{k-1} \end{bmatrix}
=
\begin{bmatrix} \mathcal{A}_k & \mathcal{B}_k \\ \mathcal{C}_k & \mathcal{D}_k \end{bmatrix}
\begin{bmatrix} P_k \\ U_k \end{bmatrix}
= \mathbf{K}_k \begin{bmatrix} P_k \\ U_k \end{bmatrix},
\qquad k = 1, \ldots, K, \tag{2.10}
$$

where $P_k$, $U_k$, $P_{k-1}$ and $U_{k-1}$ are pressure and volume velocity at the section ends closer to the lips and closer to the glottis, respectively. Using the model for losses and yielding walls described in Sondhi (1974) and in Sondhi and Schroeter (1987), the entries of matrix $\mathbf{K}_k$ are given by

$$
\mathcal{A}_k = \cosh\!\left(\frac{\sigma l}{c}\right), \qquad
\mathcal{B}_k = \frac{\rho c}{A_k}\,\gamma \sinh\!\left(\frac{\sigma l}{c}\right), \qquad
\mathcal{C}_k = \frac{A_k}{\rho c}\,\frac{1}{\gamma} \sinh\!\left(\frac{\sigma l}{c}\right), \qquad
\mathcal{D}_k = \cosh\!\left(\frac{\sigma l}{c}\right), \tag{2.11}
$$

where

$$
\gamma = \sqrt{\frac{\alpha + j\omega}{\beta + j\omega}}, \tag{2.12}
$$

$$
\sigma = \sqrt{(\alpha + j\omega)(\beta + j\omega)}, \tag{2.13}
$$

$$
\alpha = \sqrt{j\omega c_1}, \tag{2.14}
$$

$$
\beta = \frac{j\omega\,\omega_0^2}{(j\omega + a)\,j\omega + b} + \alpha, \tag{2.15}
$$

$$
\omega_0^2 = \frac{c^2 \rho}{A_k L_w}, \tag{2.16}
$$

$$
a = \frac{R_w}{L_w}, \tag{2.17}
$$

$$
b = \frac{1}{L_w C_w}, \tag{2.18}
$$

where $L_w$, $C_w$ and $R_w$ are mass, compliance and resistance of the tract walls per unit length; $a$ is the ratio of wall resistance to mass; $b$ is the squared angular frequency of the mechanical resonance; $c_1$ is the correction for thermal conductivity and viscosity; $\omega_0$ is the lowest angular resonance frequency of the tract when it is closed at both ends; $c$ is the sound velocity inside the tract; $\rho$ is the air density; $\omega$ is the angular frequency; $A_k$ is the cross-sectional area of the section; and $l = L/K$ is the section length, equal to the vocal-tract length $L$ divided by the number of sections $K$. The numerical values used are

$$
\begin{aligned}
c &= 3.5 \times 10^{4} && \text{cm/s} \\
\rho &= 1.14 \times 10^{-3} && \text{g/cm}^3 \\
a &= 130\pi && \text{rad/s} \\
b &= (30\pi)^2 && (\text{rad/s})^2 \\
c_1 &= 4 && \text{rad/s} \\
\omega_0^2 &= (406\pi)^2 && (\text{rad/s})^2 .
\end{aligned}
$$

2 The use of additional acoustic information, such as formant bandwidths, is, in principle, very difficult to handle. This is so because, to the author's knowledge, losses due to the vocal-tract and to the glottal source cannot be well separated. Moreover, the correct estimation of the bandwidths is a task considerably more difficult to carry out than the estimation of formant frequencies (which, sometimes, are also difficult to estimate, as in the case of the high-pitched voices of female and child speakers).

3 Variational and perturbation methods were also tested (Yehia and Itakura, 1993b). It was found that the range of applicability of perturbation analysis does not cover the entire articulatory vowel space. Variational analysis proved to be more robust, but at a computational cost higher than that of the numerical procedure adopted here.

4 In Välimäki and Karjalainen (1994), the interesting alternative approach of conical sections is analyzed.


Now, if sound pressure and volume velocity at the lips ($P_K$ and $U_K$) are known, it is possible to determine pressure and volume velocity at the glottis ($P_0$ and $U_0$) as

$$
\begin{bmatrix} P_0 \\ U_0 \end{bmatrix}
= \left( \prod_{k=1}^{K} \mathbf{K}_k \right)
\begin{bmatrix} P_K \\ U_K \end{bmatrix}. \tag{2.19}
$$

The formant frequencies are obtained by finding the maxima of the vocal-tract transfer function defined by $20 \log_{10} |U_K / U_0|$. In order to compute this transfer function, the sound pressure at the lips is expressed as

$$
P_K = Z_r U_K, \tag{2.20}
$$

where $Z_r$ is the output impedance determined by the radiation load. The model used to represent the load is an RL circuit in parallel (Flanagan, 1972, pp. 36–38) with

$$
R_r = \frac{128\,\rho c}{(3\pi)^2 A_K} \tag{2.21}
$$

and

$$
L_r = \frac{8 \sqrt{A_K/\pi}\;\rho c}{3\pi c\, A_K}. \tag{2.22}
$$

In the final step, the volume velocity at the lips is arbitrarily set to $U_K = 1$, and Eq. (2.19) is used to obtain the volume velocity at the glottis $U_0$. The transfer function is then given by $-20 \log_{10} |U_0|$. As already stated, the formant frequencies are determined by the maxima of the transfer function, and can be found by numeric search.
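The procedure of this section (chain matrices of lossy uniform sections, radiation load at the lips, peak search on the transfer function) can be sketched as follows. It is a minimal illustration rather than the implementation used in the study: the function names, the 5 Hz search grid and the simple local-maximum test are assumptions, and $\sigma$ is computed as $\gamma(\beta + j\omega)$, which is algebraically the square root of $(\alpha + j\omega)(\beta + j\omega)$ and keeps its sign consistent with $\gamma$.

```python
import cmath
import math

C = 3.5e4                      # sound velocity (cm/s)
RHO = 1.14e-3                  # air density (g/cm^3)
WALL_A = 130.0 * math.pi       # a: wall resistance / mass (rad/s)
WALL_B = (30.0 * math.pi) ** 2     # b: squared wall resonance ((rad/s)^2)
C1 = 4.0                       # c1: viscosity / heat conduction correction (rad/s)
W0_SQ = (406.0 * math.pi) ** 2     # omega_0^2 ((rad/s)^2)


def section_matrix(area, length, omega):
    """Chain-matrix entries of one uniform lossy section (lips side -> glottis side)."""
    jw = 1j * omega
    alpha = cmath.sqrt(jw * C1)
    beta = jw * W0_SQ / ((jw + WALL_A) * jw + WALL_B) + alpha
    gamma = cmath.sqrt((alpha + jw) / (beta + jw))
    sigma = gamma * (beta + jw)
    arg = sigma * length / C
    ch, sh = cmath.cosh(arg), cmath.sinh(arg)
    z = RHO * C / area
    return ch, z * gamma * sh, sh / (z * gamma), ch


def glottal_volume_velocity(areas, tract_length, freq_hz):
    """U_0 obtained by propagating U_K = 1, P_K = Z_r from the lips to the glottis."""
    omega = 2.0 * math.pi * freq_hz
    l = tract_length / len(areas)
    lip_area = areas[-1]
    r_rad = 128.0 * RHO * C / ((3.0 * math.pi) ** 2 * lip_area)
    l_rad = 8.0 * math.sqrt(lip_area / math.pi) * RHO * C / (3.0 * math.pi * C * lip_area)
    jwl = 1j * omega * l_rad
    z_rad = jwl * r_rad / (r_rad + jwl)      # parallel RL radiation load
    p, u = z_rad, 1.0 + 0.0j
    for area in reversed(areas):             # areas[0] is the section at the glottis
        a, b, c, d = section_matrix(area, l, omega)
        p, u = a * p + b * u, c * p + d * u
    return u


def formant_frequencies(areas, tract_length, f_min=100.0, f_max=5000.0, step=5.0):
    """Frequencies (Hz) of the local maxima of -20 log10 |U_0| on a regular grid."""
    n = int((f_max - f_min) / step) + 1
    freqs = [f_min + i * step for i in range(n)]
    gain = [-20.0 * math.log10(abs(glottal_volume_velocity(areas, tract_length, f)))
            for f in freqs]
    return [freqs[i] for i in range(1, n - 1) if gain[i - 1] < gain[i] >= gain[i + 1]]
```

For a uniform tube, e.g. `formant_frequencies([5.0] * 32, 17.5)`, the peaks should fall close to the quarter-wavelength resonances, shifted by the radiation load and by the yielding walls as described in Section 2.2.2. A finer grid or a local refinement around each peak would be needed for accurate values.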

2.2.4 Comparison of lossless and lossy models

Before proceeding, it is interesting to compare the formant frequencies corresponding to lossy and lossless area functions of the same length and shape. The objective is to find to what degree losses influence formant frequencies for the areas that compose the cineradiographic corpus analyzed here. Figure 2.6 shows detailed information for two frames extracted from the corpus. In the power spectrum panels, the transfer functions obtained from lossless (dashed lines) and lossy models (solid black lines) of the area function shown below are plotted together with the spectral envelope of the speech frame shown above (gray line).5 The vocal-tract profile and midsagittal distance along the tract are also plotted for reference purposes. As expected, losses have a stronger influence on formant bandwidths than on formant frequencies. Note also the imperfect matching between speech spectral envelope peaks and vocal-tract transfer function peaks already discussed in Section 2.1.3. The small deviations observed in the left column of Fig. 2.6 may be explained by inaccuracies in the physical model adopted to describe yielding walls and radiation load. However, in the spectra shown in the right column of Fig. 2.6, the larger deviation observed for the second formant is more likely due to misestimation of the area function.

In order to get qualitative and quantitative information about the effects of losses over the whole corpus, Fig. 2.7 shows histograms of the relative deviation of the first four formant frequencies obtained using a lossless vocal-tract model with respect to formants obtained using a lossy model,

$$
\frac{F_{\text{lossless}} - F_{\text{lossy}}}{F_{\text{lossy}}}.
$$

The figure illustrates that the absence of losses has the effect of increasing the frequencies of the second, third and fourth formants. This is due to the absence of radiation load (see Section 2.2.2). The absence of yielding-wall effects has little effect on the second, third and fourth formants, but tends to substantially lower the first formant, which is also affected by the lack of radiation load.

5 LPC analysis: frame length = 20 ms; LPC order = 12; Hanning window; pre-emphasis coefficient = r(1)/r(0), where r(0) is the energy and r(1) the first correlation coefficient of the signal (Markel and Gray, 1976, p. 216).
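The relative deviation plotted in Fig. 2.7 and its histogram in percent of corpus points can be computed as in the sketch below. The function names, the histogram range and the number of bins are illustrative assumptions; the binning actually used for the figure is not stated in the text.

```python
def relative_deviation(f_lossless, f_lossy):
    """(F_lossless - F_lossy) / F_lossy for one formant of one frame."""
    return (f_lossless - f_lossy) / f_lossy


def histogram_percent(values, lo=-0.3, hi=0.3, num_bins=12):
    """Percentage of values falling in each of num_bins equal bins over [lo, hi)."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against rounding pushing a value into a bin past the last one.
            counts[min(int((v - lo) / width), num_bins - 1)] += 1
    return [100.0 * c / len(values) for c in counts]
```

Applied to one formant over all frames, e.g. `histogram_percent([relative_deviation(a, b) for a, b in pairs])` with a hypothetical list `pairs` of (lossless, lossy) frequencies, this yields the ordinates of one panel of the figure.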

Comment

In this chapter it was seen how to estimate the area function given a vocal-tract profile, and how to compute formant frequencies given an area function. The problem of estimating the area function given a set of formant frequencies is simplified if the area function is expressed by appropriate parameters. This is the target of the next chapter.

Figure 2.6: On each column, the second panel from the top shows a comparison of the transfer functions estimated from lossy (black solid line) and lossless (dashed line) models of the vocal-tract area function. The gray line represents the speech power spectrum envelope. The speech signal, area function, midsagittal distance and vocal-tract profile are also given as references. Note that there is little discrepancy between lossy and lossless formant frequencies. (Columns: frames PB1518 and PB1538. Panels, from top to bottom: Speech Signal versus Time (ms); Power Spectrum (dB) versus Frequency (kHz); Area Function, Area (cm²) versus Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) versus Distance from Glottis (cm); Vocal-Tract Profile (cm).)

Figure 2.7: Formant frequency variation due to yielding walls and radiation load. Each chart shows a histogram of the relative deviation of formants derived from a lossless vocal-tract model with respect to formants derived from a lossy model: $(F_{\text{lossless}} - F_{\text{lossy}})/F_{\text{lossy}}$. The ordinates show the percentage of points in the corpus whose relative deviation falls within the abscissa interval under a given bin. (Panels: Formant #1 to Formant #4; Percentage (%) versus Relative Deviation.)

Chapter 3 Parametric Models for the Area Function

"You people speak in terms of circles and ellipses and regular velocities – simple movements that the mind can grasp – very convenient – but suppose almighty God had taken it into His head to make the stars move like that... (He describes an irregular motion with his finger through the air) ...then where would you be?"
Bertolt Brecht (1898–1956), Galileo

The number of sections necessary to obtain a good approximation of the vocal-tract log-area function by a concatenation of uniform tubes of equal length is considerably larger than the dimension of the space composed by the log-area functions that can be produced by the human vocal-tract. This space, from now on, will be called the articulatory space. Two procedures are analyzed in this chapter to represent it by appropriate components. In the first section, representation by a Fourier cosine series is examined, whereas a parametric statistical representation is seen in the second section. Another point that can be exploited in parametrizing the articulatory space is the fact that the temporal behavior of the area function is subject to continuity constraints. In the last section of this chapter, time sequences of parametrized log-area functions are represented by series expansions.

3.1 Fourier Analysis

The main reason to represent the log-area function by a Fourier cosine series (Davis, 1963, pp. 107–112) is the property pointed out by Mermelstein (1967) that the first M formant frequencies depend mainly on the first 2M terms of the Fourier cosine series expansion of the log-area function. Specifically, the m-th formant frequency depends mainly on the $(2m-1)$-th term. Also, except for some critical cases, it has a one-to-one relationship with this term when all other terms are kept constant. In fact, except for the above-mentioned critical cases, there is a one-to-one relationship between the first M formants and the first M odd terms of the Fourier cosine series expansion of the log-area function (Yehia and Itakura, 1993a; Yehia and Itakura, 1996). For the above reasons, whose importance will become clear along the text, the first $N \geq 2M$ log-area function Fourier cosine coefficients1 will be adopted here to parametrize the geometry of the vocal-tract. Mathematically, the transformations are defined2 here as

$$
\ln A(x) \simeq \sqrt{\frac{2}{L}} \left( \frac{a_0}{\sqrt{2}} + \sum_{n=1}^{N-1} a_n \cos\frac{n\pi x}{L} \right), \tag{3.1}
$$

$$
a_n = \sqrt{\frac{2}{L}} \int_0^L \ln A(x) \cos\frac{n\pi x}{L}\, dx, \tag{3.2}
$$

$$
a_0 = \frac{1}{\sqrt{L}} \int_0^L \ln A(x)\, dx, \tag{3.3}
$$

or, in a discrete form,

$$
\ln A_k \simeq \sqrt{\frac{2}{K}} \left( \frac{a_0}{\sqrt{2}} + \sum_{n=1}^{N-1} a_n c_{nk} \right), \qquad c_{nk} = \cos\!\left[ \frac{n\pi}{K}\left(k - \tfrac{1}{2}\right) \right], \quad k = 1, \ldots, K, \tag{3.4}
$$

$$
a_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} \ln A_k\, c_{nk}, \qquad A_k = A\!\left[ \left(k - \tfrac{1}{2}\right)\frac{L}{K} \right], \quad n = 1, \ldots, N-1, \tag{3.5}
$$

$$
a_0 = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \ln A_k, \tag{3.6}
$$

1 Here, Fourier cosine coefficient means coefficient of the Fourier cosine series, and not cosine coefficient of the Fourier series.

2 This is a convenient definition because of its properties of symmetry.

and, in a practical vector notation, area function and tract length can be put together as (see Eq. 2.2)

$$
\mathbf{y} \simeq \mathbf{U}_{ay}\, \mathbf{a}, \tag{3.7}
$$

$$
\mathbf{a} = \mathbf{U}_{ay}^{t}\, \mathbf{y}, \tag{3.8}
$$

where

$$
\mathbf{a} = [\, a_1, a_3, \ldots, a_{2M-1}, a_0, a_2, \ldots, a_{2M}, \ldots, a_{N-1}, L \,]^t, \tag{3.9}
$$

(The reason for this unusual component order will become clear in Section 4.1.2.)

$$
\mathbf{y} = [\, \ln A_1, \ldots, \ln A_K, L \,]^t, \tag{3.10}
$$

$$
\mathbf{U}_{ay} =
\begin{bmatrix}
\sqrt{\tfrac{2}{K}}\, \mathbf{C} & \mathbf{0} \\
\mathbf{0}^{t} & 1
\end{bmatrix},
\qquad
\mathbf{C} =
\begin{bmatrix}
c_{11} & c_{31} & \cdots & c_{2M-1,1} & \tfrac{1}{\sqrt{2}} & c_{21} & \cdots & c_{2M,1} & \cdots & c_{N-1,1} \\
c_{12} & c_{32} & \cdots & c_{2M-1,2} & \tfrac{1}{\sqrt{2}} & c_{22} & \cdots & c_{2M,2} & \cdots & c_{N-1,2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
c_{1K} & c_{3K} & \cdots & c_{2M-1,K} & \tfrac{1}{\sqrt{2}} & c_{2K} & \cdots & c_{2M,K} & \cdots & c_{N-1,K}
\end{bmatrix}. \tag{3.11}
$$

In this discrete form, $K$ is the number of uniform sections used to approximate the area function. The area of each section is obtained by sampling the continuous area function at the points $x_k = (k - \tfrac{1}{2})\tfrac{L}{K}$, where $L$ is the tract length and $L/K$ is the section length.
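The discrete transformation pair of Eqs. (3.4) to (3.6) can be sketched as below. The function names are assumptions made for the example, and the coefficients are returned in natural order ($a_0, a_1, \ldots, a_{N-1}$), not in the reordered form of the vector notation.

```python
import math


def cosine_coefficients(log_areas, num_terms):
    """Fourier cosine coefficients a_0 ... a_{N-1} of K sampled log-areas."""
    K = len(log_areas)
    coeffs = [sum(log_areas) / math.sqrt(K)]          # a_0, Eq. (3.6)
    for n in range(1, num_terms):                     # a_n, Eq. (3.5)
        s = sum(y * math.cos(n * math.pi * (k - 0.5) / K)
                for k, y in enumerate(log_areas, start=1))
        coeffs.append(math.sqrt(2.0 / K) * s)
    return coeffs


def reconstruct(coeffs, K):
    """Approximate the K log-areas from the first N coefficients, Eq. (3.4)."""
    log_areas = []
    for k in range(1, K + 1):
        s = coeffs[0] / math.sqrt(2.0)
        s += sum(a * math.cos(n * math.pi * (k - 0.5) / K)
                 for n, a in enumerate(coeffs[1:], start=1))
        log_areas.append(math.sqrt(2.0 / K) * s)
    return log_areas
```

With this normalization the basis is orthonormal, so taking $N = K$ terms reproduces the sampled log-area function exactly, and truncating to $N < K$ terms gives the least-squares approximation within the retained terms.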

3.1.1 Truncation Effects

The effects of taking a finite number of Fourier coefficients are illustrated in Figures 3.1 and 3.2 for the French vowel /i/. Figure 3.1 shows the area function represented by a concatenation of K = 32 tubes and, following it, the results obtained by its approximation by N = 9, 8, ..., 3 and 2 Fourier cosine coefficients. It is possible to see in Figure 3.2 that the m-th formant frequency (calculated with the model described in Section 2.2.3) reaches a value close to its true value when the area function is approximated by the first N = 2m terms (including the 0-th order) of the corresponding Fourier cosine series. It is also possible to see that, for an approximation of N = 9 terms, the maximal deviation observed in the figure for the first M = 3 formants is less than 3%, and so, less than the JND (just-noticeable difference) found by Flanagan (1955). Although this limit is exceeded in some frames of the corpus, from now on, when using Fourier representation, the area function will be approximated by the first N = 9 terms of its log-area Fourier cosine series expansion, plus the vocal-tract length.

Figure 3.1: Area function for the French /i/, and the same area approximated by truncated Fourier cosine series expansions (thick lines), compared with the original area (thin lines). (Panels: Original /i/ and N = 9, 8, 7, 6, 5, 4, 3, 2; Area (cm²) versus position (cm).)

3.1.2 Formant frequencies as functions of Fourier coefficients

Formant frequencies can now be interpreted as functions of log-area Fourier coefficients. With the objective of getting a better comprehension of the behavior of these functions, Figure 3.3 shows how the first three formant frequencies vary with respect to each of the first six Fourier cosine coefficients (excluding the zero-th order), when all other coefficients are set to zero. A histogram of the Fourier cosine coefficients of the areas contained in the corpus is plotted above each graph. Note that the first three formant frequencies vary almost linearly with the first three odd Fourier cosine coefficients. The joint influence of the first ($a_1$) and third ($a_3$) Fourier cosine coefficients on the first and second formant frequencies, when all other coefficients are set to zero, is shown in Figure 3.4. $a_1$ and $a_3$ have dominant influence (only) on the first and second formants, respectively.

Figure 3.2: Formant frequencies as a function of the number of Fourier cosine coefficients used to represent the area function of the French /i/. (Note that N includes the 0th-order term.) (Axes: Formant Frequencies (kHz) versus Number of Fourier Coefficients (N), from 1 to 9, followed by the real area.)

3.2 Statistical Analysis

The objective of this section is to find representations for both the log-area (articulatory) space and the formant frequency (acoustic) space so that:

• each space is efficiently represented by a small number of parameters;
• the components of each space are as independent as possible;
• the mapping between both spaces is as simple as possible.

These points will be analyzed one by one in the following sections.

3.2.1 Principal Component Analysis

Articulatory space

The relationship between Fourier cosine coefficients and formant frequencies is indeed interesting; however, for the case of the human vocal-tract, it cannot be said that the parametrization by a truncated Fourier series is optimum. This is so because cosine functions, which are the basis functions of a Fourier cosine series expansion, are "general purpose functions" that, in principle, are not directly related to the vocal-tract morphology.

In this section, Principal Component Analysis (PCA) (Horn, 1985, pp. 411–455) will be used to parametrize the articulatory space by an appropriate number of components (Yehia, Takeda and Itakura, 1996). The procedure is as follows: given the corpus of log-area vectors defined in Eq. (2.2), the corresponding covariance matrix3 is given by

$$
\mathbf{C}_y = \frac{1}{P-1} \sum_{i=1}^{P} [\mathbf{y}_i - \bar{\mathbf{y}}][\mathbf{y}_i - \bar{\mathbf{y}}]^t, \tag{3.12}
$$

where $\bar{\mathbf{y}}$ is the mean log-area vector, and can be expressed as

$$
\mathbf{C}_y = \mathbf{U} \mathbf{S} \mathbf{U}^t, \tag{3.13}
$$

where $\mathbf{S}$ is a diagonal matrix containing the eigenvalues of $\mathbf{C}_y$ in decreasing order, and $\mathbf{U}$ is a unitary matrix whose columns contain the corresponding normalized eigenvectors. The expansion above is a Takagi factorization, which is a singular value decomposition for the particular case of symmetric matrices (Horn, 1985, pp. 201–218). Using the same optimality principle of the Karhunen-Loève transform (Jayant and Noll, 1984, pp. 535–546), $\mathbf{y}$ can then be approximated by

$$
\mathbf{y} \simeq \mathbf{U}_{\beta y}\, \boldsymbol{\beta} + \bar{\mathbf{y}}, \tag{3.14}
$$

with $\boldsymbol{\beta}$ given by

$$
\boldsymbol{\beta} = \mathbf{U}_{\beta y}^{t}\, (\mathbf{y} - \bar{\mathbf{y}}), \tag{3.15}
$$

where $\mathbf{U}_{\beta y}$ is the matrix containing the first $N$ columns of $\mathbf{U}$, i.e. the normalized eigenvectors corresponding to the $N$ largest eigenvalues of $\mathbf{C}_y$. The $K + 1 = 33$ eigenvalues are shown in Fig. 3.5. Note that only the first $N = 5$ eigenvalues have non-negligible values, and that they "explain" more than 92% of the variance of the corpus of log-areas. The eigenvectors associated with the largest $N = 5$ eigenvalues are shown in Fig. 3.6. They will be used in this paper to form a parametric model for the vocal-tract log-area function. Since the components of this model cannot be explicitly interpreted as articulators, it cannot be qualified as an articulatory model (Mermelstein, 1973; Coker, 1976; Maeda, 1990). In spite of that, it is possible to observe in Fig. 3.6 that:

• the first and most important eigenvector is associated with the tongue region;
• the tract length is the dominant component of the second and fifth eigenvectors;
• the lips determine the dominant component of the third eigenvector; and
• the tongue apex is the dominant region of the fourth eigenvector.

Also, note that there is almost no influence of the glottal region on the first three eigenvectors.

In order to illustrate the performance of this representation, Fig. 3.7 shows an area function taken from the corpus (thick line), and its approximations by a truncated Fourier cosine series (dashed line) and by the parametric model described here (thin solid line). Note that, in contrast with the Fourier series representation, the parametric model is able to "capture" the vocal-tract structure. In Fig. 3.8, original trajectories followed by the tract length, by the area at the lips, and by the area of a section in the alveopalatal region are shown by the dashed lines. The corresponding trajectories obtained with the parametric model proposed here are shown by the solid lines. Since the parametric model is derived from the log-area function, the approximation is particularly good for small areas, which are critical from the acoustic point of view.

Figure 3.3: First three formants as functions of the first 6 Fourier cosine coefficients. In each graph, all other coefficients are kept equal to zero. A histogram of the coefficients of the log-area functions of the corpus analyzed is plotted above each chart. (Panels: $a_1$ to $a_6$; Freq. (kHz) versus coefficient value.)

Figure 3.4: First and second formant frequencies as functions of the first and third Fourier cosine coefficients $a_1$ and $a_3$, when all other coefficients are equal to zero. (Panels: First Formant, F1 (kHz), and Second Formant, F2 (kHz), versus $a_1$ and $a_3$.)

Figure 3.5: Eigenvalues of the log-area covariance matrix. (Panels: Log-Area Eigenvalues, with 5% and 1% thresholds, and Log-Area Eigenvalue Sum, with 90%, 95% and 99% thresholds, versus Eigenvalue Number.)

3 All P vectors contained in the corpus were used to compute $\mathbf{C}_y$. The possibility of not including a few vectors in the estimation of $\mathbf{C}_y$ and using them only in the tests carried out in Chapter 5 was considered. However, since by doing so the entries of $\mathbf{U}_{\beta y}$ varied by less than 5%, we opted for using the whole corpus to derive the model as well as to test it. Although this procedure is, admittedly, not the most rigorous, it gives us an extensive base to analyze the model.
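The principal component procedure described in this section (covariance matrix, leading eigenvectors, projection and approximation) can be sketched with the standard library only. Two points are assumptions of this sketch and not the method of the study: the eigenvectors are obtained by power iteration with deflation, a swapped-in technique chosen to avoid external libraries, and all function names are hypothetical. A linear-algebra package would normally be used instead.

```python
import math


def covariance(vectors):
    """Mean vector and unbiased covariance matrix of a list of equal-length vectors."""
    P, D = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / P for d in range(D)]
    centered = [[v[d] - mean[d] for d in range(D)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / (P - 1) for j in range(D)]
           for i in range(D)]
    return mean, cov


def leading_eigenpairs(cov, count, iterations=500):
    """Largest eigenvalues and unit eigenvectors by power iteration with deflation."""
    D = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(count):
        vec = [1.0 / (i + 1) for i in range(D)]   # arbitrary non-symmetric start
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(D)) for i in range(D)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(D))
                    for i in range(D))
        pairs.append((value, vec))
        for i in range(D):                        # deflation: remove the found component
            for j in range(D):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def project(y, mean, basis):
    """Principal components of y: inner products of (y - mean) with each eigenvector."""
    return [sum(u[d] * (y[d] - mean[d]) for d in range(len(y))) for u in basis]


def approximate(components, mean, basis):
    """Approximation of y from its principal components: mean plus weighted eigenvectors."""
    return [mean[d] + sum(x * u[d] for x, u in zip(components, basis))
            for d in range(len(mean))]
```

Power iteration converges slowly when neighbouring eigenvalues are close, and it can miss an eigenvector that happens to be orthogonal to the starting vector, so the sketch is meant to show the structure of the computation rather than to be numerically robust.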

Dimensionality and degrees of freedom

Summarizing, it was shown that vocal-tract log-area vectors can be efficiently represented in an N = 5 dimensional articulatory space. Here, it is interesting to note that most articulatory models are expressed by seven to nine components. This happens because their formulation is oriented to the direct problem of speech production. In that case, it is important to consider the number of degrees of freedom of the vocal apparatus, which is usually larger than the dimension of the articulatory space.

Acoustic Space

To each log-area vector there corresponds one, and only one, set of formant frequencies. Here, the set composed of the first three formant log-frequencies⁴ will be called a formant vector, and the space formed by all formant vectors that can be generated by the vocal-tract will be called the acoustic space. By performing an eigenvalue decomposition on the covariance matrix of the formant vectors (in log-scale) derived from the P = 519 log-area vectors of the corpus, the following eigenvalues were found

$$[\,159 \quad 94 \quad 9\,] \times 10^{-4},$$

the normalized eigenvectors being given by the columns of the matrix (entries shown in magnitude)

$$\begin{bmatrix} 0.933 & 0.360 & 0.006 \\ 0.333 & 0.870 & 0.364 \\ 0.137 & 0.338 & 0.931 \end{bmatrix}. \tag{3.16}$$

⁴ It was verified that the mapping between articulatory and acoustic spaces has a higher degree of linearity when formant frequencies are represented in log-scale.

Figure 3.6: Eigenvectors corresponding to the first 5 eigenvalues obtained from the decomposition of the log-area covariance matrix. All eigenvectors are normalized to have unit Euclidean norm. The first K = 32 components correspond to the log-area along the tract, and the last component corresponds to the tract length. The corresponding eigenvalue square root is given as a reference to the "importance" of each eigenvector. (Panels: eigenvectors #1 to #5, plotted from glottis to lips followed by the length component; square roots of the eigenvalues: 2.66, 1.94, 1.48, 0.89, 0.84.)

Figure 3.7: Area function approximations by Fourier cosine series expansion (dashed line), and by statistically optimum eigenvalue expansion (solid line). The thick solid line shows the original area. Above: expansion with 3 components. Below: expansion with 5 components. (Vowel /i/; axes: area in cm² versus distance from glottis in cm.)

Figure 3.8: Vocal-tract length, lip area, and alveopalatal area trajectories along the sentence (in French): Ma chemise est roussie. The dashed lines show the original measured trajectories, while the solid lines show the trajectories parametrized by the model proposed here. For each case, the mean and the standard deviation values of the relative difference (in percentage) between parametrized and original trajectories are also shown. (Values shown in the panels: length, mean difference 0.07%, standard deviation 0.23%; lip area, mean difference 3%, standard deviation 14%; alveopalatal area, mean difference −7%, standard deviation 7%. Horizontal axis: time in s.)

This means that more than 96% of the total variance can be explained by the first two eigenvalues. For this reason, the possibility of representing the acoustic space in two dimensions was considered. However, since the acoustic information associated with the third eigenvalue can be important for the inverse problem, it was decided to use the first three formant log-frequencies to parametrize a three-dimensional acoustic space.
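The 96% figure follows directly from the eigenvalues quoted above. A minimal sketch of the arithmetic (the variable names are illustrative, not taken from the thesis):

```python
import numpy as np

# Eigenvalues of the formant log-frequency covariance matrix, as quoted in the text.
eigenvalues = np.array([159.0, 94.0, 9.0]) * 1e-4

# Fraction of the total variance carried by each component, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

print(np.round(cumulative, 3))
```

The second entry of `cumulative` is the share of variance explained by the first two eigenvalues.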

Principal components and formant frequencies

Unlike the case of the Fourier cosine coefficients, there is no one-to-one mapping between a subspace determined by a subset of the principal components and the space determined by the formant log-frequencies. Such a mapping is a necessary condition for the solution of the inverse problem described in the next chapter. This obstacle can be overcome by performing appropriate transformations on both the principal components of the log-area function and the formant log-frequencies.

3.2.2 Independent component analysis

The objective of this section is to perform linear transformations on the coordinate systems of both articulatory and acoustic spaces, so that the components of each space become as independent as possible. The final objective is to find a mapping of the articulatory space onto the acoustic space, where each component of the acoustic space is mainly determined by one, and only one, component of the articulatory space. Also, each component of the articulatory space must have major influence on at most one component of the acoustic space. In order to attain this objective, a necessary condition is that the components of each space be as independent as possible.

Articulatory Space

The first step is to find how the articulatory space, defined in the last section, maps onto the acoustic space. To reach this target, firstly, the hyperrectangle defined by the maximum and minimum values of each of the N = 5 components of the parametrized corpus is "filled" with $Q' = 30{,}000$ uniformly distributed points.⁵ Fig. 3.9a illustrates this operation by showing the projection on the subspace defined by $\alpha_1$ and $\alpha_2$. However, not all the points in the hyperrectangle correspond to realistic vocal-tract areas. For this reason, all points in the hyperrectangle that correspond to areas outside the limits defined by the P = 519 areas present in the corpus described in Section 2.1 are discarded.⁶ The remaining Q = 7,285 points are shown in Fig. 3.9b.

After that, the independent component analysis method proposed by Bell and Sejnowski (1995) is applied to these points to find a linear transformation ($T_{\alpha\beta} : \mathbb{R}^5 \to \mathbb{R}^5$) that changes the coordinate system of the articulatory space into a system with statistically "less dependent" components. (The term "less dependent" is used because, in the present case, a simple linear transformation is not enough to obtain a complete decomposition into independent components.) Mathematically, this transformation is written as

$$\beta = T_{\alpha\beta}\,(\alpha - \bar{\alpha}), \tag{3.17}$$

where $\bar{\alpha}$ is the mean of the Q = 7,285 vectors $\alpha$ generated to "fill" the articulatory space. Fig. 3.9c shows the same points shown in Fig. 3.9b, now plotted in the new coordinate system.
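The fill-and-discard step described above can be sketched as follows. This is only an illustration under stated assumptions: the basis `U`, the mean `y_mean` and the corpus coefficients `alpha_corpus` are random placeholders standing in for the thesis data, and the log-area vector is assumed to be obtained from the coefficients by a linear expansion around the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the corpus quantities (the real ones come from the thesis data).
K1, N, P = 33, 5, 519                      # K + 1 log-area components, N coefficients, corpus size
U = np.linalg.qr(rng.normal(size=(K1, N)))[0]   # placeholder orthonormal basis (columns)
y_mean = rng.normal(size=K1)
alpha_corpus = rng.normal(size=(P, N)) * np.array([2.7, 1.9, 1.5, 0.9, 0.8])
y_corpus = alpha_corpus @ U.T + y_mean

# Step 1: fill the hyperrectangle spanned by the corpus coefficients with uniform points.
lo, hi = alpha_corpus.min(axis=0), alpha_corpus.max(axis=0)
alpha = rng.uniform(lo, hi, size=(30_000, N))

# Step 2: keep only the points whose log-area vector stays within the corpus limits.
y = alpha @ U.T + y_mean
y_lo, y_hi = y_corpus.min(axis=0), y_corpus.max(axis=0)
keep = np.all((y >= y_lo) & (y <= y_hi), axis=1)
alpha_kept = alpha[keep]
```

With the real corpus, `alpha_kept` would play the role of the Q points retained for the subsequent analysis; the number retained here depends entirely on the placeholder data.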

Acoustic Space

For a given point $\beta$ in the articulatory space, it is possible to find the corresponding log-area vector y using the following inverse transformation

$$y = U_{\alpha y}\,(T_{\alpha\beta}^{-1}\,\beta + \bar{\alpha}) + \bar{y}. \tag{3.18}$$

Then, using the wave propagation model described in Section 2.2.3, it is possible to calculate the formant vector f formed by the first three formant log-frequencies associated with y and, consequently, with $\beta$:

$$f = f(\beta). \tag{3.19}$$

⁵ 30,000 was chosen arbitrarily as a number sufficiently large to characterize a uniform distribution in a five-dimensional space.

⁶ All points associated with log-area vectors containing values outside the limits defined by the maximal and minimal values of each component of the corpus are discarded. (More details are given in Appendix A.)

Figure 3.9: (a) Parametric subspace determined by the first two components of $\alpha$. (b) Points corresponding to realistic area functions (articulatory space). (c) The same points shown in a coordinate system with "less dependent" components. (Panels: (a) Parametric Space and (b) Log-Area Projected Space, with axes $\alpha_1$ and $\alpha_2$; (c) "Less Dependent" Components, with axes $\beta_1$ and $\beta_2$.)

This procedure was carried out for all Q = 7,285 points shown in Fig. 3.9c. The corresponding normalized histograms of the formant log-frequencies, which are approximations for the probability density functions, are shown in Fig. 3.10a, while the scatter plot on the plane defined by $f_1$ and $f_2$ is shown in Fig. 3.10c. After that, the independent component analysis (ICA) method described in Bell and Sejnowski (1995) was used to find a linear transformation ($T_{fg} : \mathbb{R}^3 \to \mathbb{R}^3$) that changes the coordinate system defined by the formant log-frequencies into a system with "less dependent" variables. This transformation can be written as

$$g = T_{fg}\,(f - \bar{f}), \tag{3.20}$$

where $\bar{f}$ is the mean of the logarithm of the Q = 7,285 formant vectors available. The normalized histograms obtained for the components of g are shown in Fig. 3.10b, and the scatter plot of the first two components of g is shown in Fig. 3.10d. At this point, g and $\beta$ define respectively acoustic and articulatory vector variables whose components are more independent than the components of f and $\alpha$. The next step is to model the relationship between acoustic and articulatory spaces.

Before continuing, it is worthwhile to write some lines about the independent component analysis (ICA) technique used here. The ICA problem consists of finding a linear transformation which, when applied to a given ensemble of random vectors, transforms it into an ensemble of vectors whose components are statistically independent, in an ideal case; or as independent as possible, in practical cases. The theoretical background of the problem is very well described in Comon (1994). The approach described in Bell and Sejnowski (1995) (and used in this paper) is based on entropy maximization which, under appropriate conditions, implies mutual information minimization, and consequent independence maximization. The method was originally used to solve the problem of blind separation of mixed sound sources, but has a potentially larger range of applications.
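To make the entropy-maximization idea concrete, the following is a minimal sketch of an Infomax-style ICA with a logistic nonlinearity. It is not the implementation used in the thesis: the whitening step, the natural-gradient form of the update, the step size and the iteration count are all assumptions chosen for brevity, and `infomax_ica` is a hypothetical name.

```python
import numpy as np

def infomax_ica(x, n_iter=500, lr=0.05):
    """Natural-gradient Infomax ICA sketch. x has shape (n_samples, n_dims).
    Returns an unmixing matrix t such that (x - mean) @ t.T has 'less dependent' columns."""
    x = x - x.mean(axis=0)
    # Whitening: decorrelate the data and scale each direction to unit variance.
    d, e = np.linalg.eigh(np.cov(x, rowvar=False))
    whiten = e @ np.diag(d ** -0.5) @ e.T
    z = x @ whiten.T
    n, m = z.shape
    w = np.eye(m)
    for _ in range(n_iter):
        u = z @ w.T
        y = 1.0 / (1.0 + np.exp(-u))                 # logistic nonlinearity
        grad = np.eye(m) + (1.0 - 2.0 * y).T @ u / n
        w += lr * grad @ w                           # natural-gradient update
    return w @ whiten                                # total unmixing matrix
```

Applied to the mean-removed formant vectors, the returned matrix would play the role of $T_{fg}$; the logistic nonlinearity suits peaked (super-Gaussian) distributions, so the result on other data should be checked rather than assumed.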

3.2.3 Singular Value Decomposition

In this section, the mapping from $\beta$ onto g is approximated by a linear transformation ($T_{\beta g} : \mathbb{R}^5 \to \mathbb{R}^3$) as follows

$$g \simeq T_{\beta g}\,\beta. \tag{3.21}$$

Figure 3.10: (a) Normalized histograms of the first 3 formant log-frequencies corresponding to an articulatory space filled with approximately uniformly distributed points. (b) Histograms of the variables obtained after independent component analysis (ICA) of the formant log-frequencies. (c) and (d) Scatter plots of the first 2 variables shown in (a) and (b), respectively. (Axes: (a) number of samples / all samples versus log[frequency (Hz)] for log(f1), log(f2), log(f3); (b) the same for g1, g2, g3 in normalized frequency units; (c) log(f1)-log(f2) plane; (d) g1-g2 plane.)

In such a case, once an ensemble of vectors g and $\beta$ is available, a minimum mean square error (MMSE) procedure can be used to estimate $T_{\beta g}$, yielding

$$T_{\beta g} = G B^t (B B^t)^{-1}, \tag{3.22}$$

with

$$G = [\,g_1 \ \ldots \ g_Q\,] \tag{3.23}$$

and

$$B = [\,\beta_1 \ \ldots \ \beta_Q\,]. \tag{3.24}$$

In the above equations, Q = 7,285 is the number of points present in the ensembles. Once $T_{\beta g}$ is determined, a singular value decomposition procedure (Horn, 1985, pp. 411–455) can be used to find rotations of the acoustic (g) and articulatory ($\beta$) coordinate systems, so that each of the first three components of the articulatory space has major influence on one, and only one, component of the acoustic space. The singular value decomposition of $T_{\beta g}$ yields

$$T_{\beta g} = U_{hg}\, S_{\gamma h}\, U_{\gamma\beta}^t, \tag{3.25}$$

where $U_{hg}$ is a unitary matrix containing the normalized eigenvectors of $T_{\beta g} T_{\beta g}^t$, $U_{\gamma\beta}$ is a unitary matrix containing the normalized eigenvectors of $T_{\beta g}^t T_{\beta g}$, and $S_{\gamma h}$ is a 3 × 5 matrix whose first 3 columns define a diagonal matrix containing the square roots of the eigenvalues of $T_{\beta g} T_{\beta g}^t$, and the elements of the last two columns are all equal to zero. Now, since the multiplication of a unitary matrix by a vector represents a rotation of this vector,

$$\gamma = U_{\gamma\beta}^t\, \beta \tag{3.26}$$

and

$$h = U_{hg}^t\, g \tag{3.27}$$

define, respectively, "rotated" articulatory and acoustic variables. Vectors $\gamma$ and h define parametric representations for log-area vectors y and formant log-frequency vectors f. The relation between y and $\gamma$ is obtained straightforwardly from equations (3.14), (3.15), (3.17), and (3.26), while the relation between f and h is obtained from equations (3.20) and (3.27), yielding

$$\gamma = T_{y\gamma}\,(y - \bar{y}'), \tag{3.28}$$

$$y \simeq T_{\gamma y}\,\gamma + \bar{y}', \tag{3.29}$$

where

$$T_{y\gamma} = U_{\gamma\beta}^t\, T_{\alpha\beta}\, U_{\alpha y}^t, \tag{3.30}$$

$$T_{\gamma y} = U_{\alpha y}\, T_{\alpha\beta}^{-1}\, U_{\gamma\beta}, \tag{3.31}$$

$$\bar{y}' = \bar{y} + U_{\alpha y}\,\bar{\alpha}, \tag{3.32}$$

and

$$h = T_{fh}\,(f - \bar{f}), \tag{3.33}$$

$$f = T_{fh}^{-1}\,h + \bar{f}, \tag{3.34}$$

where

$$T_{fh} = U_{hg}^t\, T_{fg}. \tag{3.35}$$
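The least-squares estimate of Eq. (3.22) and the rotations obtained from the singular value decomposition can be sketched numerically. The data below are random placeholders, not the thesis ensembles, and the variable names (`T_true`, `U_gb_t`, and so on) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 2000
B = rng.normal(size=(5, Q))                        # columns: articulatory vectors (placeholder data)
T_true = rng.normal(size=(3, 5))
G = T_true @ B + 0.01 * rng.normal(size=(3, Q))    # columns: acoustic vectors (placeholder data)

# Least-squares estimate T = G B^t (B B^t)^{-1}, computed with a linear solve
# instead of an explicit matrix inverse.
T = np.linalg.solve(B @ B.T, B @ G.T).T

# SVD: T = U_hg S U_gb^t, with S of size 3 x 5 (last two columns zero).
U_hg, s, U_gb_t = np.linalg.svd(T, full_matrices=True)
S = np.zeros((3, 5))
S[:, :3] = np.diag(s)

Gamma = U_gb_t @ B     # rotated articulatory variables, one column per sample
H = U_hg.T @ G         # rotated acoustic variables, one column per sample
```

After the rotations, `H` is approximately `S @ Gamma`, so each of the three acoustic components is driven by a single articulatory component, which is the purpose of the decomposition.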

The basis vectors contained in the rows of $T_{y\gamma}$ as well as $\bar{y}'$ are plotted in Fig. 3.11. The numerical values are given in Appendix A. The matrix of correlation coefficients (Papoulis, 1991, p. 152) between $\gamma$ and h can be estimated by

$$R_{\gamma h} = \frac{H\,\Gamma^t}{(Q-1)\,\sigma_h\,\sigma_\gamma^t}, \tag{3.36}$$

where

$$H = [\,h_1 \ \ldots \ h_Q\,], \tag{3.37}$$

$$\Gamma = [\,\gamma_1 \ \ldots \ \gamma_Q\,], \tag{3.38}$$

$\sigma_h$ and $\sigma_\gamma$ are the column vectors containing respectively the standard deviations of h and $\gamma$, Q = 7,285 is the number of points present in the ensembles, and the division of $H\Gamma^t$ by $\sigma_h\sigma_\gamma^t$ is performed element-wise. The numerical result obtained is shown below (entries shown in magnitude)

$$R_{\gamma h} = \begin{bmatrix} 0.939 & 0.003 & 0.002 & 0.004 & 0.004 \\ 0.003 & 0.953 & 0.001 & 0.004 & 0.002 \\ 0.005 & 0.003 & 0.461 & 0.001 & 0.002 \end{bmatrix}.$$

This matrix shows that there exists a high degree of correlation between the first two acoustic components and the first two articulatory components. There is also a non-negligible degree of correlation between the third acoustic and articulatory components. All other correlation coefficients are very small. At this point, in order to see the importance of the independent component analysis described in Section 3.2.2, it is interesting to compare $R_{\gamma h}$ with the matrix of correlation coefficients obtained when f and $\alpha$ are used in place of g and $\beta$ to obtain h and $\gamma$, as done in Yehia and Itakura (1995b) and in Yehia et al. (1995). The result is shown below (entries shown in magnitude)

$$R_{\alpha f} = \begin{bmatrix} 0.944 & 0.270 & 0.206 & 0.308 & 0.035 \\ 0.270 & 0.944 & 0.266 & 0.258 & 0.183 \\ 0.112 & 0.142 & 0.511 & 0.069 & 0.184 \end{bmatrix}.$$

Figure 3.11: Basis vectors (rows of $T_{y\gamma}$) and mean vector ($\bar{y}'$) used to represent log-area vectors (y). Units: the first K = 32 components are log-areas along the tract expressed in log(cm²). The last component is the tract length expressed in normalized units (1 unit = 0.53 cm for the basis vectors and 1 unit = 20 × 0.53 cm for the mean vector). (Panels: basis vectors #1 to #5 and mean vector, plotted from glottis to lips followed by the length component.)
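The correlation estimate of Eq. (3.36), with its element-wise division, can be written compactly. In this sketch the function name `cross_correlation` is illustrative, and the explicit mean removal is an assumption added so that the function also works on data that are not already centered.

```python
import numpy as np

def cross_correlation(H, Gamma):
    """R = H Gamma^t / ((Q - 1) sigma_h sigma_gamma^t), with element-wise division.
    H has shape (M, Q) and Gamma has shape (N, Q): rows are components, columns are samples."""
    H = H - H.mean(axis=1, keepdims=True)
    Gamma = Gamma - Gamma.mean(axis=1, keepdims=True)
    Q = H.shape[1]
    sigma_h = H.std(axis=1, ddof=1)
    sigma_g = Gamma.std(axis=1, ddof=1)
    # The outer product gives the matrix sigma_h sigma_gamma^t used as the element-wise divisor.
    return (H @ Gamma.T) / ((Q - 1) * np.outer(sigma_h, sigma_g))
```

The result has one row per acoustic component and one column per articulatory component, matching the 3 × 5 layout of the matrices above.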

Note that, although the correlation between the acoustic components and the corresponding first three articulatory components continues to exist, the other correlation coefficients are no longer negligible. It should be pointed out, however, that lack of correlation does not imply independence. This fact is illustrated in Fig. 3.12, where scatter plots representing the joint cross-distributions of the components of h and of $\gamma$ are shown. There exists, for example, an apparent nonlinear relation between $h_3$ and $\gamma_1$. This kind of dependence cannot be well approximated by the linear transformation used in this work to model the mapping from the articulatory space onto the acoustic space.

In spite of these limitations, the model successfully extracted two acoustic variables, namely $h_1$ and $h_2$, which depend approximately linearly on two, and only two, articulatory variables, namely $\gamma_1$ and $\gamma_2$. The remaining articulatory components, $\gamma_3$, $\gamma_4$, and $\gamma_5$, have little influence on $h_1$ and $h_2$. Moreover, $\gamma_2$ has little effect on $h_1$, and the influence of $\gamma_1$ on $h_2$ does not affect the one-to-one relationship between $\gamma_2$ and $h_2$. These facts are illustrated in Fig. 3.13.

Once the parametric model is derived, and its basic characteristics are analyzed, it is interesting to compare articulatory and acoustic component trajectories for a given sequence of vocal-tract shapes. The trajectories associated with the French sentence "Ma chemise est roussie" are shown in Fig. 3.14. It is possible to observe that the first two articulatory components are indeed closely related to the first two acoustic components. It is also possible to see that there exist some similarities between $h_3$ and $h_1$, indicating that they are not independent.

3.3 Temporal Analysis

Up to this point, each of the P = 519 log-area vectors present in the corpus was parametrized by N (N = 5 in the statistical analysis, and N = 9 in the Fourier analysis) coefficients, which also contain vocal-tract length information. In this section, the objective is to take sequences of p (e.g. p = 10) frames contained in a sentence, and represent them with fewer than pN parameters. The procedure below is carried out for sequences of $\gamma$ parameters, but the same method can be applied to a (Fourier) parameters as well. The parametrization is as follows. Representing sequences of p frames by

$$\Gamma_i = [\,\gamma_i,\ \gamma_{i+1},\ \ldots,\ \gamma_{i+p-1}\,], \tag{3.39}$$

it is possible to compute the "covariance"⁷

$$C_\Gamma = \frac{1}{P'-1} \sum_{i=1}^{P'} \Gamma_i^t\, \Gamma_i, \tag{3.40}$$

where $P'$ is the number of sequences of length p contained in the corpus. Then, following the same method used for log-area statistical parametrization, it is possible to express $C_\Gamma$ as (Takagi's factorization)

$$C_\Gamma = V \Lambda V^t, \tag{3.41}$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $C_\Gamma$ in decreasing order, and the columns of V are the corresponding normalized eigenvectors. $\Gamma_i$ can then be approximated by

$$\Gamma_i \simeq \Phi_i\, V_q^t, \tag{3.42}$$

with $\Phi_i$ given by

$$\Phi_i = \Gamma_i\, V_q, \tag{3.43}$$

where $V_q$ is the matrix containing the first q columns of V (i.e. the normalized eigenvectors corresponding to the q largest eigenvalues of $C_\Gamma$). The components of $\Phi_i$ are orthogonal in the sense that $E[\Phi_i^t \Phi_i] = E[V_q^t \Gamma_i^t \Gamma_i V_q] = V_q^t C_\Gamma V_q = \Lambda_q$, with $E[\cdot]$ denoting expected value, and $\Lambda_q$ being the diagonal matrix containing the q largest eigenvalues of $C_\Gamma$ in decreasing order. Thus, a sequence of p log-area vectors

$$Y_i = \begin{bmatrix} y_{1,i} & \ldots & y_{1,i+p-1} \\ \vdots & & \vdots \\ y_{K+1,i} & \ldots & y_{K+1,i+p-1} \end{bmatrix} \tag{3.44}$$

containing p(K+1) elements (e.g. p(K+1) = 10 × 33 = 330) can be approximately represented by a matrix

$$\Phi_i = \begin{bmatrix} \phi_{11i} & \ldots & \phi_{1qi} \\ \vdots & & \vdots \\ \phi_{N1i} & \ldots & \phi_{Nqi} \end{bmatrix} \tag{3.45}$$

containing qN elements (e.g. qN = 4 × 5 = 20), with

$$Y_i \simeq T_{\gamma y}\, \Phi_i\, V_q^t + \bar{Y}', \tag{3.46}$$

$$\Phi_i = T_{y\gamma}\,(Y_i - \bar{Y}')\, V_q, \tag{3.47}$$

where $\bar{Y}'$ denotes the (K+1) × p matrix whose columns are all equal to $\bar{y}'$.

⁷ "Covariance" is quoted because $\{\Gamma_i;\ i = 1, \ldots, P'\}$ is an ensemble of matrices, and not of vectors as it should be to define a classic covariance matrix. Nevertheless, $C_\Gamma$ contains information about the covariance between components that parametrize the log-area function at different times.

Figure 3.12: Scatter plots representing the joint distributions of the components of the acoustic variable h and the components of the articulatory variable $\gamma$. Note the high correlation between $\gamma_1$ and $h_1$, and between $\gamma_2$ and $h_2$. See also the nonlinear relation between $\gamma_1$ and $h_3$.

Figure 3.13: First two acoustic components ($h_1$ and $h_2$) expressed as functions of the first two articulatory components ($\gamma_1$ and $\gamma_2$), when all other components ($\gamma_3$, $\gamma_4$, and $\gamma_5$) are equal to zero. Note that $h_1$ is almost independent of $\gamma_2$, and that there are one-to-one relationships between $h_1$ and $\gamma_1$, and between $h_2$ and $\gamma_2$.

Figure 3.14: Articulatory and acoustic component trajectories along the sentence (in French): Ma chemise est roussie. Note the similarity between the first two articulatory trajectories and the first two acoustic trajectories. (The dashed lines in the acoustic trajectories indicate the intervals where the formants cannot be reliably extracted from the speech signal due to very narrow constrictions in the area function.)
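The temporal parametrization described in this section (sequences of p frames, a p × p "covariance", and projection on its q leading eigenvectors) can be sketched as follows. The trajectories are random placeholders rather than corpus data, and the names `seqs`, `V_q` and `Phi` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q, n_frames = 5, 10, 4, 300
# Placeholder smooth trajectories standing in for the parametrized corpus (N x n_frames).
gamma = np.cumsum(rng.normal(size=(N, n_frames)), axis=1)
gamma -= gamma.mean(axis=1, keepdims=True)

# All sequences of p consecutive frames, each an N x p matrix.
seqs = np.stack([gamma[:, i:i + p] for i in range(n_frames - p + 1)])
P_prime = seqs.shape[0]

# p x p "covariance" accumulated over the sequences.
C = sum(G.T @ G for G in seqs) / (P_prime - 1)

# Eigen-decomposition, sorted by decreasing eigenvalue.
lam, V = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
V_q = V[:, :q]

Phi = seqs @ V_q                  # N x q coefficients for every sequence
approx = Phi @ V_q.T              # rank-q reconstruction of every sequence
rel_err = np.linalg.norm(seqs - approx) / np.linalg.norm(seqs)
```

Each sequence of N × p values is thus replaced by N × q coefficients; `rel_err` gives the relative reconstruction error for the placeholder data and says nothing about the error on real speech.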

The eigenvalues and the first four normalized eigenvectors of the "covariance" matrix are shown in Fig. 3.15. It can be seen that the eigenvectors have approximately the shape of cosine functions. This indicates that a Fourier series expansion is also appropriate to represent the temporal behavior of parametrized log-area sequences (Yehia and Itakura, 1993a, 1994).

The representation of an area function sequence by the method described here is illustrated in Fig. 3.16. Panel (a) shows the original sequence taken from the corpus, while panel (d) shows the corresponding first three formant trajectories. Panel (b) shows the sequence recovered from a parametrization by N × q = 5 × 4 = 20 coefficients obtained with the principal component analysis (PCA) described in this and in the previous section. Finally, panel (c) shows the sequence seen in panel (a) recovered from a two-dimensional Fourier cosine series approximation by N × q = 9 × 4 = 36 coefficients. Note the good agreement between the formant trajectories associated with the recovered area function sequences (panels (e) and (f)). Not surprisingly, even with considerably fewer parameters, PCA better preserves the morphological characteristics of the vocal-tract.

Comment

Now, the vocal-tract is represented by appropriate parameters. The next task is to estimate these parameters from formant log-frequencies. A procedure to accomplish this task is the topic of the next chapter.

Figure 3.15: Eigenvalues of the "covariance" matrix of sequences of parametrized log-area vectors, and corresponding first four eigenvectors. (Panels: eigenvalues 1 to 10; normalized eigenvectors #1 to #4 plotted against frame number, 1 to 10.)

Figure 3.16: (a) Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the (French) sentence "Luis pense à ça." (b) Sequence of areas reconstructed from the parametric principal component representation of the original areas shown in (a). (c) Sequence of areas reconstructed from the parametric Fourier representation of the original areas shown in (a). (d), (e) and (f) show formant frequency trajectories corresponding to the sequences of areas shown in (a), (b) and (c) respectively. The dashed lines shown in (e) and (f) are the original formant trajectories shown in (d). For each pair of formant trajectories, the maximum relative difference (in percentage) is also shown. (Maximum differences shown in the panels: (e) F1 7%, F2 9.8%, F3 4.7%; (f) F1 8%, F2 9.9%, F3 11%.)

Chapter 4 The Inverse Problem

"Tangible things become insensible to the palm of the hand." Carlos Drummond de Andrade (1902–1987), Memory

Before entering the details of the speech production inverse problem, it is interesting to analyze it from a more generic point of view: the articulatory space can be viewed as a space whose dimension N is larger than the dimension M of the acoustic space. The problem is then to find, among all the points in the articulatory space that are mapped onto a given point in the acoustic space, the one that is the most likely to occur. If such points define an articulatory subspace of dimension N − M (see Figure 4.1), the solution for the inverse problem can be divided in two parts: (i) mathematical description of the (N − M)-dimensional articulatory subspaces, each of them corresponding to one and only one point in the M-dimensional acoustic space; and (ii) formulation of a cost function to determine, for each subspace, the point that is the most likely to occur. This is the procedure that will be described in this chapter.

Figure 4.1: Representation of the one-to-one relationship between the (N − M)-dimensional subspaces that form the N-dimensional articulatory space and the points of the M-dimensional acoustic space. Compare with Figures 4.2 and 4.3, where the level curves are N − M = 1 dimensional subspaces (contained in an N = 2 dimensional articulatory space) which map onto an M = 1 dimensional acoustic space.

The speech production inverse problem, i.e. the problem of estimating the vocal-tract configuration from the speech signal, can be seen as a one-to-many non-linear mapping. This mapping establishes the relationship between an articulatory space, determined by all possible vocal-tract configurations, and an acoustic space, determined by all possible speech signals. The one-to-many characteristic comes from the fact that a given period of speech can be generated by an infinite number of vocal-tract configurations.

The non-linear characteristic is inherent in the process of speech generation. However, its degree of complexity depends on the parameters chosen to represent both articulatory and acoustic spaces. The objective of this chapter is to analyze a restricted case of the speech production inverse problem, namely the estimation of the cross-sectional area along the vocal-tract (or, for simplicity, the area function) from the corresponding formant frequencies. For this particular case, each area function is represented by one point in the articulatory space, while each set of formant frequencies is represented by one point in the acoustic space.

The first difficulty in solving this problem comes from the one-to-many characteristic of the inverse problem, i.e. each point in the acoustic space is associated with many points in the articulatory space. To cope with this fact, two kinds of constraints can be invoked. The first one is related to the morphology of the vocal-tract, which determines the positions that can be reached and the effort necessary to reach them. Under this constraint, the inverse problem can be stated in the following way: Find, among all points in the articulatory space associated with a given point in the acoustic space, the point that is reached with minimum effort¹ by the vocal-tract.

Morphological constraints, however, are essentially static and, therefore, cannot account for co-articulation effects such as anticipation and retention. In order to cope with these effects, a second kind of constraint can be invoked: it is related to the patterns of motion of the vocal-tract, which can be called gestures, and can be used to determine the trajectories that can be followed by the tract, and the effort necessary to execute each of them. At this point, it is possible to expand the concept of an articulatory space, and think about an articulatory trajectory space which is formed by all the gestures that can be generated by the human vocal-tract. Such a space maps onto an acoustic trajectory space which is formed by all trajectories that can be generated by the vocal-tract in the acoustic space. Under this point of view, the inverse problem can be restated as: Find, among all points in the articulatory trajectory space associated with a given point in the acoustic trajectory space, the point that is produced with minimum effort by the vocal-tract.

¹ In the case of human motor behavior, minimum effort is a concept difficult to specify. In reality it is a combination of factors which range from the command generation level in the brain down to articulatory motion under physiological constraints. The effort function used here is a simple quadratic cost that takes into account vocal-tract morphological information.


4.1 Isolated frames

In this section, the mathematical formulation used to represent morphological constraints is described for the case of isolated frames. In the following section the method is generalized for the case of trajectories. The procedure can be divided in two parts: first, a mathematical representation for the "cost" of a given position of the vocal-tract is derived. After that, this cost function is minimized under the acoustic constraint determined by a given set of formant frequencies.

4.1.1 Representing Morphological Constraints

In order to keep a link with the works developed by Schroeder (1967) and Mermelstein (1967), we start the explanation of morphological constraint representation using the truncated Fourier cosine series parametrization of the log-area function. There are many sets of Fourier cosine coefficients that are associated with the same set of formant frequencies (Mermelstein, 1967; Atal et al., 1978). For this reason, it is necessary to impose constraints if we wish to estimate the area function from the formants. As an example, the thick solid lines shown in the bottom panel of Figure 4.2 represent level curves of the surface shown in the top left panel, which shows the first formant frequency (F1 in kHz) when the vocal-tract cross-sectional area is represented by

$$A_k = \exp\left[a_1 \cos\frac{\pi}{K}\left(k - \frac{1}{2}\right) + a_2 \cos\frac{2\pi}{K}\left(k - \frac{1}{2}\right)\right], \qquad L = 17\ \text{cm}. \tag{4.1}$$

It is seen that, for a given value of F1, there is an infinite number of combinations of $a_1$ and $a_2$ associated with it. Another important fact is that, for a given $a_2$, there is one and only one $a_1$ associated with a given formant frequency.

The same property is observed when $\gamma$ and h are used to parametrize articulatory and acoustic spaces. This fact is illustrated in the bottom panel of Figure 4.3, where the thick solid lines show level curves of $h_1$ expressed as a function of $\gamma_1$ and $\gamma_2$ when all other components of $\gamma$ are equal to zero. It is seen that each point $h_1$ is associated with a line in the plane defined by $\gamma_1$ and $\gamma_2$. From now on, since $\gamma$ and h allow a more efficient parametrization than a and f, most of the mathematical procedures carried out here will be based on $\gamma$ and h.
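The two-coefficient area function of Eq. (4.1) is easy to evaluate numerically. The sketch below assumes K = 32 sections, as used for the log-area vectors earlier in the text, and generalizes to any number of cosine coefficients; the function name `area_function` is illustrative.

```python
import numpy as np

def area_function(a, K=32):
    """Areas A_k, k = 1..K, from Fourier cosine coefficients a_1, a_2, ... of the
    log-area function (no a_0 term, as in the two-coefficient example)."""
    k = np.arange(1, K + 1)
    n = np.arange(1, len(a) + 1)
    # basis[n-1, k-1] = cos(n * pi * (k - 1/2) / K)
    basis = np.cos(np.outer(n, np.pi * (k - 0.5) / K))
    return np.exp(np.asarray(a, dtype=float) @ basis)
```

For example, `area_function([0.0, 0.0])` returns a uniform tube of unit area, and a nonzero $a_1$ alone widens one half of the tube while narrowing the other by the same factor.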

Figure 4.2: Top left: the first formant F1 as a function of the Fourier cosine coefficients $a_1$ and $a_2$, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function $P_a = a^t H_a a$ used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed circles represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix).

Figure 4.3: Top left: the first acoustic component $h_1$ as a function of the principal component coefficients $\gamma_1$ and $\gamma_2$, when all other coefficients are equal to zero. Top right: paraboloidal surface representing the cost function $P_\gamma = \gamma^t H_\gamma \gamma$ used to quantify the vocal-tract effort. Bottom: The solid thick lines show level curves of the surface shown in the top left panel. (Compare with the general case in Fig. 4.1.) The solid thin ellipses show level curves of the cost function shown in the top right panel. The dashed ellipses represent the particular case when $P_y$ is an unweighted squared Euclidean distance (i.e. when $H_y$ is an identity matrix).

Nevertheless, the counterpart procedures based on a and f can be obtained basically by substituting $\gamma$ by a and h by f.

Coming back to the topic of mathematical representation of morphological constraints, the (static) constraint considered here is based on the following optimization problem: Given a vector of acoustic variables h, find, among all possible vectors of articulatory variables $\gamma$ associated with h, the one that is the "closest" to the "minimum effort position." This position can be given by the neutral or the average position of the vocal-tract: it is reasonable to assume that the neutral vowel position corresponds to the minimum effort position of the tract, since no active articulation is being performed. It is also possible to think that the minimum effort position corresponds to the average position of the tract. If the neutral position is mathematically interpreted as the point of maximum probability density in the articulatory space, then it coincides with the average position if the probability density function of the points in the articulatory space is symmetric, but may not coincide in other cases. The neutral position seems to be more meaningful, but the average position is more tractable from the mathematical point of view. This point will be addressed again later.

A mathematical formulation for a morphological constraint can be carried out in the following way: a log-area vector y can be efficiently parametrized by a vector $\gamma$ as already seen in the last chapter (Eq. (3.29))

$$y \simeq T_{\gamma y}\,\gamma + \bar{y}'.$$

If the "minimum effort position" is defined by $\bar{y}'$, then a quadratic positional cost $P_\gamma$ can be defined as

$$P_\gamma = \gamma^t\, T_{\gamma y}^t\, H_y\, T_{\gamma y}\, \gamma, \tag{4.2}$$

which is an approximation for the quadratic form

$$P_y = (y - \bar{y}')^t\, H_y\, (y - \bar{y}'). \tag{4.3}$$

Now, making

$$H_\gamma = T_{\gamma y}^t\, H_y\, T_{\gamma y} \tag{4.4}$$

yields

$$P_\gamma = \gamma^t H_\gamma\, \gamma. \tag{4.5}$$

4.1. ISOLATED FRAMES


$H$ is a positive definite matrix which contains information about the morphology of the vocal-tract. It must be chosen so that natural positions of the tract result in a low cost $P$, while positions incompatible with the morphological characteristics of the tract result in a high cost $P$.

The simplest choice for $H$ is obtained when $H_y$ is taken as the identity matrix of order $(K+1)$. In this case, $P$ is simply the square of the Euclidean distance between the parametrized log-areas of a given vector $\mathbf{y}$ and $\boldsymbol{\mu}_{0y}$, the log-area vector corresponding to the minimum effort position. This is, however, not a good cost function, since it gives equal weight to flexible and rigid regions of the vocal-tract. A geometric illustration of this case is given by the dashed ellipses shown in the bottom panel of Fig. 4.3. However, the meaning of this unweighted case of the cost function becomes clearer when it is represented by Fourier cosine coefficients: the dashed circles shown in the bottom panel of Fig. 4.2 represent level curves of $P_a$ as a function of $a_1$ and $a_2$. It is possible to see that the minimal cost for a given formant corresponds to the intersection of its level curve with the line $a_2 = 0$. It is interesting to note that Mermelstein (1967) used basically the same mathematical constraint: for $N = 7$ and $M = 3$, $\{a_0, a_2, a_4, a_6\}$ were kept equal to zero, while the first $M = 3$ formant frequencies $\{F_1, F_2, F_3\}$ were used to determine the first $M = 3$ odd Fourier cosine coefficients $\{a_1, a_3, a_5\}$. (The problem of length determination was not considered.) Also interesting is the fact that the results obtained with this simple and rather artificial constraint are quite acceptable for some vowels, as shown in Mermelstein (1967), and in the next chapter.

A more realistic possibility is obtained when $H_y$ is taken as a diagonal matrix in which the elements of the diagonal associated with the rigid regions of the tract are large, while the elements associated with the flexible regions are small. This approach, combined with a smoothness constraint, was used by Yehia and Itakura (1993, 1994) with reasonable results. The weak point in this approach is that, although the local characteristics of each region of the tract are well represented, the global articulatory structure (morphology) of the tract is not taken into account. As an example, the cost of a given position of the tongue apical region may depend on the position of the tongue dorsal region.

In order to represent the interdependence between different regions of the tract, the covariance between those regions must be taken into account. From a probabilistic point of view, given a corpus containing $Q > N$ linearly independent log-area vectors, if $H$ is taken as the inverse of the covariance matrix of the log-area articulatory vectors $\boldsymbol{\beta}$,

$$H = C^{-1}, \quad \text{where} \quad C \simeq \frac{1}{Q-1} \sum_{i=1}^{Q} \boldsymbol{\beta}_i \boldsymbol{\beta}_i^t, \tag{4.6}$$

then

$$P = \boldsymbol{\beta}^t C^{-1} \boldsymbol{\beta}, \tag{4.7}$$

i.e. $P$ becomes a squared Mahalanobis distance (Duda and Hart, 1973, pp. 23–24). Under the rather strong assumption of normal distribution, this means that, given an acoustic vector $\mathbf{h}$, minimization of $P$ implies maximization of the probability of occurrence of the corresponding $\boldsymbol{\beta}$. It is interesting to note that the same $H$ can be found by minimizing

$$C_P = \frac{1}{Q} \sum_{i=1}^{Q} P_i = \frac{1}{Q} \sum_{i=1}^{Q} \boldsymbol{\beta}_i^t H \boldsymbol{\beta}_i \simeq \mathrm{trace}(C H), \tag{4.8}$$

the average of the costs of all articulatory vectors in the corpus, with respect to the elements of $H$, under the constraint

$$|H| \geq |C^{-1}|. \tag{4.9}$$

The proof is based on the fact that the trace and the determinant are respectively the sum and the product of the eigenvalues of $C H$. By the inequality between arithmetic and geometric means, minimization of the trace under the above determinant constraint implies that the eigenvalues must all be equal to 1 (one) and, hence, $C H$ is an identity matrix. Since $C$ is a covariance matrix estimated from $Q > N$ linearly independent vectors,

$$H = C^{-1}$$

is defined, and is a positive definite matrix. From a geometrical point of view, the level hypersurfaces of $P$ are hyperellipsoids whose principal axes are determined by the eigenvectors of $C$, the eigenvalues determining the length of these axes. An illustration for a two-dimensional case is given by the ellipses shown in the bottom panel of Fig. 4.3, which represent level curves of $P$ as a function of $\beta_1$ and $\beta_2$.
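The covariance-based cost can be illustrated with a minimal sketch. The two-dimensional corpus below is hypothetical (it is not data from this study), and the helper names are chosen only for this illustration. The sketch estimates $C$, inverts it, and evaluates the quadratic cost for two points at the same Euclidean distance from the center, one along and one across the dominant direction of the corpus.

```python
def mean_and_covariance(samples):
    """Return the mean and the unbiased 2x2 covariance matrix of 2-D samples."""
    q = len(samples)
    mean = [sum(s[k] for s in samples) / q for k in range(2)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (q - 1)
            for j in range(2)] for i in range(2)]
    return mean, cov


def inverse_2x2(c):
    """Invert a 2x2 matrix; the covariance must be non-singular."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    if abs(det) < 1e-12:
        raise ValueError("covariance matrix is singular")
    return [[c[1][1] / det, -c[0][1] / det],
            [-c[1][0] / det, c[0][0] / det]]


def positional_cost(beta, center, h):
    """Quadratic cost (beta - center)^t H (beta - center)."""
    d = [beta[k] - center[k] for k in range(2)]
    return sum(d[i] * h[i][j] * d[j] for i in range(2) for j in range(2))


# Hypothetical toy corpus with positively correlated components.
corpus = [(2.0, 1.0), (-2.0, -1.0), (1.0, 1.0),
          (-1.0, -1.0), (1.0, 0.0), (-1.0, 0.0)]
mean, C = mean_and_covariance(corpus)
H = inverse_2x2(C)  # H = C^{-1}, as in Eq. (4.6)

# Same Euclidean distance from the center, different Mahalanobis cost.
cost_along = positional_cost((1.0, 1.0), mean, H)
cost_across = positional_cost((1.0, -1.0), mean, H)

# Average cost over the corpus, cf. Eq. (4.8).
average_cost = sum(positional_cost(b, mean, H) for b in corpus) / len(corpus)
print(cost_along, cost_across, average_cost)
```

In this sketch the point lying along the direction in which the corpus varies most receives the lower cost, which is the behavior the unweighted Euclidean distance cannot reproduce. The average cost equals $N(Q-1)/Q$ because $C$ is estimated with the factor $1/(Q-1)$ while the average uses $1/Q$.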


The cost function derived above is able to cope with the articulatory effects determined by the morphology of the vocal-tract. However, it is important to note that, in a wider sense, it cannot be considered optimal. This is because the quadratic form adopted for $P$ implies the existence of a single point of minimum, corresponding to the minimum effort position. Other stable positions, corresponding to local minima, may exist and, if they are to be taken into account, a more elaborate model is needed. Also, the probability distribution of points in the articulatory space is not symmetric relative to the mean. When solving the inverse problem, it was observed that slightly better results are obtained when the quadratic cost function has its center translated to the point of maximum probability density in the articulatory space (i.e. the neutral position). Finally, since $P$ is essentially a static constraint, it cannot cope with co-articulation effects. (This point will be analyzed later.) As a final comment, note that, for the case of Fourier cosine components, following the same procedure used for the principal components, the cost function is given by

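$$P_a = (\mathbf{a} - \boldsymbol{\mu}_a)^t H_a (\mathbf{a} - \boldsymbol{\mu}_a), \qquad H_a = C_a^{-1} = \left[ \frac{1}{Q-1} \sum_{i=1}^{Q} (\mathbf{a}_i - \boldsymbol{\mu}_a)(\mathbf{a}_i - \boldsymbol{\mu}_a)^t \right]^{-1}, \tag{4.10}$$

where $\boldsymbol{\mu}_a$ is the mean of the Fourier cosine coefficient vectors $\mathbf{a}_i$ of the corpus.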

4.1.2 Solving the inverse problem

For a given acoustic vector $\mathbf{h}$, it is now possible to derive a procedure to estimate its articulatory counterpart, represented by vector $\boldsymbol{\beta}$, under the morphological constraint described above. The method is as follows.

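Relationship between acoustic and articulatory variables

A variation in the acoustic vector $\mathbf{h}$ is locally linearly related to a variation in the articulatory vector $\boldsymbol{\beta}$ (Mermelstein, 1967). Thus, it is reasonable to assume that, for sufficiently small variations,$^2$

$$\Delta\mathbf{h} = \Phi_a(\boldsymbol{\beta})\, \Delta\boldsymbol{\beta}, \tag{4.11}$$

$$\mathbf{h}(\boldsymbol{\beta} + \Delta\boldsymbol{\beta}) = \mathbf{h}(\boldsymbol{\beta}) + \Phi_a(\boldsymbol{\beta})\, \Delta\boldsymbol{\beta}, \tag{4.12}$$

$^2$ In strict terms, the equality holds only for infinitesimal variations. However, this relation is approximately true even for fairly large variations, as seen in Fig. 3.12.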


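where $\Phi_a$ is the Jacobian matrix that gives the partial derivatives $\partial h_j / \partial \beta_i$ for a given $\boldsymbol{\beta}$:

$$\Phi_a(\boldsymbol{\beta}) = \frac{d\mathbf{h}}{d\boldsymbol{\beta}}, \tag{4.13}$$

which is an $M \times N$ matrix (the number of rows is equal to the number of acoustic components $M$, and the number of columns is equal to the number of articulatory components $N$). Figure 3.12 gives an idea of the "degree of linearity" between $\mathbf{h}$ and $\boldsymbol{\beta}$. It shows the $M = 3$ acoustic variables $\mathbf{h}$ in terms of the $N = 5$ articulatory variables $\beta_1, \beta_2, \ldots, \beta_5$. There are two important facts to be noted here:$^3$

1. There is a quasi-linear relationship between $\mathbf{h}$ and $\boldsymbol{\beta}$.
2. There is a one-to-one relationship between $[\beta_1, \beta_2, \beta_3]$ and $[h_1, h_2, h_3]$. (In fact, what is apparent in Figure 3.12 is that $h_1$, $h_2$ and $h_3$ are monotonically increasing functions of $\beta_1$, $\beta_2$ and $\beta_3$, respectively.)

$^3$ Similar conclusions can be taken from Figure 3.3 for the case of Fourier cosine coefficients and formant frequencies.

This one-to-one relationship leads to the following speculation: taking $\boldsymbol{\beta}_1$ as the vector formed by the first $M$ articulatory components of $\boldsymbol{\beta}$, the matrix

$$\Phi_1(\boldsymbol{\beta}) = \frac{\partial \mathbf{h}}{\partial \boldsymbol{\beta}_1} = \begin{bmatrix} \dfrac{\partial h_1}{\partial \beta_1} & \dfrac{\partial h_1}{\partial \beta_2} & \cdots & \dfrac{\partial h_1}{\partial \beta_M} \\ \dfrac{\partial h_2}{\partial \beta_1} & \dfrac{\partial h_2}{\partial \beta_2} & \cdots & \dfrac{\partial h_2}{\partial \beta_M} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial h_M}{\partial \beta_1} & \dfrac{\partial h_M}{\partial \beta_2} & \cdots & \dfrac{\partial h_M}{\partial \beta_M} \end{bmatrix} \tag{4.14}$$

is not singular. In fact, numerical tests with log-area functions indicate that $\det(\Phi_1)$ is practically always positive. Exceptions do exist, but they did not cause problems in the cases analyzed so far. Therefore, under the assumption that $\det(\Phi_1) \neq 0$, it is possible to divide

$$\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_M, \beta_{M+1}, \ldots, \beta_N]^t \tag{4.15}$$

into two subvectors, $\boldsymbol{\beta}_1$ containing the first $M$ components of $\boldsymbol{\beta}$, and $\boldsymbol{\beta}_2$ containing the remaining components,

$$\boldsymbol{\beta}_1 = [\beta_1, \beta_2, \ldots, \beta_M]^t, \tag{4.16}$$

$$\boldsymbol{\beta}_2 = [\beta_{M+1}, \beta_{M+2}, \ldots, \beta_N]^t, \tag{4.17}$$

where $M$ is the number of acoustic components; and express $\Delta\boldsymbol{\beta}_1$ in terms of $\Delta\mathbf{h}$ and $\Delta\boldsymbol{\beta}_2$ as follows:

$$\Phi_a \Delta\boldsymbol{\beta} = \Delta\mathbf{h}, \tag{4.18}$$

$$\Phi_1 \Delta\boldsymbol{\beta}_1 + \Phi_2 \Delta\boldsymbol{\beta}_2 = \Delta\mathbf{h}, \tag{4.19}$$

$$\Delta\boldsymbol{\beta}_1 = \Phi_1^{-1} \Delta\mathbf{h} - \Phi_1^{-1} \Phi_2 \Delta\boldsymbol{\beta}_2, \tag{4.20}$$

where $\Phi_1$ was already defined in Eq. (4.14), and

$$\Phi_2(\boldsymbol{\beta}) = \frac{\partial \mathbf{h}}{\partial \boldsymbol{\beta}_2} = \begin{bmatrix} \dfrac{\partial h_1}{\partial \beta_{M+1}} & \dfrac{\partial h_1}{\partial \beta_{M+2}} & \cdots & \dfrac{\partial h_1}{\partial \beta_N} \\ \dfrac{\partial h_2}{\partial \beta_{M+1}} & \dfrac{\partial h_2}{\partial \beta_{M+2}} & \cdots & \dfrac{\partial h_2}{\partial \beta_N} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial h_M}{\partial \beta_{M+1}} & \dfrac{\partial h_M}{\partial \beta_{M+2}} & \cdots & \dfrac{\partial h_M}{\partial \beta_N} \end{bmatrix}, \tag{4.21}$$

where $\boldsymbol{\beta}_2$ is given by the last $N - M$ components of $\boldsymbol{\beta}$. (The equations above are in general terms. In the case under analysis, $M = 3$ and $N = 5$.)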
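The partition of the Jacobian can be sketched numerically. The map `toy_h` below is a hypothetical, mildly non-linear stand-in for the true articulatory-to-acoustic mapping (with $M = 2$ and $N = 3$ instead of $M = 3$ and $N = 5$), and the Jacobian is estimated by central differences, which is an assumption of this sketch rather than the procedure used in this study. The sketch checks that the $\Delta\boldsymbol{\beta}_1$ given by Eq. (4.20) satisfies the acoustic constraint of Eq. (4.18) for an arbitrary choice of $\Delta\boldsymbol{\beta}_2$.

```python
def toy_h(beta):
    """Hypothetical smooth map from N = 3 articulatory to M = 2 acoustic variables."""
    b1, b2, b3 = beta
    return [b1 + 0.1 * b2 + 0.3 * b3 + 0.05 * b1 * b1,
            0.2 * b1 + b2 - 0.4 * b3 + 0.05 * b2 * b3]


def jacobian(h, beta, eps=1e-6):
    """Central-difference estimate of the M x N Jacobian dh/dbeta."""
    n, m = len(beta), len(h(beta))
    jac = [[0.0] * n for _ in range(m)]
    for i in range(n):
        up, down = list(beta), list(beta)
        up[i] += eps
        down[i] -= eps
        h_up, h_down = h(up), h(down)
        for j in range(m):
            jac[j][i] = (h_up[j] - h_down[j]) / (2 * eps)
    return jac


def solve_2x2(a, b):
    """Solve a 2x2 linear system; fails if Phi_1 is singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("Phi_1 is singular")
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def delta_beta1(phi_a, dh, dbeta2, m=2):
    """Eq. (4.20): dbeta1 = Phi_1^{-1} dh - Phi_1^{-1} Phi_2 dbeta2."""
    phi_1 = [row[:m] for row in phi_a]
    phi_2 = [row[m:] for row in phi_a]
    rhs = [dh[j] - sum(phi_2[j][k] * dbeta2[k] for k in range(len(dbeta2)))
           for j in range(m)]
    return solve_2x2(phi_1, rhs)


beta = [0.2, -0.1, 0.3]
phi_a = jacobian(toy_h, beta)

dh = [0.01, -0.02]   # desired acoustic variation
dbeta2 = [0.05]      # free choice for the remaining component
dbeta = delta_beta1(phi_a, dh, dbeta2) + dbeta2

# Residual of the acoustic constraint Phi_a dbeta = dh, Eq. (4.18).
residual = [sum(phi_a[j][i] * dbeta[i] for i in range(3)) - dh[j] for j in range(2)]
print(residual)
```

Any value of `dbeta2` yields a valid solution of the linearized acoustic constraint, which is why an additional criterion is needed to select one of them.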

Combining acoustic and morphological constraints

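The above relation can now be used as an acoustic constraint for the cost function $P$. By minimizing $P[\boldsymbol{\beta} + \Delta\boldsymbol{\beta}(\Delta\boldsymbol{\beta}_2)]$ with respect to $\Delta\boldsymbol{\beta}_2$ (using $\Delta\boldsymbol{\beta}_1 = \Phi_1^{-1} \Delta\mathbf{h} - \Phi_1^{-1} \Phi_2 \Delta\boldsymbol{\beta}_2$), it is possible to find the $\Delta\boldsymbol{\beta}_{\min}$ that minimizes $P(\boldsymbol{\beta} + \Delta\boldsymbol{\beta})$ under the acoustic constraint $\Phi_a \Delta\boldsymbol{\beta} = \Delta\mathbf{h}$. The mathematical formulation is as follows. Given

$$P[\boldsymbol{\beta} + \Delta\boldsymbol{\beta}(\Delta\boldsymbol{\beta}_2)] = [\Delta\boldsymbol{\beta}(\Delta\boldsymbol{\beta}_2) + \boldsymbol{\beta} - \boldsymbol{\beta}_0]^t H [\Delta\boldsymbol{\beta}(\Delta\boldsymbol{\beta}_2) + \boldsymbol{\beta} - \boldsymbol{\beta}_0], \tag{4.22}$$

where $\boldsymbol{\beta}_0$ is the neutral position,$^4$ find the family of vectors $\Delta\boldsymbol{\beta}$ for which

$$\frac{dP}{d\Delta\boldsymbol{\beta}_2} = 0. \tag{4.23}$$

Doing the calculation,

$$\frac{dP}{d\Delta\boldsymbol{\beta}_2} = \left[ \frac{d\Delta\boldsymbol{\beta}}{d\Delta\boldsymbol{\beta}_2} \right]^t \frac{dP}{d\Delta\boldsymbol{\beta}} = \left[ \frac{d\Delta\boldsymbol{\beta}}{d\Delta\boldsymbol{\beta}_2} \right]^t [H + H^t] (\Delta\boldsymbol{\beta} + \boldsymbol{\beta} - \boldsymbol{\beta}_0). \tag{4.24}$$

$^4$ As mentioned before, better results were obtained when the neutral position was used instead of the average position as the minimum effort position.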

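Here, the derivative of $\Delta\boldsymbol{\beta}$ with respect to $\Delta\boldsymbol{\beta}_2$ will be called $\boldsymbol{\beta}'$ and is given by

$$\boldsymbol{\beta}' = \frac{d\Delta\boldsymbol{\beta}}{d\Delta\boldsymbol{\beta}_2} = \begin{bmatrix} -\Phi_1^{-1} \Phi_2 \\ I_{N-M} \end{bmatrix}, \tag{4.25}$$

where $\Phi_1$ and $\Phi_2$ are the Jacobian blocks of Eqs. (4.14) and (4.21), and $I_{N-M}$ is the identity matrix of order $N - M$. Now, since $H$ is symmetric (it is the inverse of a covariance matrix),

$$\frac{dP}{d\Delta\boldsymbol{\beta}_2} = 2 \boldsymbol{\beta}'^t H (\Delta\boldsymbol{\beta} + \boldsymbol{\beta} - \boldsymbol{\beta}_0) \tag{4.26}$$

and, finally, $dP / d\Delta\boldsymbol{\beta}_2 = 0$ implies that

$$\boldsymbol{\beta}'^t H \Delta\boldsymbol{\beta} = \boldsymbol{\beta}'^t H (\boldsymbol{\beta}_0 - \boldsymbol{\beta}), \tag{4.27}$$

which can be rewritten as

$$\Phi_p \Delta\boldsymbol{\beta} = \Phi_p (\boldsymbol{\beta}_0 - \boldsymbol{\beta}), \tag{4.28}$$

where

$$\Phi_p = \boldsymbol{\beta}'^t H. \tag{4.29}$$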

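The linear system above gives the necessary $N - M$ equations to complete the underdetermined system given by Eq. (4.11):

$$\left. \begin{aligned} \Phi_a \Delta\boldsymbol{\beta} &= \Delta\mathbf{h} \\ \Phi_p \Delta\boldsymbol{\beta} &= \Phi_p (\boldsymbol{\beta}_0 - \boldsymbol{\beta}) \end{aligned} \right\} \implies \Phi \Delta\boldsymbol{\beta} = \tilde{\mathbf{h}} \implies \Delta\boldsymbol{\beta} = \Phi^{-1} \tilde{\mathbf{h}}, \tag{4.30}$$

where

$$\Phi = \begin{bmatrix} \Phi_a \\ \Phi_p \end{bmatrix} \quad \text{and} \quad \tilde{\mathbf{h}} = \begin{bmatrix} \Delta\mathbf{h} \\ \Phi_p (\boldsymbol{\beta}_0 - \boldsymbol{\beta}) \end{bmatrix}. \tag{4.31}$$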

Iterative solution

For larger variations, the system above is an approximation, since $\Phi$ depends on $\boldsymbol{\beta}$. However, as seen in Fig. 3.12, there is a high degree of linearity in this non-linear system. In the experiments performed, it was successfully solved by the following Newton-Raphson procedure.

    β1 = β1(h1, H)              \ Function to compute β1 from h1 and prior information H.
    {
        β = β0;                 \ Initialize β to the neutral position and compute
        h = h(β0);              \ the corresponding acoustic vector.
        Δh = h1 − h;
        while (‖Δh‖ > ε)
        {
            Φ = Φ(β, H);        \ Calculate new Φ, h̃, and Δβ.
            h̃ = h̃(Δh, β, Φ);
            Δβ = Φ⁻¹ h̃;
            β = β + Δβ;         \ Update β.
            h = h(β);           \ Update h.
            Δh = h1 − h;
        }
        β1 = β;                 \ Return β1.
    }

Here ‖·‖ is a norm function. In the implemented system it is the maximal deviation between the desired and obtained acoustic vectors ($\mathbf{h}$). ε is an error criterion. A value around 0.01 (1%) was found to be a good compromise between precision and computation cost. The input $\mathbf{h}_1$ is obtained from the first $M$ formant frequencies using Eq. (3.33), while the area vector is obtained from the output $\boldsymbol{\beta}_1$ using Eq. (3.29). For the analyzed cases (see next chapter) this procedure took on average three iterations to converge, the hardest cases taking six iterations.
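The procedure above can be sketched in runnable form under stated assumptions. The forward map `toy_h`, the cost matrix `H` and the neutral position `beta0` below are hypothetical placeholders (with $M = 2$ and $N = 3$), the Jacobian is estimated by central differences, and an iteration cap is added as a safeguard; none of these choices is taken from the implemented system. Each iteration builds $\Phi_p = \boldsymbol{\beta}'^t H$ from Eqs. (4.25) and (4.29), stacks it under $\Phi_a$, and solves Eq. (4.30).

```python
def jacobian(h, beta, eps=1e-6):
    """Central-difference estimate of the M x N Jacobian dh/dbeta."""
    n, m = len(beta), len(h(beta))
    jac = [[0.0] * n for _ in range(m)]
    for i in range(n):
        up, down = list(beta), list(beta)
        up[i] += eps
        down[i] -= eps
        h_up, h_down = h(up), h(down)
        for j in range(m):
            jac[j][i] = (h_up[j] - h_down[j]) / (2 * eps)
    return jac


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x


def invert(h, h_target, H, beta0, tol=1e-10, max_iter=50):
    """Estimate beta from h_target, starting at the neutral position beta0."""
    n, m = len(beta0), len(h_target)
    beta = list(beta0)
    dh = [t - v for t, v in zip(h_target, h(beta))]
    for _ in range(max_iter):
        if max(abs(d) for d in dh) <= tol:   # maximal acoustic deviation
            break
        phi_a = jacobian(h, beta)
        phi_1 = [row[:m] for row in phi_a]
        phi_p = []
        for k in range(m, n):
            # Column k of beta' = [-Phi_1^{-1} Phi_2; I], Eq. (4.25).
            top = solve(phi_1, [phi_a[j][k] for j in range(m)])
            column = [-t for t in top] + [1.0 if i == k else 0.0 for i in range(m, n)]
            # Corresponding row of Phi_p = beta'^t H, Eq. (4.29).
            phi_p.append([sum(column[i] * H[i][j] for i in range(n)) for j in range(n)])
        offset = [b0 - b for b0, b in zip(beta0, beta)]
        rhs = dh + [sum(row[j] * offset[j] for j in range(n)) for row in phi_p]
        dbeta = solve(phi_a + phi_p, rhs)     # Eq. (4.30)
        beta = [b + d for b, d in zip(beta, dbeta)]
        dh = [t - v for t, v in zip(h_target, h(beta))]
    return beta


def toy_h(beta):
    """Hypothetical smooth forward map (N = 3 articulatory, M = 2 acoustic)."""
    b1, b2, b3 = beta
    return [b1 + 0.1 * b2 + 0.3 * b3 + 0.05 * b1 * b1,
            0.2 * b1 + b2 - 0.4 * b3 + 0.05 * b2 * b3]


H = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5]]
beta0 = [0.0, 0.0, 0.0]
target = toy_h([0.3, -0.2, 0.4])
beta_hat = invert(toy_h, target, H, beta0)
print(beta_hat, toy_h(beta_hat))
```

The recovered `beta_hat` reproduces the target acoustic vector but need not coincide with the vector used to generate it, since the procedure selects the solution of lowest positional cost among those compatible with the acoustic constraint.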


4.2 Trajectories

The log-area function moves smoothly due to dynamical constraints. Here, instead of modeling physically the dynamics of the vocal-tract, only a constraint of smoothness in time will be imposed. The dynamic case can be derived as an expansion of the static case. At the end of Chapter 3, it was seen that, for a given time interval, the trajectories of the components of vector $\boldsymbol{\beta}$ (or $\mathbf{a}$, in the case of representation by Fourier cosine series) can be approximated by a linear combination of eigenvectors whose components follow the approximate shape of cosine functions, as expressed by Eq. (3.42), rewritten below as

$$B_i = \Gamma_i V^t + E_i, \tag{4.32}$$

where $E_i$ is the approximation error matrix, $B_i$ is a matrix whose columns are a sequence of $p$ articulatory vectors $(\boldsymbol{\beta}_i, \ldots, \boldsymbol{\beta}_{i+p-1})$, and the $q$ columns of $V$ are eigenvectors whose sum, weighted by the coefficients contained in $\Gamma_i$, approximates $B_i$. Now, using Eq. (4.30), and dropping the time index $i$ for convenience, it is possible to write

$$\mathcal{M} \mathcal{G} = \mathcal{H}, \tag{4.33}$$

where

$$\mathcal{M} = \begin{bmatrix} \Phi(\boldsymbol{\beta}_i) & 0 & \cdots & 0 \\ 0 & \Phi(\boldsymbol{\beta}_{i+1}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi(\boldsymbol{\beta}_{i+p-1}) \end{bmatrix} \tag{4.34}$$

is a "matrix of matrices" containing the locally linear relation between a sequence of articulatory variations and their acoustic and cost counterparts;$^5$

$$\mathcal{G} = \begin{bmatrix} \Delta\boldsymbol{\beta}_i \\ \Delta\boldsymbol{\beta}_{i+1} \\ \vdots \\ \Delta\boldsymbol{\beta}_{i+p-1} \end{bmatrix} \tag{4.35}$$

$^5$ Compare with Eq. (4.30) and observe that the dependence of $\Phi$ on $\boldsymbol{\beta}$ is made explicit here.

contains a sequence of articulatory variations $\Delta\boldsymbol{\beta}_i$ vertically arranged as a column vector; and

$$\mathcal{H} = \begin{bmatrix} \tilde{\mathbf{h}}_i \\ \tilde{\mathbf{h}}_{i+1} \\ \vdots \\ \tilde{\mathbf{h}}_{i+p-1} \end{bmatrix} \tag{4.36}$$

contains a sequence of vectors $\tilde{\mathbf{h}}_i$, vertically arranged as a column vector, composed by the acoustic and positional cost constraints associated with the articulatory variation vectors contained in $\mathcal{G}$. The relation given by Eq. (4.32) can be applied to Eq. (4.33) yielding

$$\mathcal{M} \mathcal{V} \mathcal{X} = \mathcal{H} + \mathcal{E}, \tag{4.37}$$

where

$$\mathcal{V} = \begin{bmatrix} \mathbf{v}_1^t & \mathbf{0}^t & \cdots & \mathbf{0}^t \\ \mathbf{0}^t & \mathbf{v}_1^t & \cdots & \mathbf{0}^t \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^t & \mathbf{0}^t & \cdots & \mathbf{v}_1^t \\ \mathbf{v}_2^t & \mathbf{0}^t & \cdots & \mathbf{0}^t \\ \mathbf{0}^t & \mathbf{v}_2^t & \cdots & \mathbf{0}^t \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^t & \mathbf{0}^t & \cdots & \mathbf{v}_2^t \\ \vdots & \vdots & & \vdots \\ \mathbf{v}_p^t & \mathbf{0}^t & \cdots & \mathbf{0}^t \\ \mathbf{0}^t & \mathbf{v}_p^t & \cdots & \mathbf{0}^t \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^t & \mathbf{0}^t & \cdots & \mathbf{v}_p^t \end{bmatrix}, \qquad \mathbf{v}_k^t = [v_{k1} \; \cdots \; v_{kq}], \tag{4.38}$$

is an $Np \times Nq$ matrix whose nonzero elements $v_{ij}$ are the entries of $V$, and $\mathbf{0}^t$ is a row of $q$ zeros. Each row of $V$ is repeated $N$ times in the matrix shown above.
**

$$\mathcal{X} = [\gamma_{11,i}, \gamma_{12,i}, \ldots, \gamma_{1q,i}, \gamma_{21,i}, \gamma_{22,i}, \ldots, \gamma_{2q,i}, \ldots, \gamma_{N1,i}, \gamma_{N2,i}, \ldots, \gamma_{Nq,i}]^t \tag{4.39}$$

contains the entries of a variation of $\Gamma_i$, taken row by row and rearranged in a column vector and, finally, $\mathcal{E}$ is an $Np \times 1$ column vector containing the approximation error. Eq. (4.37) defines an overdetermined system which can be solved by minimizing a weighted version of the squared error

$$\varepsilon(\mathcal{X}) = \mathcal{E}^t H_\varepsilon \mathcal{E} = [\mathcal{M} \mathcal{V} \mathcal{X} - \mathcal{H}]^t H_\varepsilon [\mathcal{M} \mathcal{V} \mathcal{X} - \mathcal{H}], \tag{4.40}$$

where $H_\varepsilon$ is an $Np \times Np$ positive definite (Horn, 1985, p. 250) matrix which can be used to give different weights to acoustic and morphological constraints, and to different subintervals of the speech interval under analysis. This may be of interest when dealing with more complex speech intervals, or when studying the trade-off between effort and acoustic accuracy in speech (Lindblom, 1990). However, at the present stage of this study, such possibilities are not yet explored, and $H_\varepsilon$ is taken as an identity matrix. Minimization of $\varepsilon$ with respect to $\mathcal{X}$ is carried out as follows:

$$\frac{d\varepsilon}{d\mathcal{X}} = 0 \tag{4.41}$$

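$$\begin{aligned} &\implies [\mathcal{M}\mathcal{V}]^t H_\varepsilon [\mathcal{M}\mathcal{V}\mathcal{X} - \mathcal{H}] + [\mathcal{M}\mathcal{V}]^t H_\varepsilon^t [\mathcal{M}\mathcal{V}\mathcal{X} - \mathcal{H}] = 0 \\ &\implies [\mathcal{M}\mathcal{V}]^t (H_\varepsilon + H_\varepsilon^t) [\mathcal{M}\mathcal{V}\mathcal{X} - \mathcal{H}] = 0 \\ &\implies [\mathcal{M}\mathcal{V}]^t (H_\varepsilon + H_\varepsilon^t) [\mathcal{M}\mathcal{V}] \mathcal{X} = [\mathcal{M}\mathcal{V}]^t (H_\varepsilon + H_\varepsilon^t) \mathcal{H} \\ &\implies \mathcal{X} = \left\{ [\mathcal{M}\mathcal{V}]^t (H_\varepsilon + H_\varepsilon^t) [\mathcal{M}\mathcal{V}] \right\}^{-1} [\mathcal{M}\mathcal{V}]^t (H_\varepsilon + H_\varepsilon^t) \mathcal{H} \end{aligned} \tag{4.42}$$

and, for the particular case of $H_\varepsilon$ being an identity matrix,

$$\mathcal{X} = \left\{ [\mathcal{M}\mathcal{V}]^t [\mathcal{M}\mathcal{V}] \right\}^{-1} [\mathcal{M}\mathcal{V}]^t \mathcal{H}. \tag{4.43}$$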
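The normal equations above have the same form as an ordinary weighted least-squares problem. As a minimal sketch, the code below solves a small hypothetical overdetermined system (four equations, two unknowns) in which the matrix `A` plays the role of $\mathcal{M}\mathcal{V}$, `b` the role of $\mathcal{H}$, and `w` the role of the weighting matrix. The data and function names are assumptions made only for illustration.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("normal equations are singular")
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def weighted_least_squares(a, b, w):
    """Minimize (a x - b)^t w (a x - b) for a matrix a with two columns."""
    rows = len(a)
    # S = W + W^t, as in the derivation of the normal equations.
    s = [[w[i][j] + w[j][i] for j in range(rows)] for i in range(rows)]
    sa = [[sum(s[i][k] * a[k][j] for k in range(rows)) for j in range(2)]
          for i in range(rows)]
    sb = [sum(s[i][k] * b[k] for k in range(rows)) for i in range(rows)]
    lhs = [[sum(a[k][i] * sa[k][j] for k in range(rows)) for j in range(2)]
           for i in range(2)]
    rhs = [sum(a[k][i] * sb[k] for k in range(rows)) for i in range(2)]
    return solve_2x2(lhs, rhs)


def diagonal(weights):
    """Diagonal weighting matrix built from a list of weights."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical overdetermined system: a straight line through four points.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.1, 4.9, 7.0]

x_plain = weighted_least_squares(A, b, diagonal([1.0, 1.0, 1.0, 1.0]))
x_heavy = weighted_least_squares(A, b, diagonal([1.0, 1.0, 1.0, 100.0]))


def residual(x, row):
    """Signed error of equation `row` for solution x."""
    return A[row][0] * x[0] + A[row][1] * x[1] - b[row]


print(x_plain, residual(x_plain, 3))
print(x_heavy, residual(x_heavy, 3))
```

With the identity weighting, the result is the ordinary least-squares solution. Increasing the weight of one equation reduces its residual at the expense of the others, which is how a non-identity weighting matrix could trade acoustic accuracy against effort.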

Iterative solution

As in the case described in Section 4.1, the equality of Eq. (4.43) holds only for sufficiently small variations. For larger variations, the problem is solved by the same kind of Newton-Raphson procedure used in the previous section. The acoustic components of $\mathcal{H}$ are initialized by the trajectory given by the difference between a given sequence of acoustic vectors$^6$ and the acoustic vector determined by the articulatory neutral position. The minimum cost components of $\mathcal{H}$ are initially set to zero, and adapted as the articulatory trajectory vector $\mathcal{X}$ changes during the iterative procedure. For a given $\mathcal{X}$, $\Gamma_i$ is obtained by simply rearranging the entries of $\mathcal{X}$ as (see Eq. (4.39))

$$\Gamma_i = \begin{bmatrix} \gamma_{11,i} & \cdots & \gamma_{1q,i} \\ \vdots & & \vdots \\ \gamma_{N1,i} & \cdots & \gamma_{Nq,i} \end{bmatrix}, \tag{4.44}$$

and the corresponding articulatory trajectory is approximated by (see Eq. (3.42))

$$B_i \simeq \Gamma_i V^t + B_0, \tag{4.45}$$

where $B_0$ is the "neutral trajectory" determined by the vocal-tract sustaining the neutral position along the analyzed interval. Finally, the log-area vector trajectory is obtained from (see Eq. (3.29))

$$Y \simeq T_y B + \boldsymbol{\mu}_{0y}, \tag{4.46}$$

with $\boldsymbol{\mu}_{0y}$ added to every column.

$^6$ The acoustic vectors are determined from formant log-frequency vectors using Eq. (3.33).
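The role of the smoothness constraint in Eqs. (4.32) and (4.45) can be sketched as follows. Since the eigenvectors are described only as following the approximate shape of cosine functions, this sketch substitutes sampled orthonormal cosines for them, which is an assumption; the trajectories are hypothetical and are taken as deviations from the neutral trajectory. The sketch projects an $N \times p$ trajectory matrix onto $q$ basis vectors and measures the reconstruction error.

```python
import math


def cosine_basis(p, q):
    """p x q matrix whose columns are orthonormal sampled cosines."""
    basis = []
    for k in range(p):
        row = []
        for j in range(q):
            scale = math.sqrt((1.0 if j == 0 else 2.0) / p)
            row.append(scale * math.cos(math.pi * j * (k + 0.5) / p))
        basis.append(row)
    return basis


def project(traj, v):
    """Coefficient matrix Gamma = B V (N x q) for an N x p trajectory matrix B."""
    q = len(v[0])
    return [[sum(row[k] * v[k][j] for k in range(len(row))) for j in range(q)]
            for row in traj]


def reconstruct(gamma, v):
    """Trajectory approximation Gamma V^t (N x p)."""
    return [[sum(g[j] * v[k][j] for j in range(len(g))) for k in range(len(v))]
            for g in gamma]


def rms_error(a, b):
    """Root-mean-square difference between two matrices of equal shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))


p = 12  # number of frames in the interval (hypothetical)
# Two hypothetical articulatory components: a ramp and a smooth bump.
trajectory = [[k / (p - 1) for k in range(p)],
              [math.exp(-((k - 5.5) / 3.0) ** 2) for k in range(p)]]

errors = {}
for q in (2, 4, p):
    V = cosine_basis(p, q)
    errors[q] = rms_error(trajectory, reconstruct(project(trajectory, V), V))
print(errors)
```

Because the basis vectors are orthonormal, the coefficients are obtained by a simple projection, and the error cannot increase as more vectors are retained. Keeping only a few of them restricts the estimated trajectories to smooth movements.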


Comment

Now it is possible to estimate a plausible articulatory trajectory from a given acoustic trajectory. Conceptually, the method looks consistent. However, it must not be forgotten that the vocal-tract articulation effort was arbitrarily represented by a quadratic cost function, which is convenient from the mathematical point of view, but has no physiological basis. From the results presented in the next chapter, it will be possible to evaluate the performance of the method under the limitations imposed by this admittedly artificial effort measure.

Chapter 5 Results and Discussion

"Computers are useless, they only give answers." Pablo Ruiz y Picasso (1881–1973)

Some results obtained with the method described in the previous chapter are given here. The isolated frame procedure is analyzed in the first section. The second section analyzes the case of trajectories. Not surprisingly, the results obtained are not always perfect, since the quadratic cost function used to measure the effort of each vocal-tract position (or, in a broader sense, trajectory) was chosen more because it allows a simple mathematical minimization procedure than for physiological characteristics of the vocal-tract. Nevertheless, the results are enough to show that, using a simple relation between acoustic and articulatory parameters, it is possible to represent acoustic constraints in the articulatory space, and combine them directly with minimum effort and continuity constraints.


5.1 Isolated Frames

The procedure for inversion of the articulatory-to-acoustic mapping for isolated frames described in the previous chapter was applied to the oral vowel frames contained in the corpus. All 258 analyzed frames are shown in Appendix B. Some selected frames are used here to interpret the results. The four frames shown in Fig. 5.1 show good results obtained for different vowels. In general, the good agreement between original areas (derived from midsagittal distances) and estimated areas (obtained from the formant frequencies determined by the original areas) prevails in most of the frames analyzed. In spite of that, large discrepancies also exist, and it is exactly the understanding of the different types of discrepancies that may allow future improvements of the system.

Four types of error are shown in Fig. 5.2. From left to right, the first column shows the case of excessively open lips. A possible explanation for this comes from the fact that the inversion procedure is carried out in the log-area domain, which makes large areas less sensitive to errors than small areas. From the acoustic point of view this makes sense, since formant frequency variations also depend on the log-area rather than on the area function itself. However, from the articulatory point of view, considerably large, sometimes unacceptable, errors can occur.

The second column shows the case of an excessively large oral cavity behind a very narrow constriction at the lips. From the acoustic point of view, quite large variations of wide areas behind narrow constrictions have small effects on formant frequencies. From the articulatory point of view, the quadratic cost function used to evaluate the vocal-tract effort to reach a position seems to be very "mild" with respect to large variations in the oral cavity. A possible explanation for this comes from the axial symmetry of the quadratic cost function: since complete closures are accomplished in the oral cavity with little effort, and since a complete closure is associated with a "minus infinite log-area," very large oral areas also become associated with little effort. The clipping procedure to avoid closures used in Chapter 2 during the construction of the log-area corpus apparently was not enough to avoid this discrepancy. A possible solution for this problem would be the use of an asymmetric cost function, but this would make the cost minimization procedure more complex. Another possibility is to use continuity constraints in time. This is the subject of the next section. Although an exhaustive analysis of trajectory estimation has not been carried out yet, the cases


analyzed did not present this kind of discrepancy.

The third column of Fig. 5.2 shows a severely underestimated vocal-tract length. A clear explanation for this error has not been found yet, but this is one more case where continuity constraints in time could help.

The right column of Fig. 5.2 shows a case where an underestimation of the vocal-tract length is compensated by a partial lip closure. This kind of compensation is not unlikely to happen in real speech. This error shows that, even under correctly represented morphological constraints, more than one plausible shape of the vocal-tract can produce the same set of formants.

Finally, we call attention to the fact that, even when the inversion procedure failed to estimate the correct area function, the transfer functions associated with original and estimated areas match fairly well. This fact indicates that most of the articulatory errors observed have small acoustic effects.

5.2 Trajectories

Adding the continuity constraints explained in Section 4.2 to the method of combination of acoustic and morphological information described in Section 4.1, it was possible to estimate sequences of area functions from the corresponding first three formant trajectories. Some characteristics of the inversion procedure are illustrated as follows. In the example given in Fig. 5.3, the sequence of area functions shown in the top panel was used to generate the formant trajectories shown in the bottom panel. These trajectories were then used to recover the original sequence of areas, under minimum effort and continuity constraints. The result is shown in the top panel of Fig. 5.4. The search for the best sequence of areas was performed in the articulatory trajectory "$\Gamma$" space. Note, however, that the sequence of areas shown in Fig. 5.4 is close, but not identical, to that shown in Fig. 5.3. A possible reason for this is that the mathematical cost function used does not perfectly reflect the articulation effort determined by the human physiology. Another reason is that the parametrization procedure allows only an approximate reconstruction of the original sequence of areas. For comparison purposes, it is interesting to see the results when the same problem

[Figure 5.1: four columns of panels, one per frame (PB0311, PB2854, PB1560, PB1754). From top to bottom, each column shows: speech signal vs. time (ms); power spectrum (dB) vs. frequency (kHz); area function, area (cm2) vs. distance from glottis (cm); midsagittal distances (cm) vs. distance from glottis (cm); vocal-tract profile (cm).]

Figure 5.1: Results obtained with the inversion technique for isolated frames. In each column, the central panel shows original (thin line) and estimated area (thick line). The estimated area is obtained from the formant frequencies determined by the original area. Vocal-tract profile, midsagittal distances, transfer functions and speech signal are also shown for reference purposes. From left to right the columns correspond to the neutral, and French /a/, /i/, and /u/ vowels.

[Figure 5.2: four columns of panels, one per frame (PB0216, PB0230, PB0334, PB1518). From top to bottom, each column shows: speech signal vs. time (ms); power spectrum (dB) vs. frequency (kHz); area function, area (cm2) vs. distance from glottis (cm); midsagittal distances (cm) vs. distance from glottis (cm); vocal-tract profile (cm).]

Figure 5.2: Problems with the inversion procedure. The columns show the following cases. Left: French /a/ with excessively open lips. Center-left: French /u/ with excessively large front cavity. Center-right: French /i/ with excessively short length. Right: French /e/ with excessively closed lips compensating underestimated length. As in Fig. 5.1, in the central panel of each column, the thin line is the original area and the thick line is the area estimated from the formant frequencies determined by the original area.


is solved using a truncated Fourier series to represent the log-area function. We start with the analysis carried out by Mermelstein (1967), who parametrized the vocal-tract log-area function by the first six coefficients of its Fourier cosine series expansion. It was verified that, when the even coefficients are all equal to zero, as already mentioned, there exists a one-to-one relationship between the first three formant frequencies and the three odd Fourier coefficients. Using this property, an iterative procedure was implemented to find the unique set of odd Fourier coefficients associated with a given set of formant frequencies, when all even Fourier coefficients are equal to zero. This procedure was used to obtain the sequence of areas shown in Fig. 5.5 from the formant trajectories shown in Fig. 5.3. Note that the result is substantially different from the original sequence of areas (top panel of Fig. 5.3). This confirms the fact that setting all even Fourier coefficients to zero is an artificial constraint that does not reflect the geometrical constraints determined by the vocal-tract morphology.

When the mathematical framework described in the previous chapter to incorporate such morphological constraints is used with the Fourier representation described in Chapter 3, the sequence of areas shown in Fig. 5.6 is obtained. It can be seen that it resembles the original sequence of areas shown in Fig. 5.3. However, abrupt variations, inherent in some regions of the vocal-tract, cannot be well approximated, due to the smooth character of the cosine functions, which form the basis of the Fourier cosine series representation. This is in contrast with the eigenvectors used in the principal component representation, which allow a good representation of the vocal-tract structure. When the principal component representation is used in place of the Fourier representation, the result obtained is the sequence of areas shown in Fig. 5.4.

As a final observation, the similarity of the formant frequency trajectories associated with the sequences of areas shown in Figures 5.3, 5.4, 5.5, and 5.6 shows that, even under continuity constraints, substantially different sequences of area functions can generate basically the same formant trajectories.

[Figure 5.3: top panel "Original Area", area (cm2) vs. position (cm) and time (s); bottom panel "Formant Frequency Trajectories", frequency (kHz) vs. time (s).]

Figure 5.3: Top: Sequence of area functions, taken from the corpus, corresponding to the diphthong /ui/, uttered in the French sentence "Luis pense à ça." Bottom: Formant frequency trajectories corresponding to the sequence of areas shown in the top panel.

[Figure 5.4: top panel "Principal Components", area (cm2) vs. position (cm) and time (s); bottom panel "Formant Frequency Trajectories", frequency (kHz) vs. time (s). Max. Diff.: F1: 7.8%, F2: 6.4%, F3: 2.6%.]

Figure 5.4: Top: Sequence of areas estimated from the formant trajectories shown in the bottom panel of Fig. 5.3, under continuity and minimum effort constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequence of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

[Figure 5.5: top panel "Fourier (Odd Terms)", area (cm2) vs. position (cm) and time (s); bottom panel "Formant Frequency Trajectories", frequency (kHz) vs. time (s). Max. Diff.: F1: 3.2%, F2: 6.5%, F3: 4.7%.]

Figure 5.5: Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: the areas are represented by the first six components of their Fourier cosine series expansion with the even coefficients set to zero. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequence of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.

[Figure 5.6: top panel "Fourier (All Terms)", area (cm2) vs. position (cm) and time (s); bottom panel "Formant Frequency Trajectories", frequency (kHz) vs. time (s). Max. Diff.: F1: 2.1%, F2: 5.4%, F3: 3.2%.]

Figure 5.6: Top: Sequence of area functions estimated from the formant trajectories shown in Fig. 5.3 under the following constraint: the areas are represented by the first nine components of their Fourier cosine series expansion determined under morphological and continuity constraints. Bottom: The solid lines show the formant frequency trajectories corresponding to the sequence of areas shown in the top panel. The dashed lines reproduce the original formant trajectories shown in the bottom panel of Fig. 5.3.


5.3 Quantitative Analysis

The qualitative analysis carried out in Sections 5.1 and 5.2 was important for understanding the limitations of the method as well as the reasons for the discrepancies observed. In this section, a quantitative analysis based on the 258 frames of the corpus that correspond to oral vowels is carried out. The objective is to give a measure of the performance of the method and to understand its global behavior.

The first step in this analysis is to verify the influence of the parametrization procedure on the original cross-sectional areas. It is illustrated in Fig. 5.7a, where all cross-sectional areas in logarithmic scale¹ recovered from the parametric representation as vectors are plotted against their original counterparts. The correlation coefficient² (Papoulis, 1991, p. 152) of 0.965 indicates that the parametric representation is good, but that the error implied by it is not negligible. Looking at Table 5.1, the mean relative error of 0.9 % indicates that, globally, the parametrization procedure does not cause any significant bias. The standard deviation of the relative error, 21 %, is indeed significant, but still acceptable. In particular, note that the errors due to parametrization caused only small deviations in the acoustic space (see Fig. 5.7e and Table 5.3).

Next, original areas and areas estimated from isolated frames are compared (Fig. 5.7b and Table 5.1). A reasonably good correlation coefficient of 0.828 is obtained. Note, however, the minimal error in the acoustic space (Fig. 5.7f and Table 5.3). The standard deviation of the relative error, 56 %, is high, but it is bounded from below by the 21 % standard deviation of the parametrization error. Also, the very good matching observed in the acoustic space affects the accuracy of the areas estimated in the articulatory space.

When sequences of log-area vectors are estimated instead of isolated vectors, the correlation coefficient increases only marginally, from 0.828 to 0.832 (Fig. 5.7c and Table 5.1). Nevertheless, observing the scattering shown in Fig. 5.7d and the standard deviation of the relative error of 17 % listed in Table 5.1, it is seen that the two sets of estimated areas are significantly different. The estimation based on sequences of vectors yields, naturally, smoother trajectories of log-area vectors. The price paid for that is a small degradation in the matching observed in the acoustic space (Fig. 5.7g and Table 5.3).

Finally, the results obtained for length estimation (Table 5.2) show correlation coefficients considerably lower than those obtained for cross-sectional areas. This fact indicates the need for a more appropriate method to handle length information. The small values observed for the standard deviation of the relative error are due to the fact that length variations are small compared with the total vocal-tract length. This is in contrast with cross-sectional areas, which vary from values very close to zero up to several square centimeters.

Summarizing, the high correlation coefficients observed in the acoustic space confirm that the acoustic constraint imposed by the formant vectors is respected during the log-area vector estimation. In the articulatory space, correlation coefficients around 0.83 indicate that the model works, but still has to be improved.

¹ There are 258 log-area vectors, each of them containing 32 log-areas. So, each scattering in the top row of Fig. 5.7 contains 258 × 32 = 8256 points. In the bottom row, since there are three formants per vector, each scattering contains 774 points.

² The correlation coefficient was computed in logarithmic scale as $E[\log A_1 \log A_2] / \sqrt{E[(\log A_1)^2]\,E[(\log A_2)^2]}$.

Table 5.1: Numerical Results: Areas

| | Mean Diff. | Std. Dev. | Corr. Coef. |
|---|---|---|---|
| Parametrized vs. Original | 0.9 % | 21 % | 0.965 |
| Isolated Frames vs. Original | 4.1 % | 56 % | 0.828 |
| Trajectories vs. Original | 3.9 % | 54 % | 0.832 |
| Isolated Frames vs. Trajectories | -0.2 % | 17 % | 0.979 |
| Trajectories vs. Parametrized | 3.0 % | 45 % | 0.872 |

Table 5.2: Numerical Results: Length

| | Mean Diff. | Std. Dev. | Corr. Coef. |
|---|---|---|---|
| Parametrized vs. Original | 0.44 % | 0.7 % | 0.987 |
| Isolated Frames vs. Original | -0.19 % | 4.2 % | 0.607 |
| Trajectories vs. Original | 0.19 % | 3.9 % | 0.635 |
| Isolated Frames vs. Trajectories | -0.38 % | 1.9 % | 0.928 |
| Trajectories vs. Parametrized | -0.25 % | 4.0 % | 0.603 |


Table 5.3: Numerical Results: Formants

| | Mean Diff. | Std. Dev. | Corr. Coef. |
|---|---|---|---|
| Parametrized vs. Original | -0.40 % | 1.5 % | 0.99925 |
| Isolated Frames vs. Original | -0.04 % | 0.2 % | 0.99999 |
| Trajectories vs. Original | -0.10 % | 1.3 % | 0.99936 |
| Isolated Frames vs. Trajectories | 0.06 % | 1.3 % | 0.99937 |
| Trajectories vs. Parametrized | 0.30 % | 1.8 % | 0.99884 |
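The three statistics reported in Tables 5.1 to 5.3 can be illustrated with a minimal sketch. It assumes that the relative error is normalized by the reference value and that the standard deviation is the population one; neither choice is stated in the text. The correlation follows the logarithmic-scale formula of footnote 2, reading the second factor under the square root as the mean of (log A2)². The function names and the toy data are hypothetical and are not taken from the corpus.

```python
import math
from statistics import mean, pstdev


def relative_error_stats(estimated, reference):
    """Return (mean, standard deviation) of the relative error, in percent.

    Assumption: the error is normalized by the reference value.
    """
    rel = [(e - r) / r for e, r in zip(estimated, reference)]
    return 100.0 * mean(rel), 100.0 * pstdev(rel)


def log_correlation(a1, a2):
    """Correlation coefficient computed in logarithmic scale.

    Implements E[log A1 log A2] / sqrt(E[(log A1)^2] E[(log A2)^2]),
    with expectations replaced by sample means over all points.
    """
    l1 = [math.log(a) for a in a1]
    l2 = [math.log(a) for a in a2]
    num = mean([x * y for x, y in zip(l1, l2)])
    den = math.sqrt(mean([x * x for x in l1]) * mean([y * y for y in l2]))
    return num / den


# Toy areas in cm^2 (made-up values, for illustration only).
original = [0.5, 2.0, 4.0, 1.5]
estimated = [0.6, 1.8, 4.4, 1.5]

mean_diff, std_dev = relative_error_stats(estimated, original)
print(f"mean diff = {mean_diff:.1f} %, std dev = {std_dev:.1f} %")
print(f"log-scale correlation = {log_correlation(estimated, original):.3f}")
```

With the corpus data, each area comparison would pool 258 × 32 = 8256 points and each formant comparison 258 × 3 = 774 points, as stated in footnote 1.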

[Figure 5.7: scatter plots on logarithmic axes. Top row, area (cm²) against area (cm²): (a) Param. vs. Orig., 0.965; (b) Isol. Frm. vs. Orig., 0.828; (c) Traject. vs. Orig., 0.832; (d) Isol. Frm. vs. Traject., 0.979. Bottom row, frequency (kHz) against frequency (kHz): (e) 0.99925; (f) 0.99999; (g) 0.99936; (h) 0.99937.]

Figure 5.7: (a) Scattering of the cross-sectional areas obtained from the parametric principal component representation of the original areas plotted against their original counterparts. The scattering of the formant frequencies derived from the areas is shown in (e). (b) Cross-sectional areas estimated from formant vectors in the case of isolated frames plotted against original areas. The formant frequencies derived from the areas are plotted in (f). (c) Cross-sectional areas estimated from formant vector trajectories plotted against original areas. The formant frequencies derived from the areas are plotted in (g). (d) Cross-sectional areas estimated from isolated frames plotted against areas estimated from formant vector trajectories. The formant frequencies derived from the areas are plotted in (h). The correlation coefficients are given in the top right corner. The 258 oral vowel frames available in the corpus were used to generate the scatterings.

Chapter 6 Conclusion

"Words are words." William Shakespeare (1564–1616), Othello, Act I, Scene III

In this study, a method to combine different pieces of information in a restricted case of the speech production inverse problem, namely the formant-to-area determination problem, was presented. The initial formulation is based on a Fourier analysis of the vocal-tract log-area function, already described by Mermelstein (1967). The novelty is that vocal-tract morphological constraints are invoked to cope with the underdetermined problem of obtaining a complete set of log-area Fourier coefficients from formant frequencies. After that, the Fourier representation is replaced by an optimal principal component representation of the log-area function, which allows a better characterization of the vocal-tract. As a final point, the analysis is generalized from isolated frames to trajectories of log-area parameters. This allows a natural implementation of continuity constraints in the articulatory domain.

The implemented system uses a Newton-Raphson iterative procedure to solve the non-linear system that arises in the framework formulation. The solution took on average four iterations to converge, usually, but not always, to a position close to the right solution. It is, in principle, more efficient than analysis-by-synthesis techniques (Shirai and Kobayashi, 1986; Schroeter and Sondhi, 1991), which require a much larger number of iterations. Also, it gives better insight into the problem than the neural network (Shirai, 1993) and the genetic algorithm (McGowan, 1994) approaches.

The main weak point of the method developed in this study is the limited flexibility of the cost function chosen to quantify the vocal-tract effort during speech production. The quadratic form used has the merit of allowing a simple minimization procedure, but does not reflect well the vocal-tract effort for positions far from the neutral articulatory position. In spite of that, the method worked satisfactorily for most of the analyzed cases.

From the experimental results, since there was always a very good match between reference and estimated transfer functions (at least up to the third formant region), and since the match between reference and estimated areas was not perfect, it is possible to conclude that the regions that did not match well were mainly those that have little influence on the vocal-tract acoustic response (up to the third formant region). One important point observed during the analysis is that, when appropriately represented, the mapping between articulatory and acoustic properties of the human vocal-tract is not complex, having a dominant linear component.

Another interesting conclusion is that, since the obtained transfer functions were derived only from the formant frequencies, making use of prior information about morphological and continuity constraints, and since a good spectral match was obtained (up to the third formant region), it is possible to derive the vocal-tract transfer function from the formant frequencies if morphological information is available (cf. Fant, 1956). It remains to be shown whether human beings make use of such redundancy and, if so, in what way.
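The kind of Newton-Raphson iteration referred to above can be illustrated on a toy problem. The sketch below solves a made-up system of two non-linear equations in two unknowns; it is not the formant-to-area system of this study, and the functions `residual`, `jacobian` and `newton_raphson` are hypothetical names introduced only for illustration.

```python
def residual(p):
    """Toy non-linear system f(p) = 0 (hypothetical, for illustration)."""
    x, y = p
    return (x * x + y * y - 4.0, x * y - 1.0)


def jacobian(p):
    """Analytic Jacobian of the toy system above."""
    x, y = p
    return ((2.0 * x, 2.0 * y), (y, x))


def newton_raphson(f, jac, p0, tol=1e-10, max_iter=50):
    """Iterate p <- p + step, where J(p) * step = -f(p), until the step is small."""
    p = tuple(p0)
    for iteration in range(1, max_iter + 1):
        f1, f2 = f(p)
        (a, b), (c, d) = jac(p)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve the 2 x 2 linear system J * step = -f by Cramer's rule.
        dx = (-f1 * d + f2 * b) / det
        dy = (-f2 * a + f1 * c) / det
        p = (p[0] + dx, p[1] + dy)
        if max(abs(dx), abs(dy)) < tol:
            return p, iteration
    raise RuntimeError("Newton-Raphson did not converge")


solution, n_iter = newton_raphson(residual, jacobian, (2.0, 0.5))
print(solution, n_iter, residual(solution))
```

As in the study, convergence depends on the starting point: the iteration converges quickly when started close to a solution, but it is not guaranteed to reach the desired one from an arbitrary initial position.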

**Appendix A Numeric Information**

Most of the numeric information used in the implementation of the vocal-tract parametric model described in this study was not included in the main text. Instead, for practical purposes, it is given in this appendix, and can be used by the interested reader to implement, test and analyze the proposed model. In order to do this, some observations are important. The first is that the tract length is expressed in normalized units, which can be converted into centimetres as follows:

1 length unit = 0.534 cm.

The second observation concerns the procedure used to "fill" the articulatory space in Section 3.2.2: first, a sufficiently high number of points is generated uniformly in the hyperrectangle defined by the bounds min and max. After that, the corresponding log-area vectors are calculated, and those that exceed the limits defined by ymin and ymax are discarded, since they probably correspond either to unrealistic area functions or to areas with constrictions that are too narrow. The final observation concerns the procedure used to estimate the formants associated with a given area function: they can be determined using the wave propagation model described in Section 2.2.3.
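The first two observations can be sketched as follows. The length conversion uses the factor given above. The filling procedure is written as plain rejection sampling: points are drawn uniformly inside a hyperrectangle, mapped to log-area vectors, and kept only if every component lies within its limits. The mapping `toy_log_area`, its coefficients and all bounds in the example are hypothetical placeholders chosen for illustration; the actual mapping is the parametric model described in the main text.

```python
import random

LENGTH_UNIT_CM = 0.534  # 1 normalized length unit in centimetres


def units_to_cm(length_units):
    """Convert a tract length from normalized units to centimetres."""
    return length_units * LENGTH_UNIT_CM


def fill_space(n_points, p_min, p_max, to_log_area, y_min, y_max, seed=0):
    """Uniformly sample parameter vectors and discard out-of-range log-areas."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_points):
        # Uniform draw inside the hyperrectangle [p_min, p_max].
        p = [rng.uniform(lo, hi) for lo, hi in zip(p_min, p_max)]
        y = to_log_area(p)
        # Keep the point only if every component respects its limits.
        if all(lo <= v <= hi for v, lo, hi in zip(y, y_min, y_max)):
            kept.append((p, y))
    return kept


# Hypothetical toy model: 2 parameters mapped linearly to 3 log-areas.
TOY_MEAN = [0.5, 1.0, 0.2]
TOY_BASIS = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]


def toy_log_area(p):
    return [m + sum(b * q for b, q in zip(row, p))
            for m, row in zip(TOY_MEAN, TOY_BASIS)]


P_MIN, P_MAX = [-2.0, -2.0], [2.0, 2.0]
Y_MIN, Y_MAX = [-1.0, -0.5, -1.0], [1.5, 2.0, 1.2]

points = fill_space(1000, P_MIN, P_MAX, toy_log_area, Y_MIN, Y_MAX)
print(len(points), "of 1000 sampled points kept")
print(units_to_cm(30.0), "cm for a tract length of 30 units")
```

A fixed seed is used only to make the sketch reproducible; the number of points to generate is left to the user, in line with the "sufficiently high number" mentioned above.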


U y = (33 rows, 5 columns; each line below lists one column)

0.006 0.010 0.017 0.030 0.039 0.057 0.083 0.101 0.111 0.113 0.105 0.091 0.074 0.072 0.078 0.076 0.056 0.011 0.043 0.102 0.184 0.264 0.307 0.324 0.334 0.340 0.348 0.331 0.288 0.202 0.005 0.099 0.006

0.019 0.035 0.045 0.026 0.050 0.087 0.082 0.071 0.054 0.033 0.014 0.001 0.009 0.017 0.043 0.059 0.136 0.226 0.220 0.205 0.216 0.205 0.169 0.125 0.081 0.026 0.077 0.208 0.296 0.262 0.157 0.304 0.584

0.004 0.026 0.036 0.012 0.002 0.023 0.029 0.042 0.053 0.059 0.060 0.058 0.052 0.048 0.056 0.055 0.063 0.092 0.128 0.135 0.136 0.130 0.099 0.062 0.028 0.010 0.052 0.106 0.165 0.247 0.514 0.712 0.002

0.004 0.010 0.016 0.110 0.030 0.013 0.004 0.034 0.059 0.079 0.088 0.090 0.084 0.106 0.154 0.170 0.208 0.233 0.199 0.131 0.047 0.048 0.120 0.157 0.156 0.131 0.062 0.112 0.355 0.537 0.165 0.259 0.353

0.052 0.130 0.177 0.142 0.000 0.136 0.142 0.161 0.159 0.137 0.112 0.091 0.080 0.102 0.159 0.170 0.177 0.199 0.232 0.227 0.217 0.192 0.129 0.068 0.019 0.020 0.048 0.020 0.006 0.002 0.111 0.269 0.599

y = (33 rows, 1 column)

0.82 0.46 0.02 0.39 0.81 1.00 1.19 1.13 1.04 0.97 0.98 1.07 1.21 1.35 1.35 1.25 1.08 0.62 0.31 0.39 0.43 0.36 0.34 0.32 0.27 0.18 0.04 0.02 0.11 0.19 0.22 0.11 28.19

[ymin ymax] = (33 rows, 2 columns; listed row by row, rows separated by semicolons)

0.28 1.22; 0.12 1.14; 0.82 1.17; 0.76 1.13; 0.36 1.47; 0.19 1.87; 0.20 1.93; 0.12 1.80; 0.26 1.67; 0.49 1.58; 0.55 1.54; 0.37 1.56; 0.05 1.64; 0.21 1.77; 0.49 1.91; 0.14 1.93; 1.47 1.90; 2.74 1.68; 3.00 1.43; 3.00 1.44; 3.00 1.65; 3.00 1.84; 3.00 1.92; 3.00 1.95; 3.00 2.01; 3.00 2.04; 3.00 2.09; 3.00 2.15; 3.00 2.17; 3.00 2.11; 3.00 1.77; 3.00 1.86; 25.92 33.40


In the listings below, each line gives one column of the corresponding matrix.

= (5 rows, 1 column)

0.334 1.300 0.027 0.572 0.407

T = (5 rows, 5 columns)

0.053 0.348 0.225 0.143 0.519

0.655 0.376 0.006 0.195 0.142

0.356 0.541 0.440 0.747 0.248

0.743 0.121 0.859 0.542 0.638

0.701 0.624 1.001 0.018 0.469

[min max] = (5 rows, 2 columns)

6.120 7.643 4.146 2.678 2.400

8.763 3.452 4.535 2.194 2.279

U = (5 rows, 5 columns)

0.164 0.813 0.378 0.410 0.032

0.458 0.030 0.046 0.012 0.887

0.185 0.264 0.429 0.836 0.115

0.854 0.083 0.045 0.253 0.444

0.000 0.511 0.818 0.263 0.029

f = (3 rows, 1 column)

2.60 3.24 3.43

Tfg = (3 rows, 3 columns)

15.4 8.3 5.2

7.0 23.0 7.2

18.5 12.7 35.4

Uhg = (3 rows, 3 columns)

0.063 0.692 0.719

0.970 0.128 0.208

0.236 0.711 0.663

Tfh = (3 rows, 3 columns)

2.8 17.0 5.6

10.1 1.7 22.8

18.3 8.1 37.2

Sh = (3 rows, 5 columns)

0.457 0.105 0.084

0.007 0.641 0.499

0.068 0.399 0.140

0.109 0.008 0.560

0.810 0.126 0.237

0 y = (33 rows, 1 column)

0.83 0.45 0.02 0.55 0.90 1.08 1.26 1.17 1.05 0.95 0.94 1.01 1.14 1.25 1.17 1.04 0.75 0.13 0.18 0.05 0.01 0.02 0.05 0.13 0.15 0.13 0.08 0.13 0.20 0.17 0.09 0.18 28.89

Tty = (33 rows, 5 columns)

0.011 0.021 0.035 0.027 0.016 0.032 0.016 0.002 0.019 0.034 0.043 0.046 0.043 0.054 0.080 0.092 0.130 0.153 0.103 0.047 0.009 0.075 0.122 0.148 0.162 0.171 0.188 0.175 0.105 0.032 0.300 0.353 0.409

0.019 0.043 0.065 0.023 0.045 0.092 0.103 0.110 0.108 0.096 0.080 0.063 0.049 0.042 0.036 0.027 0.018 0.069 0.069 0.080 0.112 0.136 0.151 0.156 0.162 0.166 0.160 0.145 0.149 0.184 0.279 0.269 0.370

0.007 0.005 0.001 0.020 0.008 0.016 0.015 0.009 0.001 0.011 0.018 0.023 0.021 0.013 0.009 0.007 0.023 0.060 0.093 0.110 0.128 0.140 0.127 0.102 0.071 0.030 0.031 0.132 0.240 0.335 0.478 0.604 0.149

0.043 0.089 0.150 0.025 0.014 0.102 0.095 0.085 0.063 0.030 0.003 0.012 0.013 0.011 0.003 0.004 0.002 0.016 0.057 0.104 0.171 0.228 0.231 0.205 0.151 0.079 0.044 0.224 0.451 0.584 0.183 0.896 0.661

0.056 0.161 0.191 0.256 0.021 0.153 0.176 0.226 0.247 0.241 0.219 0.195 0.176 0.218 0.324 0.349 0.383 0.424 0.432 0.365 0.271 0.154 0.016 0.087 0.139 0.158 0.117 0.085 0.344 0.491 0.178 0.092 0.401


Ty = (33 rows, 5 columns; each line below lists one column)

0.024 0.047 0.070 0.002 0.017 0.056 0.023 0.009 0.042 0.073 0.090 0.094 0.086 0.096 0.129 0.144 0.203 0.240 0.161 0.066 0.031 0.145 0.228 0.278 0.312 0.346 0.409 0.447 0.396 0.209 0.228 0.300 0.765

0.024 0.044 0.069 0.069 0.089 0.143 0.167 0.176 0.172 0.154 0.128 0.099 0.074 0.061 0.042 0.024 0.064 0.184 0.226 0.270 0.355 0.422 0.438 0.425 0.410 0.388 0.340 0.263 0.203 0.180 0.244 0.184 0.526

0.011 0.005 0.003 0.029 0.044 0.051 0.046 0.029 0.007 0.014 0.031 0.042 0.044 0.046 0.070 0.081 0.148 0.248 0.283 0.288 0.308 0.305 0.256 0.193 0.129 0.051 0.069 0.229 0.365 0.433 0.370 0.440 0.437

0.019 0.042 0.070 0.017 0.000 0.040 0.036 0.032 0.024 0.011 0.000 0.005 0.005 0.003 0.006 0.008 0.014 0.036 0.063 0.089 0.127 0.158 0.156 0.137 0.104 0.062 0.009 0.112 0.236 0.312 0.039 0.357 0.265

0.026 0.077 0.093 0.126 0.016 0.066 0.074 0.096 0.105 0.102 0.093 0.084 0.077 0.097 0.147 0.159 0.180 0.211 0.226 0.204 0.173 0.129 0.068 0.017 0.012 0.026 0.015 0.065 0.168 0.215 0.045 0.094 0.190

**Appendix B Results for Isolated Frames**

The inverse problem algorithm for isolated frames was applied to all oral vowel frames present in the analyzed corpus. The results are shown in the next pages. From the bottom, each set of panels shows:

- Vocal-tract midsagittal profile extracted from cineradiographic data (Bothorel et al., 1986), plotted on a semi-polar grid, together with the labiogram acquired simultaneously.
- Midsagittal distances computed with the method described in Section 2.1.2.
- Area functions. The thin line shows the area function estimated from the midsagittal profile using the model described in Section 2.1.3. The formants determined by this area function (using the method described in Section 2.2.3) are used to estimate the area function shown by the thick line, using the method described in Chapter 4.
- Transfer functions and spectrum. Thin black line: vocal-tract transfer function determined from the area function estimated from the midsagittal profile. Thick black line: vocal-tract transfer function determined from the area function estimated from formant frequencies. Gray line: power spectrum envelope estimated from the speech signal.
- Speech signal recorded during the acquisition of the midsagittal profile shown in the bottom panel.

92

93

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0108

1 Speech Signal 0 −1 0 40 20 0

PB0109

1 Speech Signal 0 −1 0 40 20 0

PB0110

1 Speech Signal 0 −1 0 40 20 0

PB0111

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0108

Power Spectrum (dB)

PB0109

Power Spectrum (dB)

PB0110

Power Spectrum (dB)

PB0111

2 3 4 Frequency (kHz) PB0108

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0109

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0110

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0111

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

Midsag. Dist. (cm)

Midsag. Dist. (cm)

Midsag. Dist. (cm)

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0108

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0109

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0110

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0111

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0108

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0109

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0110

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0111

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 0 15 0 0 5

10 5 5 0

10 5 5 0

cm

10

cm

10

15

0 0

5

cm

10

15

0 0

5

cm

10

15

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0112

1 Speech Signal 0 −1 0 40 20 0

PB0121

1 Speech Signal 0 −1 0 40 20 0

PB0122

1 Speech Signal 0 −1 0 40 20 0

PB0123

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0112

Power Spectrum (dB)

PB0121

Power Spectrum (dB)

PB0122

Power Spectrum (dB)

PB0123

2 3 4 Frequency (kHz) PB0112

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0121

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0122

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0123

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm)

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0121

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0122 Midsag. Dist. (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0123

Midsagittal Distances PB0112

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0112

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0121

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0122

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0123

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5

10 5 5 0

15

10 5 5 0 15 0 0 5 5

cm

10 5 0 15 0 0 5

cm

10

0 0

5

cm

10

cm

10

cm

10

15

94

APPENDIX B. RESULTS FOR ISOLATED FRAMES

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0124

1 Speech Signal 0 −1 0 40 20 0

PB0125

1 Speech Signal 0 −1 0 40 20 0

PB0130

1 Speech Signal 0 −1 0 40 20 0

PB0131

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0124

Power Spectrum (dB)

PB0125

Power Spectrum (dB)

PB0130

Power Spectrum (dB)

PB0131

2 3 4 Frequency (kHz) PB0124

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0125

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0130

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0131

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

Midsag. Dist. (cm)

Midsag. Dist. (cm)

Midsagittal Distances PB0124

4 2 0 0

Midsagittal Distances PB0125

4 2 0 0

Midsagittal Distances PB0130

Midsag. Dist. (cm)

5 10 15 Distance from Glottis (cm)

0 0

5 10 15 Distance from Glottis (cm)

0 0

5 10 15 Distance from Glottis (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0131

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0124

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0125

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0130

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0131

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 0 15 0 0 5

10 5 5 0 5

cm

10 5 0 15 0 0 5

cm

10

cm

10

15

0 0

5

cm

10

cm

10

15

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0132

1 Speech Signal 0 −1 0 40 20 0

PB0133

1 Speech Signal 0 −1 0 40 20 0

PB0134

1 Speech Signal 0 −1 0 40 20 0

PB0136

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0132

Power Spectrum (dB)

PB0133

Power Spectrum (dB)

PB0134

Power Spectrum (dB)

PB0136

2 3 4 Frequency (kHz) PB0132

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0133

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0134

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0136

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0132

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm)

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0134 Midsag. Dist. (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0136

Midsagittal Distances PB0133

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0132

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0133

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0134

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0136

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5 5

10

10 5 5 5 0

15

cm

10 5 0 15 0 0 5

5 0 15 0 0 5

cm

10

cm

10

0 0

5

cm

10

cm

10

15

95

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0137

1 Speech Signal 0 −1 0 40 20 0

PB0138

1 Speech Signal 0 −1 0 40 20 0

PB0139

1 Speech Signal 0 −1 0 40 20 0

PB0140

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0137

Power Spectrum (dB)

PB0138

Power Spectrum (dB)

PB0139

Power Spectrum (dB)

PB0140

2 3 4 Frequency (kHz) PB0137

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0138

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0139

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0140

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0137

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0138

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0139 Midsag. Dist. (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0140

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0137

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0138

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0139

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0140

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 5 0 15 0 0 5

10 5 5 0 15 0 0 5

cm

10 5 0 15 0 0 5

cm

10

cm

10

cm

10

cm

10

15

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0141

1 Speech Signal 0 −1 0 40 20 0

PB0142

1 Speech Signal 0 −1 0 40 20 0

PB0154

1 Speech Signal 0 −1 0 40 20 0

PB0155

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0141

Power Spectrum (dB)

PB0142

Power Spectrum (dB)

PB0154

Power Spectrum (dB)

PB0155

2 3 4 Frequency (kHz) PB0141

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0142

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0154

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0155

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

Midsag. Dist. (cm)

Midsag. Dist. (cm)

Midsag. Dist. (cm)

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0141

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0142

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0154

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0155

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0141

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0142

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0154

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0155

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 0 15 0 0 5

10 5 5 0

10 5 5 0

cm

10

cm

10

15

0 0

5

cm

10

15

0 0

5

cm

10

15

96

APPENDIX B. RESULTS FOR ISOLATED FRAMES

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0156

1 Speech Signal 0 −1 0 40 20 0

PB0157

1 Speech Signal 0 −1 0 40 20 0

PB0158

1 Speech Signal 0 −1 0 40 20 0

PB0159

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0156

Power Spectrum (dB)

PB0157

Power Spectrum (dB)

PB0158

Power Spectrum (dB)

PB0159

2 3 4 Frequency (kHz) PB0156

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0157

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0158

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0159

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm)

0 0 Midsag. Dist. (cm) 4 2 0 0

Midsag. Dist. (cm)

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0157

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0158

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0159

Midsagittal Distances PB0156

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0156

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0157

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0158

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0159

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5

10 5 5 0

15

10 5 5 0

cm

10 5 5 0

15

cm

10

0 0

5

cm

10

15

0 0

5

cm

10

0 0

5

cm

10

15

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0210

1 Speech Signal 0 −1 0 40 20 0

PB0211

1 Speech Signal 0 −1 0 40 20 0

PB0212

1 Speech Signal 0 −1 0 40 20 0

PB0213

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0210

Power Spectrum (dB)

PB0211

Power Spectrum (dB)

PB0212

Power Spectrum (dB)

PB0213

2 3 4 Frequency (kHz) PB0210

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0211

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0212

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0213

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0210

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0211

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0212 Midsag. Dist. (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0213

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0210

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0211

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0212

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0213

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 5 0 15 0 0 5

10 5 5 0 15 0 0 5

cm

10 5 0 15 0 0 5

cm

10

cm

10

cm

10

cm

10

15

97

1 Speech Signal 0 −1 0 40 20 0 0 10 Area (cm2) 1 5 10 Time (ms)

PB0215

1 Speech Signal 0 −1 0 40 20 0

PB0216

1 Speech Signal 0 −1 0 40 20 0

PB0217

1 Speech Signal 0 −1 0 40 20 0

PB0218

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

5

10 Time (ms)

15

20

Power Spectrum (dB)

PB0215

Power Spectrum (dB)

PB0216

Power Spectrum (dB)

PB0217

Power Spectrum (dB)

PB0218

2 3 4 Frequency (kHz) PB0215

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0216

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0217

5

0 10 Area (cm2)

1

2 3 4 Frequency (kHz) PB0218

5

Area Function

Area Function

Area Function

Area Function

5

5

5

5

0 0 Midsag. Dist. (cm) 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0215

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsag. Dist. (cm) Midsagittal Distances PB0216

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0217 Midsag. Dist. (cm)

0 0 4 2 0 0

5 10 15 Distance from Glottis (cm) Midsagittal Distances PB0218

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0215

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0216

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0217

5 10 15 Distance from Glottis (cm) Vocal−Tract Profile 20 25 30 PB0218

15

15

15

15

10

15

10

15

10

15

10

15

cm

cm

cm

10 5 5 0 0 0 5 5

10 5 5 0 15 0 0 5

10 5 5 0 15 0 0 5

cm

10 5 0 15 0 0 5

cm

10

cm

10

cm

10

cm

10

15

[Figure: results for frames PB0219, PB0222, PB0223 and PB0224. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

APPENDIX B. RESULTS FOR ISOLATED FRAMES

[Figure: results for frames PB0229, PB0230, PB0231 and PB0232. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0233, PB0234, PB0241 and PB0242. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0243, PB0244, PB0245 and PB0310. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0311, PB0312, PB0313 and PB0314. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0315, PB0331, PB0332 and PB0333. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0334, PB0343, PB0344 and PB0345. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0346, PB0347, PB0348 and PB0349. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0805, PB0806, PB0807 and PB0812. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0813, PB0814, PB0815 and PB0838. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0839, PB0840, PB0841 and PB0842. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0843, PB0848, PB0849 and PB0850. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0851, PB0852, PB0853 and PB0854. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0855, PB0856, PB0916 and PB0917. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0918, PB0919, PB0920 and PB0921. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0922, PB0923, PB0924 and PB0925. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0944, PB0945, PB0946 and PB0947. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0948, PB0949, PB0950 and PB0961. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0962, PB0963, PB0964 and PB0965. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB0966, PB0967, PB0968 and PB1515. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB1516, PB1517, PB1518 and PB1519. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure: results for frames PB1526, PB1527, PB1528 and PB1529. Panels per frame: Speech Signal vs. Time (ms); Power Spectrum (dB) vs. Frequency (kHz); Area Function, Area (cm²) vs. Distance from Glottis (cm); Midsagittal Distances, Midsag. Dist. (cm) vs. Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure panels for frames PB1534, PB1535, PB1536, PB1537: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1538, PB1539, PB1540, PB1544: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1545, PB1546, PB1547, PB1548: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1549, PB1550, PB1556, PB1557: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1558, PB1559, PB1560, PB1561: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1562, PB1714, PB1715, PB1716: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1719, PB1720, PB1721, PB1722: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1729, PB1730, PB1731, PB1732: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1733, PB1738, PB1739, PB1740: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1741, PB1742, PB1743, PB1754: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1755, PB1756, PB1757, PB1758: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1820, PB1821, PB1822, PB1823: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1832, PB1833, PB1834, PB1835: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1836, PB1837, PB1845, PB1846: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1847, PB1848, PB1849, PB1850: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1851, PB1852, PB1856, PB1857: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1858, PB1859, PB1860, PB1870: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB1871, PB1872, PB1873, PB1874: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB1875, PB1876, PB2413, PB2414: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure panels for frames PB2415, PB2418, PB2419, PB2420: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]

[Figure panels for frames PB2421, PB2425, PB2426, PB2427: Speech Signal, Power Spectrum (dB), Area Function, Midsagittal Distances, Vocal-Tract Profile. Axis labels: Time (ms), Frequency (kHz), Area (cm2), Midsag. Dist. (cm), Distance from Glottis (cm).]


[Figure, frames PB2428, PB2431, PB2432, PB2433. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2434, PB2435, PB2441, PB2442. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


APPENDIX B. RESULTS FOR ISOLATED FRAMES

[Figure, frames PB2443, PB2444, PB2445, PB2446. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2447, PB2448, PB2449, PB2450. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


[Figure, frames PB2451, PB2810, PB2811, PB2812. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2813, PB2814, PB2823, PB2824. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


[Figure, frames PB2825, PB2826, PB2827, PB2828. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2834, PB2835, PB2836, PB2837. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


[Figure, frames PB2838, PB2842, PB2843, PB2844. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2845, PB2846, PB2847, PB2851. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


[Figure, frames PB2852, PB2853, PB2854, PB2855. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]

[Figure, frames PB2856, PB2857, PB2858, PB2859. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


[Figure, frames PB2860, PB2861. Panels per frame: Speech Signal against Time (ms); Power Spectrum (dB) against Frequency (kHz); Area Function, Area (cm2); Midsagittal Distances (cm) against Distance from Glottis (cm); Vocal-Tract Profile (cm).]


List of Publications

Journal Papers

H. Yehia and F. Itakura, "A method to combine acoustical and morphological constraints in the speech production inverse problem," Speech Communication, 18(2):151–174, 1996.

H. Yehia, K. Takeda, and F. Itakura, "An acoustically oriented vocal-tract model," IEICE Transactions on Information and Systems, E79-D(8):1198–1208, 1996.

H. Yehia, K. Takeda, and F. Itakura, "An analysis of the acoustic-to-articulatory mapping during speech under morphological and continuity constraints," submitted to Speech Communication.

International Conferences

H. Yehia and F. Itakura, "Determination of human vocal-tract dynamic geometry from formant trajectories using spatial and temporal Fourier analysis," In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 477–480, 1994.

H. Yehia, M. Honda, and F. Itakura, "Acoustic measurements of the vocal-tract area function: sensitivity analysis and experiments," In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 652–655, 1995.

M. K. Tiede, H. Yehia, and E. Vatikiotis-Bateson, "A shape-based approach to vocal tract area function estimation," In Proceedings of the 1st ESCA Tutorial and Research Workshop on Speech Production Modeling & 4th Speech Production Seminar, pages 41–44, 1996.

E. Vatikiotis-Bateson, K. G. Munhall, M. Hirayama, Y. Kasahara, and H. Yehia, "Physiology-Based Synthesis of Audiovisual Speech," In Proceedings of the 1st ESCA Tutorial and Research Workshop on Speech Production Modeling & 4th Speech Production Seminar, pages 241–244, 1996.

E. Vatikiotis-Bateson, K. G. Munhall, Y. Kasahara, F. Garcia, and H. Yehia, "Characterizing audiovisual information during speech," In Proceedings of the International Conference on Spoken Language Processing, pages 1485–1488, 1996.

H. Yehia, M. K. Tiede, E. Vatikiotis-Bateson, and F. Itakura, "Applying morphological constraints to estimate three-dimensional vocal-tract shapes from partial profile and acoustic information," In The Proceedings of the 1996 Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, pages 855–860, 1996.

T. Taniguchi, H. Yehia, S. Kajita, T. Takeda and F. Itakura, "On the problems of applying Bell's blind separation to real environments," In The Proceedings of the 1996 Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, pages 1257–1260, 1996.

M. K. Tiede and H. Yehia, "A shape-based approach to vocal tract area function estimation," In The Proceedings of the 1996 Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, pages 861–866, 1996.

E. Vatikiotis-Bateson and H. Yehia, "Synthesizing audiovisual speech from physiological signals," In The Proceedings of the 1996 Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, pages 811–816, 1996.

H. Yehia and M. Tiede, "A parametric three-dimensional model of the vocal-tract based on MRI data," To appear in The Proceedings of ICASSP-97, 1997.

Technical Meetings and Symposia

H. Yehia and F. Itakura, "A method to estimate LPC parameters exploring frame segmentation," In Proceedings of the 1992 Spring Meeting of the Acoustical Society of Japan, pages 305–306, 1992.

H. Yehia and F. Itakura, "Dynamic vocal-tract shape determination from formant frequencies using two-dimensional Fourier analysis," SP-92 143, Institute of Electronics, Information and Communication Engineers, pages 49–56, 1993.

H. Yehia and F. Itakura, "Variational and perturbation analysis applied to determination of vocal-tract formants," In Proceedings of the 1993 Autumn Meeting of the Acoustical Society of Japan, pages 285–286, 1993.

H. Yehia and F. Itakura, "Analysis of a technique to measure the vocal-tract cross-sectional area based on the impulse response at the lips," SP-94 107, Institute of Electronics, Information and Communication Engineers, pages 69–76, 1995.

H. Yehia, M. Honda, and F. Itakura, "Acoustical measurements of the vocal-tract area function: System modelling and experimental results," In Proceedings of the 1995 Spring Meeting of the Acoustical Society of Japan, pages 305–306, 1995.

H. Yehia and F. Itakura, "Combining dynamic and acoustic constraints in the speech production inverse problem," SP-95 13, Institute of Electronics, Information and Communication Engineers, pages 23–30, 1995.

H. Yehia, K. Takeda, and F. Itakura, "A vocal-tract area function trajectory representation oriented to the speech production inverse problem," In Proceedings of the 1995 Autumn Meeting of the Acoustical Society of Japan, pages 339–340, 1995.

I. Masuda, H. Yehia, and H. Kawahara, "A study of a method for signal separation by spectral interpolation using Bartlett window properties (in Japanese)," EA-96 29, Institute of Electronics, Information and Communication Engineers, pages 17–24, 1996.

E. Vatikiotis-Bateson and H. Yehia, "Physiological modeling of facial motion during speech," H-96 65, The Acoustical Society of Japan, 1996.

H. Yehia, "Vocal-tract profile to area function mapping taking formant frequency constraints into account," In Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, pages 321–322, 1996.