(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010
clearly defined in the signal. These aspects affect the performance of speech segmentation. Better performance is reached when the speaker has good diction, a low speech rate, and high intensity.
Aspects of the Signal
The most important aspect of the signal is the sampling frequency. The more samples obtained from the original signal during digitizing, the more detail is revealed; however, more noise or unnecessary frequency components may be included as well. In accordance with the Nyquist-Shannon theorem, it is sufficient to sample at a frequency of at least twice the highest frequency present in the signal.
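As an illustration of the Nyquist criterion, the following sketch computes the minimum sampling rate for a given signal bandwidth and the alias produced when a tone exceeds half the sampling rate. The 8 kHz bandwidth, the tone frequency, and the function names are illustrative choices, not values taken from the corpus or the paper:

```python
def min_sampling_rate(max_signal_freq_hz: float) -> float:
    """Nyquist-Shannon: sample at least twice the highest frequency present."""
    return 2.0 * max_signal_freq_hz

def aliased_frequency(tone_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling at fs_hz.
    Tones above fs_hz / 2 fold back into the representable band."""
    folded = tone_hz % fs_hz
    return fs_hz - folded if folded > fs_hz / 2 else folded

# Capturing detail up to 8 kHz requires at least a 16 kHz sampling rate.
print(min_sampling_rate(8000.0))           # 16000.0
# A 9 kHz tone sampled at 16 kHz is indistinguishable from a 7 kHz tone.
print(aliased_frequency(9000.0, 16000.0))  # 7000.0
```

Sampling below the Nyquist rate does not merely lose detail: the out-of-band energy folds back and corrupts the representable band, which is why under-sampled speech may gain "unnecessary" frequency content.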
Types of Labeling
There are levels of labeling that define the limits of the segments contained in a speech signal. The levels of labeling depend on the number of allophones, closures in stop and affricate consonants, glides, and sounds of accentuated vowels, to mention a few. In this work, utterances from the DIMEx100 corpus of the Spanish spoken in Mexico are used; the corpus is described below. The utterances used in the tests include the following levels of labeling.
At this level the 37 most frequent allophones of Mexican Spanish are represented, as well as the 8 closures in stop and affricate consonants ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c, dZ_c]) and the 9 vowels that allow an accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]); the complete inventory of allophone units is also represented at this level.
This level considers some basic acoustic aspects and some syllabic features; it includes, besides the 22 prototypical allophones of Mexican Spanish, the closures in stop consonants and the voiceless affricate consonant ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c]), the allophones near voiced stops ([V, D, D]), the 9 vowels which allow an accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]), and the glides ([j, w]). Also, a single symbol is allocated to consonant pairs ([p/b, t/d, k/g, n/m, r(/r]) at the end of a syllable, i.e., as a syllable coda ([-P, -T, -K, -N, -R]).
At this level only the 22 allophonic forms (inventory) related to the phonemes of Mexican Spanish are represented. The type of labeling is one of the aspects that must be considered, since it may affect segmentation performance, as shown in the experiments section.
Information Extracted from the Speech Signal
Some features can be extracted from the time domain as well as from the frequency domain. Segmentation algorithms that use time-domain features such as intensity [6, 8], energy, and zero-crossing rates, to mention a few, have been reported. On the other hand, some encoding schemes are extracted from the frequency domain, such as MFCC, PCBF, the Bark spectrum, and the Mel spectrum. The best results for the segmentation process were obtained with the latter.
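As a sketch of the two time-domain features mentioned above, the following illustrative functions compute the short-time energy and the zero-crossing rate of a single frame. The frame values are made up for the example, and these are common textbook formulations rather than the exact definitions used in the cited works:

```python
def short_time_energy(frame):
    """Sum of squared sample amplitudes within one frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)

frame = [0.2, -0.1, 0.3, -0.4, 0.1]   # toy 5-sample frame
print(short_time_energy(frame))        # ~0.31
print(zero_crossing_rate(frame))       # 1.0 (every adjacent pair changes sign)
```

Energy tends to be high in vowels and low in silences and voiceless consonants, while the zero-crossing rate is high in fricative-like noise; segmentation methods based on the time domain exploit changes in such measures between consecutive frames.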
Stevens and Volkman proposed the Mel scale, obtained from experiments on human hearing perception. They proposed that the perception level with respect to the frequency heard follows a logarithmic scale expressed by the equation:

Mel(f) = 2595 log10(1 + f / 700)

where f is the frequency in Hz.
In order to obtain the speech coded as Mel spectra, a bank of filters emulates the critical perception bands: the boundaries of each filter coincide with the centers of the adjacent filters, and the centers themselves follow the Mel scale. The filters obtain the average of the energy concentrated around each central frequency for each frame of the speech signal, where a frame is a segment of the speech (usually of 10 ms).
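The filter placement described above can be sketched as follows. The conversion uses the widely cited 2595·log10(1 + f/700) form of the Mel formula; the function names, the 0–8000 Hz range, and the filter count are illustrative choices, not parameters taken from the paper:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the Mel scale (logarithmic perception)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(f_min_hz, f_max_hz, n_filters):
    """Center frequencies (Hz) of n_filters triangular filters spaced
    evenly on the Mel scale; each filter's edges coincide with the
    centers of its neighbours, as described in the text."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_filters + 1)  # n centers plus 2 outer edges
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_filter_centers(0.0, 8000.0, 26)
# The centers crowd together at low frequencies and spread out at high
# ones, mirroring the ear's finer resolution in the low band.
```

Averaging the frame's spectral energy under each triangular filter then yields one Mel-spectrum value per filter, i.e., one component of the feature vector used below.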
Figure 1. Obtaining the Mel spectrum in vectors
In the present paper, Mel spectra stored in feature vectors are used, where the size of each vector is equal to the number of filters applied to each frame of the signal. Each filter is applied to a frequency sub-band to obtain a quantification of the energy in it. To carry out the segmentation, the approach of comparing distances between objects represented by feature vectors is applied to determine the phonetic limits.

SEGMENTATION ALGORITHM

Some segmentation algorithms are based on features of the time domain, such as the ones reported in [6, 8], as well as on features of the frequency domain, such as the ones reported in [4, 5, 7], which perform phoneme segmentation with text independency. The proposed segmentation algorithm uses speech feature vectors, in particular the coding scheme based on the Mel spectrum. Each vector represents the features of the speech wave in diverse frequency intervals at a moment of time t. For each frequency interval, a fuzzy space is