Published by: ijcsis on Jun 30, 2010
Copyright: Attribution Non-Commercial
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010, ISSN 1947-5500
Speech Segmentation Algorithm Based On Fuzzy Memberships
Luis D. Huerta, Jose A. Huesca and Julio C. Contreras
Departamento de Informática, Universidad del Istmo, Campus Ixtepéc, Ixtepéc, Oaxaca, México
{luisdh2, achavez, jcontreras}@bianni.edu.mx
Abstract— In this work, an automatic speech segmentation algorithm with text independency was implemented. In the algorithm, the use of fuzzy memberships on each characteristic in different speech sub-bands is proposed; thus, the segmentation is performed in greater detail. Additionally, we tested with various speech signal frequencies and labelings, and we could observe how they affect the performance of the phoneme segmentation process. The speech segmentation algorithm used is described. During the segmentation process it is not supported by any additional information on the speech signal, such as the text. A correct segmentation rate of 80.51% is reported on a database in Spanish, with an over-segmentation rate near 0%.
Keywords: Speech Segmentation; Fuzzy Memberships; Phonemes; Sub-band Features
INTRODUCTION
Speech recognition systems provide a natural communication environment between people and computers. Basically, these systems require two processes to carry out the understanding of the speech signal: the segmentation process and the segment recognition process.

Speech recognition systems are based on units such as words, syllables, diphones and phonemes, the phonemes being the smallest set. A speech recognition system based on phonemes reduces the number of units needed for recognition and, therefore, reduces the confusion during the recognition process.

The segmentation process determines the existing limits between the speech units considered within the signal. The quality of the segmentation process directly affects the quality of the recognition process, since vague segments will perform poorly during the recognition process and therefore degrade the whole system.

Works on segmentation based on sub-words using syllables [1, 2] and phonemes [4, 5], to mention some, have been reported. Many have been tested under a series of restrictions, such as the use of a limited vocabulary [1], a small number of speakers [4], and the use of additional information. These are known as text dependent, as the ones reported in [1, 3].

A series of segmentation algorithms [5, 6, 7] has been proposed. These have been tested with various speakers, with naturally spoken utterances covering a wide range of vocabulary, and without any type of additional information on the content of the speech wave which could provide some help to the segmentation. Promising results in phoneme-based segmentation were reported.

However, there has been little effort to study factors that affect the performance of the segmentation beyond the implementation of the algorithm.

The present work shows results of the performance of the segmentation process with variants in the speech signal frequency and labeling. In the experiments, the DIMEx100 database of the Spanish spoken in Mexico was used.

FACTORS WHICH TAKE PART IN THE PERFORMANCE OF THE SPEECH SEGMENTATION
Recently, works related to automatic speech segmentation into phonemes have been tested under conditions that include different speakers and utterances expressed in natural conditions, without any vocabulary restrictions or any additional information on the content of the speech wave; this is known as text independence. These testing conditions affect the performance of the algorithm; however, considering them in the experimental phase allows more realistic results of better quality to be obtained. Therefore, it is crucially important to know which factors take part in the final result. Some factors that affect the performance of the segmentation are as follows:
Aspects of the Speakers
Including different speakers in the tests assesses how robust the algorithm is when dealing with diverse natural styles of speech, which involve features such as the speaker's diction, rate and intensity. Diction is related to the right pronunciation and articulation of the words; appropriate diction results in utterances that can be heard clearly and intelligibly by the receiver. The speaker's rate refers to the amount of words or sub-words spoken per time unit, or, more exactly, the speed at which a word or an utterance is expressed; at high rates of speech the clarity of the utterances is reduced. The intensity of the speech is related to the amount of energy involved in the emitted wave; a greater intensity gives a stronger emphasis to the phonetic transitions. The factors mentioned above influence the articulation of the words and, due to their effects, the transition between some phonetic limits is not clearly defined in the signal [9]. These aspects affect the performance of speech segmentation: a better performance is reached when the speaker has good diction, a low speech rate and high intensity.
Aspects of the Signal
The most important aspect of the signal is the sampling frequency. A greater number of samples obtained from the original signal during digitizing reveals better detail; however, more noise or unnecessary frequency components might be included. In accordance with the Nyquist-Shannon theorem, it is sufficient to sample at a frequency of at least twice the highest frequency contained in the signal.
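As a small concrete illustration of the sampling constraint above, the minimum sampling rate for a band-limited signal follows directly from the theorem (the 8 kHz bandwidth figure below is only an illustrative assumption, not a value from the paper):

```python
def min_sampling_rate(max_freq_hz: float) -> float:
    """Nyquist-Shannon: the sampling rate must be at least twice
    the highest frequency component present in the signal."""
    return 2.0 * max_freq_hz

# If the speech content of interest is assumed to lie below 8 kHz,
# a 16 kHz sampling rate is sufficient:
print(min_sampling_rate(8000.0))  # 16000.0
```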
Types of Labeling
There are levels of labeling that define the limits of the segments contained in a speech signal. The levels of labeling depend on the number of allophones, closures in stop and affricate consonants, glides and sounds of accentuated vowels, to mention a few. In this work, utterances from the DIMEx100 corpus of the Spanish spoken in Mexico are used; the corpus is described below. The utterances used in the tests include the following levels of labeling, as described in [10].
• Level T54: At this level the 37 most frequent allophones of Mexican Spanish are represented, as well as the 8 closures in stop and affricate consonants ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c, dZ_c]) and the 9 vowels that allow an accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7 and u_7]); the complete inventory of allophone units is also represented at this level.
• Level T44: This level considers some basic acoustic aspects and some syllabic features. It includes, besides the 22 prototypical allophones of Mexican Spanish, the closures in stop consonants and the voiceless affricate consonant ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c]), the allophones near voiced stops ([V, D, D]), the 9 vowels which allow accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]) and the glides ([j, w]). Also, a single symbol is allocated to consonant couples ([p/b, t/d, k/g, n/m, r(/r]) at the end of a syllable, or syllable coda ([-P, -T, -K, -N, -R]).
• Level T22: At this level solely the 22 allophonic forms (inventory) which are related to the phonemes of Mexican Spanish are represented.

The type of labeling is one of the aspects that must be considered, since it might affect the segmentation performance, as shown in the experiments section.
Extracted Information of the Speech Signal
Some features can be extracted from the time domain as well as from the frequency domain. Segmentation algorithms that use time-domain features such as intensity [6, 8], energy and zero-crossing rates, to mention a few, have been reported. On the other hand, some encoding schemes are extracted from the frequency domain, such as MFCC, PCBF, Bark spectrum and Mel spectrum. The best results for the segmentation process [5] were obtained with the Mel spectrum.
Mel Spectrum
Stevens and Volkman in [12] proposed the Mel scale. It was obtained from experiments on human hearing perception. They proposed that the perception level with respect to the frequency heard follows a logarithmic scale expressed by the equation:

Mel(f) = 2595 log10(1 + f / 700)    (1)
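Equation (1) translates directly into code. The sketch below implements the 2595 · log10(1 + f/700) mapping together with its inverse, which is useful when placing filter-bank centers:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the Mel scale, Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of Eq. (1): map a Mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```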
In order to obtain the speech codified in Mel spectra, a bank of filters emulates the critical perception bands, where the boundaries of each filter coincide with the centers of the adjacent filters; the filter centers follow the Mel scale. The filters obtain the average of the concentrations of energy around each central frequency for each frame of the speech signal, where each frame is a segment of the speech (usually of 10 ms).
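A minimal sketch of the filter bank just described, assuming triangular filters whose edges coincide with the centers of the adjacent filters on the Mel scale (NumPy is used for brevity; the 10 ms framing and FFT front end are omitted, and the shapes are illustrative assumptions rather than the paper's exact configuration):

```python
import numpy as np

def mel_filter_bank(n_filters: int, n_fft: int, sample_rate: float) -> np.ndarray:
    """Triangular filters equally spaced on the Mel scale. Each filter
    rises from the center of its left neighbor and falls to the center
    of its right neighbor, giving the 50% overlap of adjacent bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_filters + 2 edge points: one extra on each side for the triangles.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_freqs = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bin_freqs[i - 1], bin_freqs[i], bin_freqs[i + 1]
        for k in range(left, center):            # rising edge
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mel_spectrum(power_frame: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Energy concentration per filter for one frame's power spectrum."""
    return bank @ power_frame
```

Applying `mel_spectrum` to each frame yields one feature vector per frame, with one component per filter, as the paper uses below.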
Figure 1. Obtaining the Mel spectrum in vectors
In the present paper, Mel spectra stored in feature vectors are used, where the size of such vectors is equal to the number of filters applied to each frame of the signal. Each filter is applied to a frequency sub-band to obtain a quantification of the energy in it. To carry out the segmentation, the approach of comparing distances between objects represented by feature vectors is applied to determine the phonetic limits.

SEGMENTATION ALGORITHM

Some segmentation algorithms are based on features of the time domain, such as the ones reported in [6, 8], as well as on features of the frequency domain, such as the ones reported in [4, 5, 7], which perform phoneme segmentation with text independency. The proposed segmentation algorithm uses speech feature vectors, in particular the codification schemes based on the Mel spectrum. Each vector represents the features of the speech wave in diverse frequency intervals at a moment of time t. For each frequency interval, a fuzzy space is defined in order to obtain a better detailed spectral quantification in each case. This fuzzy space is defined by obtaining the minimum and maximum spectrum of each sub-band, with an overlapping of 50%. For each spectrum, the High, Mid and Low memberships regarding the frequency interval in which they reside are obtained. In order to obtain a quantitative representation of the existence or non-existence of a spectral change between frames, a summation of the corresponding distances in each sub-band is computed for an instant of time t and, in this way, the distance between compared frames is established. In order to determine the distance between the feature vectors of each frame, the following formula is used:
,
 

= µ

 

µ

 

+µ

 

µ

 

+µ

 

µ

 


(2)
Where
  
is the number of given features sub-bands in thenumber of filters used in the extraction of Mel spectra. Thedistance of a frame is obtained from the features of its adjacentframes. The previous equation gives the distances that exist ineach sub-band with respect to each membership of eachadjacent frame, applying summation to the distances of eachsub-band. A sole distance with respect to the frame in processis obtained.
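The per-frame distance described above can be sketched as follows, assuming each frame is represented by a (High, Mid, Low) membership triple for each of the N sub-bands; how those memberships are computed from the min/max spectra is left outside this sketch:

```python
def frame_distance(memb_prev, memb_next):
    """Distance between the two frames adjacent to the frame in
    process: sum over the N sub-bands of the absolute differences
    of the High, Mid and Low memberships. Each argument is a list
    of (high, mid, low) tuples, one tuple per sub-band."""
    return sum(
        abs(hp - hn) + abs(mp - mn) + abs(lp - ln)
        for (hp, mp, lp), (hn, mn, ln) in zip(memb_prev, memb_next)
    )
```

Identical adjacent frames yield a distance of 0, signaling no spectral change; a large value signals a candidate phonetic boundary.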
Figure 2. Algorithm based on fuzzy memberships of the Mel spectrum (pre-emphasis filter, filter bank, sub-band High/Mid/Low memberships, frame distances, selection rules, phoneme boundaries).
The representative distance of each frame is analyzed to establish whether it is a candidate to be a phonetic limit. The conditions used for the selection of these limits are:

1. D_t > D_{t−1} and D_t > D_{t+1}
2. D_t greater than a given threshold

The first condition is oriented to obtain the local maxima based on this simple criterion, while the second allows selecting the significant local maxima.
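The two selection rules above amount to a simple peak-picking pass over the per-frame distances; a sketch follows, where the threshold value is an assumption, since the paper does not state one here:

```python
def select_boundaries(distances, threshold):
    """Return frame indices that are local maxima of the distance
    curve (rule 1) and exceed the given threshold (rule 2)."""
    boundaries = []
    for t in range(1, len(distances) - 1):
        is_local_max = distances[t] > distances[t - 1] and distances[t] > distances[t + 1]
        if is_local_max and distances[t] > threshold:
            boundaries.append(t)
    return boundaries
```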
IMPLEMENTATION AND EXPERIMENTS
In the experiments, tests with frequency variations, labeling, encoding schemes, and the use of fuzzy memberships on the Mel spectra were carried out. The feature extraction and the segmentation process were implemented using the freeware PRAAT v.4.16.3 [11].
Data Description
Tests were performed using Spanish utterances obtained from the DIMEx100 corpus. The corpus was recorded in a sound studio at CCADET, Universidad Nacional Autónoma de México (UNAM), in mono format, sampled at 16 bits with a sampling rate of 44.1 kHz. The speakers' ages range from 16 to 36, with an average of 23.82 years, and they have more than 9 years of formal education in Mexico City. A random group of speakers at UNAM (researchers, students, teachers and workers) was selected; 87% of them do not hold a degree. 49% of the utterances of the corpus are expressed by females and 51% by males. Speakers from Mexico City were chosen for this corpus, since this variety represents the one spoken by the majority of the population in the country.
Experimental Data
In the test phase, 240 speech signals were used, corresponding to 30 speakers (15 males and 15 females). There were a total of 12655, 12551 and 11192 phonetic boundaries using labelings of 54, 44 and 22 phonemes, respectively. The signals were extracted from the DIMEx100 corpus with Spanish sentences.
Measurement of Performance
The algorithm performance was evaluated with commonly used measures, such as in [5, 6, 7, 8].

D = (Sd / St − 1) × 100    (3)

Where D is the measurement of over-segmentation, Sd is the number of segmentation points detected by the algorithm, and St is the number of real segmentation points.

Pc = (Sc / St) × 100    (4)

Where Pc is the percentage of correct detections, and Sc is the number of correctly detected segmentation points. A segmentation point is considered correct if its distance to the true segmentation point is within the range of ±20 ms.
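The two measures can be sketched as follows, matching each detected boundary to at most one true boundary within the ±20 ms tolerance (the greedy one-to-one matching is an assumption; Eqs. (3) and (4) only fix the counts Sd, St and Sc):

```python
def evaluate(detected_s, true_s, tol=0.020):
    """Over-segmentation D (Eq. 3) and correct-detection rate Pc
    (Eq. 4). Boundaries are times in seconds; a detection counts as
    correct if it lies within tol of a not-yet-matched true boundary."""
    sd, st = len(detected_s), len(true_s)
    matched = set()
    sc = 0
    for d in detected_s:
        for i, t in enumerate(true_s):
            if i not in matched and abs(d - t) <= tol:
                matched.add(i)
                sc += 1
                break
    over_seg = (sd / st - 1.0) * 100.0
    pc = (sc / st) * 100.0
    return over_seg, pc
```

For example, three detections against two true boundaries, with both true boundaries hit, gives D = 50% (one spurious detection per two true points) and Pc = 100%.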