
Automatic Speech Segmentation

Introduction
Speech segmentation is the process of dividing a speech signal into segments. Segmentation of
continuous speech can be at the word, syllable or phone level. Segmentation is one of the important
stages in knowledge based automatic speech recognition (ASR). Once speech is segmented
into its basic units, the next step in ASR is classification of these units. If the segmentation is
at the word level, all the words of the language need to be present in the dictionary of the ASR
system for classification, and the number of words in a language is very high. On the other hand, if the
segmentation is at the phone level, the dictionary need only contain all possible phones of the
language, which is a very small set compared to the number of words. But phone
level segmentation is a more difficult task than word level segmentation, as the phones in continuous speech
are not well separated. Syllable level segmentation is a compromise between the two: word and
phone level segmentation. A syllable is a unit of speech having one vowel sound, with or
without surrounding consonants, forming the whole or a part of a word. Further, syllable level
segmentation has been shown to be a better representation for ASR than the word or phone level
[1].
Manual segmentation of speech is tedious and time consuming, and the manpower required for manual
segmentation is very high. Furthermore, manual segmentation of the same sentence
may not result in a unique solution. For example, Toledano et al. [2] evaluated the differences in the
manual segmentation of the same speech database at the phone level by different human experts.
They found that 88.31% and 96.89% of the boundaries lie within tolerance intervals of 10 ms
and 20 ms respectively. These variations in manual segmentation can be overcome by using
automatic speech segmentation.
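
For illustration, the agreement between two segmentations of the same utterance can be measured as the percentage of reference boundaries that have a counterpart within a given tolerance. The following Python sketch computes such a measure; the function name and the nearest-neighbour matching rule are illustrative assumptions, not necessarily the exact procedure of [2].

import numpy as np

def boundary_agreement(reference, hypothesis, tolerance_s=0.020):
    # Percentage of reference boundary times (in seconds) that have a
    # hypothesis boundary within +/- tolerance_s. Nearest-neighbour
    # matching is an assumption, not the exact rule used in [2].
    reference = np.asarray(reference, dtype=float)
    hypothesis = np.asarray(hypothesis, dtype=float)
    if reference.size == 0 or hypothesis.size == 0:
        return 0.0
    # Distance from each reference boundary to its nearest hypothesis boundary.
    dists = np.min(np.abs(reference[:, None] - hypothesis[None, :]), axis=1)
    return 100.0 * np.mean(dists <= tolerance_s)

Calling this with tolerance_s set to 0.010 and to 0.020 corresponds to the 10 ms and 20 ms tolerance intervals considered in [2].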
With this motivation, this work focuses on automatic speech segmentation at the syllable level.
The next section of the report presents the related work. The report also gives the possible
objectives, methodology and outcomes of the proposed work.
Literature Review
As mentioned already, a syllable consists of a vowel along with consonants. Most of the energy
in vowels is concentrated in the lower frequencies. Therefore, peaks in the low frequency
energy contour are expected to coincide with syllable nuclei. Pfitzinger et al. [3] find all
possible candidates for syllable nuclei and then use signal loudness to select the right candidates.
They found that the candidates matching the reference syllables were above 40-50 dB signal
loudness. To reduce multiple markings of the same syllable, a fixed window is applied: all
candidates that fall within the window are considered, and the one with the highest
loudness is chosen as the right candidate.
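
A minimal sketch of this peak-picking idea in Python (NumPy/SciPy) is given below. The band edges, frame length, the height threshold and the minimum peak separation are assumed illustrative values; Pfitzinger et al. [3] use a perceptual loudness measure rather than the plain band energy shown here.

import numpy as np
from scipy.signal import butter, lfilter, find_peaks

def syllable_nuclei_from_energy(x, fs, frame_s=0.01, min_gap_s=0.15):
    # Band-limit the signal to a vowel-dominated band (band edges assumed).
    b, a = butter(4, [150 / (fs / 2), 1000 / (fs / 2)], btype="band")
    x = lfilter(b, a, x)
    # Short-time energy contour, one value per frame (no overlap).
    hop = int(frame_s * fs)
    energy = np.array([np.sum(x[i:i + hop] ** 2)
                       for i in range(0, len(x) - hop, hop)])
    energy_db = 10 * np.log10(energy + 1e-12)
    # Peaks must exceed a loudness-like threshold and be separated by a
    # minimum window, so that one syllable is not marked twice.
    peaks, _ = find_peaks(energy_db,
                          height=energy_db.max() - 40,           # assumed
                          distance=max(1, int(min_gap_s / frame_s)))
    return peaks, energy_db  # nucleus candidate frames and the contour

Multiplying the returned frame indices by frame_s gives the candidate nucleus times in seconds.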
Similar to Pfitzinger et al. [3], Villing et al. [4] use the idea that intensity peaks are the
syllable nucleus candidates. In addition, they also try to mark the syllable boundaries, with
intensity troughs as the possible boundary candidates. To select the right
candidates, they use envelope velocity and coarse spectral makeup.
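
Continuing the sketch above, a boundary candidate can be placed at the deepest energy trough between each pair of consecutive nucleus candidates. This is a deliberately simplified stand-in for the envelope velocity and coarse spectral cues used in [4].

def boundaries_from_troughs(energy_db, nucleus_frames):
    # One boundary at the deepest trough between consecutive nuclei;
    # [4] refine this choice with envelope velocity and spectral cues.
    boundaries = []
    for left, right in zip(nucleus_frames[:-1], nucleus_frames[1:]):
        boundaries.append(left + int(np.argmin(energy_db[left:right])))
    return boundaries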
Zhang and Glass [5] find the first two syllable nuclei and then estimate the instantaneous speech
rhythm from these two nuclei. Using this rhythm, they predict the intervals in which the next syllable
nucleus may appear. Instead of using an amplitude threshold, the authors use a slope based peak detection
algorithm.
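
A slope based detector marks a peak wherever the slope of the contour changes from positive to non-positive, so no amplitude threshold is needed. The sketch below (in the same Python setting as above) shows only this basic idea; restricting the detected peaks to the rhythm-predicted intervals of [5] is not reproduced here.

def slope_based_peaks(contour):
    # A frame is a peak when the first difference changes sign
    # from positive to non-positive (threshold-free detection).
    slope = np.diff(contour)
    return [i for i in range(1, len(slope))
            if slope[i - 1] > 0 and slope[i] <= 0]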
Nagarajan and Murthy [6] process the short time energy (STE) function to extract the syllable
boundaries. The STE contour is smoothed by using the additive property of the Fourier transform phase
and the deconvolution property of the cepstrum. The smoothed STE contour is then treated as a
magnitude spectrum, and a minimum phase group delay function is computed from it, which
represents the syllable boundaries better than the STE itself.
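
One way to realize this computation is sketched below: the smoothed STE contour is mirrored so that it can serve as a symmetric magnitude spectrum, a minimum phase sequence is derived from it through cepstral windowing, and the group delay is obtained with the standard n*x[n] identity, tau(w) = Re(X(w)Y*(w)) / |X(w)|^2 with y[n] = n*x[n]. This is a sketch under assumed details; the subband processing and the exact smoothing of [6] are not reproduced.

import numpy as np

def min_phase_group_delay(contour):
    # Mirror the positive contour so it acts as a magnitude spectrum.
    mag = np.concatenate([contour, contour[::-1]])
    n = len(mag)  # even by construction
    ceps = np.fft.ifft(np.log(np.maximum(mag, 1e-10))).real
    # Cepstral window that yields the minimum phase equivalent.
    w = np.zeros(n)
    w[0], w[n // 2] = 1.0, 1.0
    w[1:n // 2] = 2.0
    x = np.fft.ifft(np.exp(np.fft.fft(ceps * w))).real
    # Group delay via tau(w) = Re(X * conj(Y)) / |X|^2, y[n] = n*x[n].
    X = np.fft.fft(x)
    Y = np.fft.fft(np.arange(n) * x)
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10)
    return tau[:n // 2]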
Further, Janakiraman et al. [7] combined the group delay approach with vowel onset point
detection. They first tested group delay based syllabification on speech utterances with varying
speaking rates and found that over-segmentation occurs if the speaking rate decreases and under-
segmentation occurs if the speaking rate increases. They therefore designed a syllabification
system that is robust to the speaking rate by combining information about the syllable rate
(syllables per second) of the dataset, the minimum phase group delay and vowel onset point detection
to improve the system performance.
Database
Different databases have been used in the literature for evaluating syllabification. Among these, the
standard available database is the Texas Instruments and Massachusetts Institute of Technology
(TIMIT) corpus, a standard American continuous speech database comprising 6300 sentences uttered
by 630 speakers [8]. TIMIT provides manual segmentation at both the word and the phone
level. The files were digitized at a 16 kHz sample rate using a 16-bit A/D converter. Landsiedel et al. [9]
used the available annotation at the phone level, and the syllabic annotation was created using
a rule-based syllabification program.
Recently, a database for a similar purpose has been developed in Kannada, a language
predominantly spoken in the state of Karnataka [10]. This work is taken up under a consortium
in which databases are being developed in 12 Indian languages for use in prosody related research.
The consortium is headed by Prof. B. Yegnanarayana, Professor at IIIT Hyderabad. The Kannada
speech corpus consists of data in three different contexts, namely read mode, conversation mode
and extempore mode. A four layer transcription, namely phonetic transcription using
International Phonetic Alphabet (IPA) symbols, syllabification, pitch marking and break marking,
has been done for a subset of the data. 30 hours of data (10 hours in each mode) have been recorded, and
1 hour of data has been prosodically transcribed with the four layers. The data have been collected from
56 speakers from different dialects of Karnataka. The data were recorded at 16 kHz with
16 bits per sample.
Objective
The objective of the work will be to explore new techniques/methods for automatic
syllabification of speech.
Methodology
This work will first focus on evaluating the existing methods for syllable level segmentation on the
standard American English database, TIMIT. Later, the same features will be used for syllabification
of the Kannada database described above. New methods will also be investigated.
Possible Outcome
A system which can automatically segment a speech signal into syllables.
References
[1] S. L. Wu, B. E. D. Kingsbury, N. Morgan and S. Greenberg, "Incorporating information from syllable-length time scales into automatic speech recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, Seattle, WA, pp. 721-724, 1998.
[2] D. T. Toledano, L. Hernández Gómez and L. Villarrubia Grande, "Automatic phonetic segmentation," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617-625, November 2003.
[3] H. R. Pfitzinger, S. Burger and S. Heid, "Syllable detection in read and spontaneous speech," in Proceedings of ICSLP, 1996.
[4] R. Villing, J. Timoney, T. Ward and J. Costello, "Automatic blind syllable segmentation for continuous speech," in Proceedings of ISSC 2004, Belfast, 2004.
[5] Y. Zhang and J. R. Glass, "Speech rhythm guided syllable nuclei detection," in Proceedings of ICASSP 2009, pp. 3797-3800, 2009.
[6] T. Nagarajan and H. A. Murthy, "Subband-based group delay segmentation of spontaneous speech into syllable-like units," EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 17, pp. 2614-2625, 2004.
[7] R. Janakiraman, J. Chaitanya Kumar and H. A. Murthy, "Robust syllable segmentation and its application to syllable-centric continuous speech recognition," in IEEE National Conference on Communications (NCC), 2010.
[8] L. F. Lamel, R. H. Kassel and S. Seneff, "Speech database development: Design and analysis of the acoustic-phonetic corpus," in DARPA Speech Recognition Workshop, Palo Alto, 1986.
[9] C. Landsiedel, J. Edlund, F. Eyben, D. Neiberg and B. Schuller, "Syllabification of conversational speech using bidirectional long short-term memory neural networks," in Proceedings of ICASSP 2011, pp. 5256-5259, 2011.
[10] M. V. Shridhara, B. K. Banahatti, L. Narthan, V. Karjigi and R. Kumaraswamy, "Development of Kannada speech corpus for prosodically guided phonetic search engine," Oriental COCOSDA, Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, 2013.
