The SIGMA Algorithm A Glottal Activity Detector Fo
Mark R. P. Thomas
Dolby Laboratories, Inc.
All content following this page was uploaded by Mark R. P. Thomas on 06 March 2015.
Authorized licensed use limited to: Imperial College London. Downloaded on January 4, 2010 at 08:01 from IEEE Xplore. Restrictions apply.
1558 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 8, NOVEMBER 2009
speech. Recent approaches have applied multiscale analysis to detect glottal activity as singularities in the EGG signal [17] and speech signal [18]. Existing techniques are, however, often prone to errors around the end of voicing as discussed in Section II.

In cases where only the speech signal is available, new algorithms have recently been proposed which estimate glottal activity from the speech signal alone [19]–[23], and this is an ongoing topic of research with seemingly ever-improving results. These algorithms enable glottal activity information to be determined in real-world applications in which, typically, the EGG signal is not available. However, as such methods improve, their evaluation requires ever more accurate references. This requirement, alongside the application to the study of pathological speech, further motivates the development of better EGG-based detection algorithms.

This paper describes the Singularity in EGG by Multiscale Analysis (SIGMA) algorithm. SIGMA benefits from the use of multiscale processing but novelly extends the approach by performing spike detection on the multiscale product using a group delay method [24] which circumvents the need for thresholding. The robustness of our approach to false detections is further enhanced by Gaussian mixture modeling [25] which is used to remove detections with unlikely features. The proposed method provides GCI estimation with outstandingly high accuracy and achieves similarly accurate GOI detection. Additionally, the algorithm makes no assumptions about the nature of the EGG signal other than the bounds on the range of glottal frequency and open quotients [26]; SIGMA may therefore have many further uses as it is also suitable for singularity detection in applications outside the field of speech processing.

This paper is organized as follows. Section II reviews the characteristics of the EGG signal and the methodology employed by some existing algorithms. Section III describes multiscale analysis, the use of the group delay function and Gaussian mixture modeling for spike detection in the multiscale product. The proposed SIGMA algorithm is compared with existing techniques and evaluated in Section IV. Conclusions are drawn in Section V.

II. INTERPRETING EGG SIGNALS

A. HQTx and TXGEN

In Section IV, the performance of SIGMA is compared with two existing algorithms: High Quality Time of excitation (HQTx) and Time of eXcitation GENerator (TXGEN) [16]. The following is a brief description of their operation.

HQTx uses two derived functions: DEGG and an estimation of instantaneous gradient. A threshold function varies dynamically with the EGG signal, whose minimum is set by periods of silence assumed to lie during the first and last 20 ms of the EGG recording. The instants of time when the DEGG and instantaneous gradient exceed this threshold are the estimated GCIs.

TXGEN uses a more straightforward approach but attempts to detect both GCIs and GOIs. After low-pass filtering the EGG signal at 3 kHz, it is differentiated to find DEGG. High and low thresholds are set by the extrema of the DEGG signal from the entire recording multiplied by a constant-valued coefficient. If DEGG passes through both thresholds within a set period of time, an estimated GCI is flagged. A GOI is the point in the EGG signal whose amplitude is equal to the amplitude at the preceding GCI.

B. Detection Errors

Both SIGMA and the algorithms described are evaluated against a large hand-labeled database. The remainder of this section describes common features of the EGG signal, those cases where interpretation of the EGG signal requires clarification and the resulting errors made by existing algorithms.

A voiced speech signal, its corresponding time-aligned EGG signal and the EGG derivative are shown in Fig. 1. Time alignment is achieved by ensuring that the lip-microphone propagation distance plus an estimate of the length of the talker’s vocal tract is a constant value, then subtracting the corresponding delay. We define a positive EGG signal to be high glottal contact area, giving positive- and negative-going transients for GCIs and GOIs, respectively, with corresponding spikes in the EGG derivative.

Errors in GCI detection can be divided into two categories [19]: False alarm errors are made when more than one GCI is detected within a reference cycle; Miss errors are made when no GCI is detected within a reference cycle (GOI errors are treated in the same manner). Errors occur when certain types of EGG signal, discussed in the following sections, cause a poor estimate of the signal thresholds described in Section II-A.

C. “False Alarm” Errors

It has been shown that, for normal “modal” voiced speech, the frequency of oscillation of the glottis and the open quotient are dependent on phoneme and voice quality [12], [14]. Studies have further revealed that, for a given talker, the difficulty of detecting glottal closure is largely independent of the sound produced but that interesting effects occur at the boundaries of voiced/unvoiced speech, noting in particular [27]:

1) “Vocal fold vibration does not stop abruptly at the end of voicing, but slowly decays as the vocal folds come to a rest position.”
2) “It is possible for vocal fold vibration to continue without the generation of any significant energy,” termed “breathy offsets” [28].

This is examined in greater detail in [28], where a third phenomenon is observed at the end of voicing.

3) “A persistence of energy in the speech waveform after the EGG waveform has dropped virtually to zero,” termed “breathy voice.”

In the case of breathy offsets, GCIs can be detected from the EGG long after the speech amplitude has significantly diminished as the EGG signal remains modal, with increasing open phases that result in a breathier sound [28]. This is demonstrated in Fig. 2, showing 14 cycles of breathy offset terminating in breathy voice when the EGG signal finally loses modality.

In the case of breathy voice, observed throughout case (3) and at the very end of case (2), the glottis is “flapping in the breeze” [29] with insufficient contact to register on the EGG waveform. As described in [30], “If the glottis does not shut
THOMAS AND NAYLOR: SIGMA ALGORITHM: A GLOTTAL ACTIVITY DETECTOR FOR EGG SIGNALS 1559
quickly enough no vocal wave is generated in the supraglottic cavity,” and is demonstrated in Fig. 3. In both cases, a number of erroneous GCIs are detected by HQTx during segments of breathy voice ( ) until its dynamic threshold is no longer exceeded. These errors also often occur at erratic intervals. For the hand-labeled reference, marked “ ,” the labeler would not mark any GCIs where there is no visible instant defining the periodicity, as would be the case with all instances of breathy voice.

Fig. 2. (a) Speech signal, (b) EGG signal, and (c) its time derivative with overlayed HQTx GCI estimation markers at the end of a voiced speech segment, /u/, exhibiting “breathy offset” (cycles 8–21) and briefly “breathy voice.” The first 22 GCIs are identified correctly (marked “ ”) but the last three (marked “ ”) are erroneous.

Fig. 3. (a) Speech signal, (b) EGG signal, and (c) its time derivative with overlayed HQTx GCI estimation markers at the end of a voiced speech segment, /I/, exhibiting “breathy voice.” The first three GCIs are identified correctly (marked “ ”) but the last four (marked “ ”) are erroneous. Negative peaks due to glottal opening are significant in (c).

Breathy voice represents a natural transition from modal voiced speech to unvoiced or silence [28]. It is further noted that this usually lasts for just a few cycles of speech but erroneous estimates by a GCI detector during these segments can cause significant problems for glottal-synchronous algorithms. For example, a pitch tracker [2] that calculates pitch on a cycle-by-cycle basis will give highly erratic results. Glottal-synchronous speech processing algorithms such as prosodic speech modification [3], speech dereverberation [4], speech synthesis [5], and voice source modeling [6] all rely upon the manipulation of individual cycles of speech. Any fricatives or plosives following segments of voiced speech will be treated as periodic, giving rise to particularly annoying artefacts [31].

An example is shown in Fig. 4 where HQTx is used to drive the PSOLA algorithm [3] to increase the duration of a speech signal by three times without affecting prosody or formant structure. Applications for increasing the duration of a speech signal include enhancing intelligibility and lip synchronization in motion video. It is achieved by repeating cycles of voiced speech and concatenating them with an estimate of the correct period as shown in the first 70 ms of Fig. 4(b). Unvoiced speech and voiced-unvoiced transitions do not exhibit such periodicity so a common approach is to leave these segments unmodified [31]. That is not the case here, owing to the erroneous detections at the voiced-unvoiced transition from 70–150 ms, leading to strange artefacts that detract from the otherwise natural sound of the processed voiced speech segments.

Fig. 4. (a) Original Speech signal with correct GCIs (marked “ ”) and false alarm errors (marked “ ”) and (b) time scale expanded by three times with the PSOLA Algorithm. Voiced cycles are copied and concatenated to increase duration; this works well for modal speech but fails when GCIs are detected in the wrong location.

Sudden changes in EGG amplitude can also cause false alarm errors in dynamic threshold-based algorithms if the threshold is too low. A further problem with dynamic thresholds arises when GCIs have slow rise times [17], causing not a spike but a spread pulse in the DEGG. In this case, we define the GCI as the center of energy of the pulse.

D. “Miss” Errors

A common feature at the end of voiced segments is a reduced EGG signal amplitude compared with normal modal voice.
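
The effect of reduced amplitude on a threshold-based detector can be illustrated with a minimal sketch. It assumes a single fixed threshold set to a constant fraction of the global DEGG maximum, in the spirit of the scheme described in Section II-A; it is not the HQTx or TXGEN implementation, and the function name, the coefficient 0.5 and the synthetic spike amplitudes are all illustrative assumptions.

```python
def threshold_gci_detect(degg, coeff=0.5):
    """Flag samples where DEGG rises through coeff * max(degg).

    Hypothetical single-threshold detector, for illustration only.
    """
    threshold = coeff * max(degg)
    return [n for n in range(1, len(degg))
            if degg[n - 1] < threshold <= degg[n]]


# Synthetic DEGG: one spike per 100-sample cycle. The last three cycles
# have reduced amplitude, as at the end of a voiced segment.
degg = [0.0] * 800
true_gcis = list(range(50, 800, 100))
for i, n in enumerate(true_gcis):
    degg[n] = 1.0 if i < 5 else 0.3

detected = threshold_gci_detect(degg)
missed = [n for n in true_gcis if n not in detected]
print("detected:", detected)
print("missed:  ", missed)
```

In this toy case the three low-amplitude cycles fall below the global threshold and are therefore missed, while lowering the coefficient to recover them would make the detector more sensitive to noise elsewhere.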
Fig. 6. (a) Approximation and (b) detail analysis filters for multiscale analysis. Iterating these filters through a dyadic filterbank constructs a biorthogonal spline wavelet with one vanishing moment.

(2)

where is the SWT of at scale and are the approximation coefficients at scale . The multiscale product, , is formed by

(3)

where it is assumed that the lowest scale to include is always 1. The sign of is inverted compared with a DEGG using the chosen wavelet, hence a minus sign is included to maintain the convention. The de-noising effect of at each scale in conjunction with the multiscale product means that is near-zero except at discontinuities across the first scales of as depicted in Fig. 7(b), allowing better identification of discontinuities than the DEGG. The function can be half-wave rectified to contain peaks pertaining only to GCIs, , or GOIs, , which aids the group delay function in the following step. The value of is limited by , but it is often no greater than as the region of support (RoS) of and

Fig. 7. EGG waveform, multiscale product and group delay function for GCI detection. Candidates are marked “ ” and chosen candidates are marked “ .” The ideal slope, marked in a dashed line on the lowest plot, is the slope which would exist if the candidates were perfect impulses.

(5)

where is the Fourier transform of and is the Fourier transform of at frequency . If , where is a unit impulse function, it follows from (5) that . In the presence of noise, remains constant but with a degree of additive noise, so an averaging procedure needs to be performed over ; different approaches are reviewed in [24]. The Energy-Weighted Group Delay was deemed the most appropriate [20], defined as

(6)

Manipulation yields the simplified expression

(7)
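
As a rough illustration of the two steps described in this section, the following sketch forms a multiscale product over three dyadic scales and then locates its spikes from the negative-going zero crossings of an energy-weighted group delay. Several choices are assumptions rather than details taken from the paper: an undecimated Haar difference filter stands in for the bior1.5 wavelet, the group delay is computed in the time domain as the energy centroid of a sliding window, and the function names, the window half-width and the idealized rectangular EGG are purely illustrative.

```python
def multiscale_product(x, num_scales=3):
    """Product of undecimated Haar detail signals over dyadic scales."""
    approx = list(x)
    product = [1.0] * len(x)
    for j in range(num_scales):
        lag = 2 ** j  # filter taps are spread further apart at each scale
        # Samples before the start of the record repeat the first sample.
        delayed = [approx[max(n - lag, 0)] for n in range(len(approx))]
        detail = [a - b for a, b in zip(approx, delayed)]
        approx = [(a + b) / 2.0 for a, b in zip(approx, delayed)]
        product = [p * d for p, d in zip(product, detail)]
    return product


def energy_weighted_group_delay(x, half_width):
    """Energy centroid of x, relative to n, in a window centred on n."""
    delay = []
    for n in range(len(x)):
        lo, hi = max(0, n - half_width), min(len(x), n + half_width + 1)
        energy = sum(x[k] ** 2 for k in range(lo, hi))
        moment = sum((k - n) * x[k] ** 2 for k in range(lo, hi))
        delay.append(moment / energy if energy > 0.0 else 0.0)
    return delay


def negative_going_zero_crossings(d):
    """Indices at which d falls from a positive value to zero or below."""
    return [n for n in range(1, len(d)) if d[n - 1] > 0.0 and d[n] <= 0.0]


# Idealized EGG: contact rises at samples 40 and 140, falls at 90 and 190.
egg = [1.0 if (40 <= n < 90 or 140 <= n < 190) else 0.0 for n in range(240)]

p = multiscale_product(egg)
p_pos = [max(v, 0.0) for v in p]    # half-wave rectified: closures
p_neg = [max(-v, 0.0) for v in p]   # half-wave rectified: openings

gci_candidates = negative_going_zero_crossings(
    energy_weighted_group_delay(p_pos, half_width=20))
goi_candidates = negative_going_zero_crossings(
    energy_weighted_group_delay(p_neg, half_width=20))
print(gci_candidates, goi_candidates)
```

For a single impulse inside the window the centroid falls with unit slope and crosses zero at the impulse, which is why no amplitude threshold is needed to locate the spike. With an odd number of scales the sign of the product follows the sign of the step, so rectification separates closures from openings.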
C. Candidate Selection

The true GCIs are usually a subset of the negative-going zero crossings of , with additional false crossings during unvoiced speech, silence and occasionally between GCIs. Many existing approaches concentrate only on those areas where false candidates are unlikely to occur. The following candidate selection technique aims to remove all false candidates to provide a set of true GCIs throughout an entire segment of speech. Let the number of candidates be occurring at samples , . Three measurements construct a feature vector, , from which is derived a feature matrix, . The features are defined as follows.

1) Consistency of the group delay gradient. In the case of a Dirac pulse, is a negative unit slope, with a zero crossing at the location of the impulse and width samples, as shown in Fig. 7(c). A spread pulse or the presence of noise will cause the slope to deviate from the ideal shape, denoted . The RMS error between ideal and measured is calculated as

(8)

2) Peak value of multiscale product’s root inside group delay window. It is shown in [34] that the root of helps to give a “zooming in” on the signal, particularly at weak amplitudes (in this case ). Experimentation with this algorithm has shown that the group delay function gives best results on but that its root has better discriminative properties.

(9)

3) Area beneath multiscale product’s root inside group delay window. In the case of a spread singularity, the area beneath the multiscale product’s root can provide better discrimination of candidates.

(10)

The distributions of the feature vectors are modeled as two multivariate Gaussians using the EM algorithm [25], initialized with two random data points. Acceptance or rejection is based upon the likelihood of class , , given feature vector

(11)

Fig. 8 shows a typical distribution of the feature vectors for a segment of mixed voiced/unvoiced/silent speech. It has been found empirically that the cluster whose mean is furthest from the origin is most likely to contain the chosen candidates, marked “ .” Rejected candidates are marked “ .” The chosen GCI estimates are defined as . GOIs are calculated in the same way but with reversed signs where appropriate.

Fig. 8. Typical distribution of GCI feature vectors for a segment of voiced/unvoiced/silent speech. The chosen cluster, whose members are marked “ ” is the one whose mean is furthest from the origin. Rejected candidates are marked “ .”

D. Swallowing

The algorithm proposed thus far performs accurate singularity detection on an input signal without considering any characteristics peculiar to EGG waveforms. It is found that in natural conversational speech, singularities are often caused by swallowing and occasionally by electrical interference in the measurement apparatus and are usually single isolated impulse-like signals. Considering a maximum period all GCIs which are separated from a neighbouring GCI by more than are rejected, else they are kept providing: .

Experimentation has shown that provided the polarity of the recording is correct, swallowing only causes errors in closure detection so this technique is not applied to opening detection.

E. GOI Postfiltering

GOIs are detected from using the same approach as applied to GCI detection (with inverted signs where appropriate). However, the energy imparted by glottal opening is often significantly lower than glottal closure, which results in more erroneous GOI candidates. Assuming that a GOI always accompanies a GCI, postprocessing can be applied to use GCI estimates to improve GOIs accordingly.
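
The isolation rule of Section III-D can be sketched as below. The sketch assumes that a GCI is kept when at least one of its neighbours lies within the maximum glottal period; this reading of the acceptance condition, the function name, the 16 kHz sampling rate and the sample indices are all assumptions made for illustration.

```python
def reject_isolated_gcis(gcis, max_period):
    """Keep GCIs that have a neighbour no more than max_period away.

    Assumed rule: a candidate is treated as isolated (for example a
    swallow or an electrical click) when both neighbouring gaps exceed
    max_period, measured in samples.
    """
    kept = []
    for i, n in enumerate(gcis):
        prev_gap = n - gcis[i - 1] if i > 0 else float("inf")
        next_gap = gcis[i + 1] - n if i + 1 < len(gcis) else float("inf")
        if min(prev_gap, next_gap) <= max_period:
            kept.append(n)
    return kept


fs = 16000                # assumed sampling rate in Hz
max_period = fs // 50     # one period at a 50 Hz minimum glottal frequency
gcis = [100, 180, 260, 900, 1500, 1580]   # 900 is an isolated click
kept = reject_isolated_gcis(gcis, max_period)
print(kept)
```

Only the candidate at sample 900 is removed here, since both of its neighbours are more than one maximum period away.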
Fig. 9. SIGMA system diagram. The EGG signal is decomposed into multiple scales from which the half-wave rectified multiscale product is derived.
Spike detection is performed on by the negative-going zero crossings of the group delay function at samples . Feature vectors derived from
the ideal group delay slope and are clustered by an unsupervised EM algorithm to obtain the GCI estimates . Similarly, GOIs are detected using
the negative half-wave of the multiscale product . Postprocessing is applied to the GCI estimates to remove isolated clicks from sources other than glottal
closure to give . GOI postprocessing removes candidates which do not lie within the range of permitted open quotients, using the GCIs as references giving .
The main cause of error in GOI post-filtering is small perturbations in immediately preceding a glottal closure which trigger a zero crossing in the group delay function. A region surrounding the closure is therefore isolated, limiting the allowed open quotient, , to the bounds and . The first candidate which lies within these limits is accepted; if no candidate is found, then one is inserted following the current GCI at the previous open quotient.

The SIGMA system diagram is shown in Fig. 9. Symmetry can be seen between closure and opening detection up until the postprocessing stage; prior to this point the algorithm need only know the maximum frequency of the singularities to detect and so is suitable for general singularity detection.

IV. RESULTS AND DISCUSSION

The SIGMA algorithm has three parameters and these were set as follows.

• The group delay evaluation window size, which sets the maximum frequency of singularities which can be detected. In the case of voiced speech, the maximum glottal frequency is 400 Hz, giving a window of 2.5 ms.
• The maximum glottal period, so that isolated GCI candidates separated from neighboring candidates by more than this value are removed in the GCI postfiltering step. A minimum glottal frequency of 50 Hz leads to a maximum period of 20 ms.
• The minimum and maximum open quotients for GOI postfiltering. Their purpose is to isolate a region around a GCI inside which a GOI cannot be detected. They are set at 10% and 90%, respectively.

The MATLAB implementation of the chosen biorthogonal spline decomposition filters is called bior1.5.

A. Experiment 1: Evaluation With APLAWD and SAM

The APLAWD database [39] contains speech and contemporaneous EGG recordings of five short sentences, repeated ten times by five male and five female talkers. GCIs and GOIs were hand-labeled on the first repetition of every sentence independently of the algorithms under test, denoted , , and , , respectively. A subset of the SAM database [40] contains readings of duration approximately 150 seconds by two male and two female speakers and these were labeled in the same manner. SAM recordings are considered to contain more natural speech with a greater number of swallows and present a more challenging task for a glottal activity detector. The EGG recordings were run through the HQTx (GCI only), TXGEN and SIGMA algorithms and were evaluated by finding the number of estimates per reference cycle then classified as follows, depicted in Fig. 10.

1) Hit. One estimate per true glottal cycle.
2) Miss. No estimates per true glottal cycle.
3) False Alarm (FA). More than one estimate per glottal cycle.
4) False Alarm Total (FAT). Total number of false alarms (the number of estimates which are not hits).

The measures are defined as follows.

1) .
2) .
3) .
4) .
5) .

A glottal cycle is defined as for GCIs and for GOIs. Hit accuracy and hit bias are the RMS and mean errors between all hits and the corresponding ground-truth estimates, respectively. The testing strategy is identical to that employed in [19] with the addition of the FAT measure, which counts the total number of false alarms as a proportion of total estimates and not the number of reference cycles containing more than one estimate as a proportion of true glottal cycles. The overall figure of merit provides a single-valued measure of performance by expressing the hit rate as a proportion of all reference cycles summed with the number of non-hit estimates (the FAT).

The GCI results in Tables I and III show that SIGMA performs significantly better than HQTx and TXGEN when applied to either database. Notably HQTx is prone to false alarm errors whereas TXGEN is prone to miss errors; this agrees with the qualitative analysis of HQTx’s performance in Section II which showed that it is prone to false alarms at the end of segments of voiced speech. HQTx and TXGEN exhibit much greater FAT than FA which suggests that each false alarm is usually followed by successive false alarms within a single reference cycle. SIGMA’s miss, FAT and FA measures are broadly similar which tells us that successive false alarms do not usually occur within a
given reference cycle and that misses and false alarms have similar likelihood. SIGMA’s overall figures of merit are more than an order of magnitude greater than the other algorithms under test.

SIGMA’s GCI hit accuracy is in the order of a few samples which agrees with the statement in Section III-C that the true GCIs are usually a subset of the SIGMA candidate GCIs before clustering. SIGMA and HQTx hit bias are universally low but TXGEN’s estimates tend to occur slightly early.

SIGMA’s GOI results in Tables II and IV are also encouraging. The reliance upon the estimated GCIs results in similar hit, miss and false alarm rates, with diminished hit accuracy due to the greater difficulty of precisely locating openings. The gap in the overall figure of merit between SIGMA and TXGEN is again more than an order of magnitude.

Fig. 10. Testing strategy. A hit is one estimate occurring during a reference cycle. A miss is the absence of an estimate per reference cycle. If more than one estimate occurs per reference cycle, one false alarm (FA) is counted and the total number of false alarms in the cycle are added to false alarm total (FAT). Accuracy and bias are the RMS and mean errors between hits and the corresponding reference, respectively.

TABLE I
CLOSURE PERFORMANCE ON THE APLAWD DATABASE BY HQTX, TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE II
OPENING PERFORMANCE ON THE APLAWD DATABASE BY TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE III
CLOSURE PERFORMANCE ON THE SAM DATABASE BY HQTX, TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE IV
OPENING PERFORMANCE ON THE SAM DATABASE BY TXGEN, SIGMA (dG WAVELET AND bior1.5 WAVELET) ALGORITHMS

B. Experiment 2: Variation in Group Delay Window Size

The group delay evaluation window size was set according to the physical constraints of human speech, whose minimum fundamental period is around 2.5 ms. This experiment assesses the algorithm’s sensitivity to variation in the group delay window size on the APLAWD database.

The results presented in Fig. 11 show that 2.5 ms is indeed an optimal choice of window length. The reliance on GCIs to estimate GOIs means that intuitively the overall, hit, miss, and FAT rates should vary in a similar manner which is confirmed by these results.

FAT rates increase with decreasing window sizes due to the fact that more negative zero crossings can occur in the group delay function per unit time. In this case the true candidates remain a subset of all candidates, with a number of additional false ones arising. Providing the clustering algorithm can discriminate against the false candidates, those which are true should always be detected so false alarm rates should therefore increase slowly with decreasing window size.

Miss rates increase with window size as neighboring singularities can occur within a single group delay window and reduce the number of negative zero crossings. It becomes impossible for the GMM to find the correct candidates as they are no longer a subset of the candidate set, hence miss rates climb rapidly with increasing window size.

GCI bias and hit accuracy are relatively immune to variations in window size, suggesting that providing one candidate occurs per true period, it is statistically the correct choice. GOI bias and hit accuracy are more sensitive, showing the most significant increase with reduced window size. Bias increases monotonically with decreasing window length.

This experiment was repeated for male- and female-only speech. The results provide similar curves to the previous experiment that employs both genders, the optimum value being shifted up to approximately 3 ms for male voices and down to approximately 2 ms for female. The experiment with mixed male/female speech shows that variation in group delay size does not have a significant effect upon the results in the range of approximately 1.5 to 3.5 ms, hence performance is weakly dependent on gender.
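
The testing strategy of Section IV-A (Fig. 10) can be sketched as follows. Two points are assumptions made for this sketch rather than definitions taken from the paper: the boundaries of each reference cycle are placed midway between neighbouring reference instants, and FAT is computed as the number of estimates that are not hits. The function name and the sample values are hypothetical, and only raw counts, bias and accuracy are computed.

```python
import math


def evaluate(refs, ests):
    """Classify estimates against reference instants, one cycle at a time.

    Needs at least two reference instants.
    """
    # Assumed cycle boundaries: midpoints between neighbouring references,
    # with the two end cycles extended by half the adjacent period.
    mids = [(a + b) / 2.0 for a, b in zip(refs, refs[1:])]
    edges = ([refs[0] - (mids[0] - refs[0])] + mids
             + [refs[-1] + (refs[-1] - mids[-1])])
    hits, misses, false_alarms, errors = 0, 0, 0, []
    for r, lo, hi in zip(refs, edges, edges[1:]):
        in_cycle = [e for e in ests if lo <= e < hi]
        if len(in_cycle) == 1:
            hits += 1
            errors.append(in_cycle[0] - r)
        elif not in_cycle:
            misses += 1
        else:
            false_alarms += 1  # one FA per cycle holding several estimates
    bias = sum(errors) / len(errors) if errors else 0.0
    accuracy = (math.sqrt(sum(e * e for e in errors) / len(errors))
                if errors else 0.0)
    return {"hit": hits, "miss": misses, "fa": false_alarms,
            "fat": len(ests) - hits,  # every estimate that is not a hit
            "bias": bias, "accuracy": accuracy}


refs = [100, 200, 300, 400]
ests = [101, 290, 310, 402]
result = evaluate(refs, ests)
print(result)
```

In this toy case the third cycle holds two estimates, so it contributes one FA but two to the FAT, which is the distinction between the two measures drawn in the text.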
Fig. 11. Effect of varied group delay window length on (a) overall and hit, (b) miss and FAT, and (c) bias and hit accuracy. The choice of 2.5 ms from physical reasoning is close to the optimal value.

C. Experiment 3: Comparison With Cubic Spline Wavelet

The derivative-of-Gaussian (dG) cubic spline wavelet is the wavelet of choice for multiscale analysis in [17], [18] and [36]. Experiments with other common wavelets have shown that the bior1.5 biorthogonal spline wavelet is more effective for EGG analysis with this algorithm. The results in Tables I–IV show SIGMA using the dG wavelet (labeled SIGMA-dG) as well as the proposed bior1.5 (labeled SIGMA).

The performance of SIGMA is slightly reduced with the dG wavelet, particularly with increased false alarms and increased hit error on the opening tests. Miss rates are slightly reduced but the greater increase in false alarm rate diminishes the overall performance results.

V. CONCLUSION

We have shown that robust detection of GCIs and GOIs from EGG signals is particularly challenging at the transition regions around the ending of voicing. A new method for glottal activity detection from EGG recordings has been presented which is accurate even in these challenging regions. It first detects singularities in the EGG signal by the multiscale product of three dyadic scales. It then employs a technique based upon the group delay function which detects peaks in the multiscale product. In-

REFERENCES

[1] P. Davies, G. A. Lindsey, H. Fuller, and A. J. Fourcin, “Variation of glottal open and closed phases for speakers of English,” Proc. Inst. Acoust., vol. 8, no. 7, pp. 539–546, 1986.
[2] W. Hess and H. Indefrey, “Accurate pitch determination of speech signals by means of a laryngograph,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), 1984, vol. 9, pp. 73–76.
[3] H. Valbret, E. Moulines, and J. P. Tubach, “Voice transformation using PSOLA technique,” Speech Commun., vol. 11, no. 2, pp. 175–187, Jun. 1992.
[4] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, “Multi-microphone speech dereverberation using spatio-temporal averaging,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Vienna, Austria, Sep. 2004, pp. 809–812.
[5] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, no. 5–6, pp. 453–467, Dec. 1990.
[6] M. R. P. Thomas, J. Gudnason, and P. A. Naylor, “Data-driven voice source waveform modelling,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 3965–3968.
[7] J. Deller, “Some notes on closed phase glottal inverse filtering,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP–29, no. 4, pp. 917–919, Aug. 1981.
[8] J. Gudnason and M. Brookes, “Voice Source cepstrum coefficients for speaker identification,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), 2008, pp. 4821–4824.
[9] R. Colton and J. Casper, Understanding Voice Problems: A Physiological Perspective for Diagnosis and Treatment. New York: Williams & Wilkins, 1996.
[10] K. Verdolini, R. Chan, I. R. Titze, M. Hess, and W. Bierhals, “Correspondence of electroglottographic closed quotient to vocal fold impact stress in excised canine larynges,” J. Voice, vol. 12, no. 4, pp. 415–423, Feb. 1998.
[11] J. Gamboa, F. J. Jiménez-Jiménez, A. Nieto, I. Cobeta, A. Vegas, M. Orti-Pareka, T. Gasalla, J. A. Molina, and E. Garcia-Albea, “Acoustic voice analysis in patients with essential tremor,” J. Voice, vol. 12, no. 4, pp. 444–452, Feb. 1998.
[12] E. R. M. Abberton, D. M. Howard, and A. J. Fourcin, “Laryngographic assessment of normal voice: A tutorial,” Clinical Linguist. Phon., vol. 3, pp. 281–296, 1989.
[13] D. G. Childers, D. M. Hooks, G. P. Moore, L. Eskenazi, and A. L. Lalwani, “Electroglottography and vocal fold physiology,” J. Speech. Hear. Res., vol. 33, no. 2, pp. 245–254, Jun. 1990.
[14] D. M. Howard, “Variation of electrolaryngographically derived closed quotient for trained and untrained adult female singers,” J. Voice, vol. 9, no. 2, pp. 121–1223, Jun. 1995.
[15] N. Henrich, C. d’Alessandro, M. Castellengo, and B. Doval, “On the use of the derivative of electroglottographic signals for characterization of nonpathological voice phonation,” J. Acoust. Soc. Amer., vol. 115, no. 3, pp. 1321–1332, Mar. 2004.
[16] M. A. Huckvale, “Speech Filing System: Tools for Speech,” Tech. Rep. Univ. College London, London, U.K., 2004 [Online]. Available: http://www.phon.ucl.ac.uk/resource/sfs
[17] A. Bouzid and N. Ellouze, “Multiscale product of electroglottogram signal for glottal closure and opening instant detection,” in Proc. IMACS MultiConf. Comput. Eng. Syst. Applicat., 2006, vol. 1, pp. 106–109.
[18] A. Bouzid and N. Ellouze, “Glottal opening instant detection from speech signal,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Vienna, Austria, Sep. 2004, pp. 729–732.
[19] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, “Estimation of glottal closure instants in voiced speech using the DYPSA algorithm,” IEEE Trans. Speech Audio Process., vol. 15, no. 1, pp. 34–43, Jan. 2007.
[20] M. R. P. Thomas, N. D. Gaubitch, and P. A. Naylor, “Multichannel DYPSA for estimation of glottal closure instants in reverberant speech,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Poznan, Poland, Sep. 2007.
[21] K. S. Rao, S. R. M. Prasanna, and B. Yegnanarayana, “Determination of instants of significant excitation in speech using Hilbert envelope and group delay function,” IEEE Signal Process. Lett., vol. 14, no. 10, pp. 762–765, Oct. 2007.
[22] K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 8, pp. 1602–1613, Nov. 2008.
[23] W. Saidi, A. Bouzid, and N. Ellouze, “Evaluation of multi-scale product method and DYPSA algorithm for glottal closure instant detection,” in Proc. 3rd Int. Conf. Inf. Commun. Technol.: From Theory to Applicat. (ICTTA), Apr. 2008, pp. 1–5.
[24] M. Brookes, P. A. Naylor, and J. Gudnason, “A quantitative assessment of group delay methods for identifying glottal closures in voiced speech,” IEEE Trans. Speech Audio Process., vol. 14, no. 2, pp. 456–466, Mar. 2006.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[26] M. B. Higgins and J. H. Saxman, “A comparison of selected phonatory behaviours of healthy aged and young adults,” J. Speech Hear. Res., vol. 34, pp. 1000–1010, Oct. 1991.
[27] A. K. Krishnamurthy and D. G. Childers, “Two-channel speech analysis,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 4, pp. 730–743, Aug. 1986.
[28] D. M. Howard and G. Lindsey, “Conditioned variability in voicing offsets,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 3, pp. 406–407, Mar. 1988.
[29] J. C. Catford, Fundamental Problems in Phonetics. Bloomington, IN: Indiana Univ. Press, 1977.
[30] Y. Lebrun and J. Hasquin-Deleval, “On the so-called ‘dissociations’ between electroglottogram and phonogram,” Folia Phoniatrica, vol. 23, pp. 225–227, 1971.
[31] M. R. P. Thomas, J. Gudnason, and P. A. Naylor, “Application of the DYPSA algorithm to segmented time-scale modification of speech,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Lausanne, Switzerland, Aug. 2008.
[32] P. Jax and P. Vary, “On artificial bandwidth extension of telephone speech,” Signal Process., vol. 83, pp. 1707–1719, 2003.
[33] S. Mallat and S. Zhong, “Characterization of signals from multiscale edges,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 7, pp. 710–732, Jul. 1992.
[34] A. Bouzid and N. Ellouze, “Local regularity analysis at glottal opening and closure instants in electroglottogram signal using wavelet transform modulus maxima,” in Proc. Eurospeech, 2003, pp. 2837–2840.
[35] S. Mallat and W. L. Hwang, “Singularity detection and processing with wavelets,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 617–643, Mar. 1992.
[36] B. M. Sadler and A. Swami, “Analysis of multiscale products for step detection and estimation,” IEEE Trans. Inf. Theory, vol. 45, no. 3, pp. 1043–1051, Apr. 1999.
[37] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
[38] R. Smits and B. Yegnanarayana, “Determination of instants of significant excitation in speech using group delay function,” IEEE Trans. Speech Audio Process., vol. 5, no. 3, pp. 325–333, Sep. 1995.
[39] G. Lindsey, A. Breen, and S. Nevard, “SPAR’s Archivable Actual-Word Databases,” Tech. Rep. Univ. College London, London, U.K., 1987.
[40] D. Chan, A. Fourcin, D. Gibbon, B. Granstrom, M. Huckvale, G. Kokkinakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld, and J. Zeiliger, “EUROM – A spoken language resource for the EU,” in Proc. Eur. Conf. Speech Commun. Technol., Sep. 1995, pp. 867–870.

Mark Thomas (S’06) received the M.Eng. degree in electrical and electronic engineering from Imperial College, London, U.K., in 2006 where he is currently pursuing the Ph.D. degree.
His research interests include glottal-synchronous and multichannel speech processing, involving methods for analysis, prosodic manipulation and reverberation/noise reduction. His previous experience in industry was with the BBC R&D Department, where he worked on audio, video, and RF engineering.

Patrick Naylor (M’89–SM’07) received the B.Eng. degree in electronics and electrical engineering from the University of Sheffield, Sheffield, U.K., in 1986 and the Ph.D. degree from Imperial College, London, U.K., in 1990.
Since 1989, he has been a Member of Academic Staff in the Communications and Signal Processing Research Group, Imperial College London, where he is also Director of Postgraduate Studies. His research interests are in the areas of speech and audio signal processing and he has worked in particular on adaptive signal processing for acoustic echo control, speaker identification, multichannel speech enhancement, and speech production modeling. In addition to his academic research, he enjoys several fruitful links with industry in the U.K., U.S., and in mainland Europe.
Dr. Naylor is an Associate Editor of IEEE SIGNAL PROCESSING LETTERS and a member of the IEEE Signal Processing Society Technical Committee on Audio and Electroacoustics.