

The SIGMA Algorithm: A Glottal Activity Detector for Electroglottographic


Signals

Article  in  IEEE Transactions on Audio Speech and Language Processing · December 2009


DOI: 10.1109/TASL.2009.2022430 · Source: IEEE Xplore


2 authors, including:

Mark R. P. Thomas
Dolby Laboratories, Inc.

All content following this page was uploaded by Mark R. P. Thomas on 06 March 2015.



IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 8, NOVEMBER 2009 1557

The SIGMA Algorithm: A Glottal Activity Detector for Electroglottographic Signals

Mark R. P. Thomas, Student Member, IEEE, and Patrick A. Naylor, Senior Member, IEEE

Abstract—Accurate estimation of glottal closure instants (GCIs) and opening instants (GOIs) is important for speech processing applications that benefit from glottal-synchronous processing. The majority of existing approaches detect GCIs by comparing the differentiated EGG signal to a threshold and are able to provide accurate results during voiced speech. More recent algorithms use a similar approach across multiple dyadic scales using the stationary wavelet transform. All existing approaches are however prone to errors around the transition regions at the end of voiced segments of speech. This paper describes a new method for EGG-based glottal activity detection which exhibits high accuracy over the entirety of voiced segments, including, in particular, the transition regions, thereby giving significant improvement over existing methods. Following a stationary wavelet transform-based preprocessor, detection of excitation due to glottal closure is performed using a group delay function and then true and false detections are discriminated by Gaussian mixture modeling. GOI detection involves additional processing using the estimated GCIs. The main purpose of our algorithm is to provide a ground-truth for GCIs and GOIs. This is essential in order to evaluate algorithms that estimate GCIs and GOIs from the speech signal only, and is also of high value in the analysis of pathological speech where knowledge of GCIs and GOIs is often needed. We compare our algorithm with two previous algorithms against a hand-labeled database. Evaluation has shown an average GCI hit rate of 99.47% and GOI of 99.35%, compared to 96.08 and 92.54 for the best-performing existing algorithm.

Index Terms—Electroglottograph (EGG), glottal closure instants (GCIs), group delay function, laryngograph.

Manuscript received September 30, 2008; revised March 10, 2009. First published May 08, 2009; current version published August 14, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hiroshi Sawada.
The authors are with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: mark.r.thomas02@imperial.ac.uk; p.naylor@imperial.ac.uk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2009.2022430
1558-7916/$26.00 © 2009 IEEE
Authorized licensed use limited to: Imperial College London. Downloaded on January 4, 2010 at 08:01 from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

ALL voiced sounds are produced by an excitation signal that is filtered by a passive resonator called the vocal tract. This excitation is produced by the vocal folds, which consist of opposing ligaments that form a constriction at the top of the trachea as it joins the lower vocal tract. When air is expelled from the lungs at sufficient velocity through this orifice (often referred to as the glottis), the Bernoulli Effect results in a partial vacuum that causes the vocal folds to snap shut and disrupt the air flow. This glottal closure instant (GCI) is followed by a period during which the glottis is closed, until muscle tension and air pressure cause the folds to reopen at the glottal opening instant (GOI). The process repeats periodically as a series of pulses that produces “modal” voiced speech. The ratio of open time with respect to glottal period is termed the open quotient (OQ) [1].

Identification of GCIs in voiced speech is important for speech processing algorithms such as pitch tracking [2], prosodic speech modification [3], speech dereverberation [4], glottal-synchronous processing in speech synthesis [5] and voice source modeling [6]. Identification of GOIs is necessary for closed-phase LPC analysis and subsequent inverse filtering to obtain an estimate of glottal volume flow from a speech signal [7] for applications such as feature extraction for speaker identification [8]. Further uses are found in the analysis of pathological speech, including types of dysphonia [9], vocal fold impact stress [10] and essential tremor [11]. We refer to GOIs and GCIs as glottal activity.

Fig. 1. (a) Speech signal, (b) the corresponding EGG signal, and (c) the EGG time derivative for / /. Negative peaks due to glottal opening are weak in (c).

The Electroglottograph (EGG) (or Laryngograph) signal [12] is a measurement of the electrical conductance of the glottis captured contemporaneously with speech recordings. The measured EGG signal is proportional to the glottal contact area, whose derivative (DEGG) during voiced speech contains short, high-amplitude impulsive temporal features (spikes) due to glottal closure and smaller features of opposite sign due to glottal opening. An example of a voiced speech segment, the corresponding EGG recording and its derivative are shown in Fig. 1. Many approaches analyze the EGG by searching for spikes in DEGG [13]–[16] and compare their amplitudes with thresholds to obtain an estimate of glottal activity during voiced

speech. Recent approaches have applied multiscale analysis to detect glottal activity as singularities in the EGG signal [17] and speech signal [18]. Existing techniques are, however, often prone to errors around the end of voicing as discussed in Section II.

In cases where only the speech signal is available, new algorithms have recently been proposed which estimate glottal activity from the speech signal alone [19]–[23], and this is an ongoing topic of research with seemingly ever-improving results. These algorithms enable glottal activity information to be determined in real-world applications in which, typically, the EGG signal is not available. However, as such methods improve, their evaluation requires ever more accurate references. This requirement, alongside the application to the study of pathological speech, further motivates the development of better EGG-based detection algorithms.

This paper describes the Singularity in EGG by Multiscale Analysis (SIGMA) algorithm. SIGMA benefits from the use of multiscale processing but it novelly extends the approach by performing spike detection on the multiscale product using a group delay method [24] which circumvents the need for thresholding. The robustness of our approach to false detections is further enhanced by Gaussian mixture modeling [25] which is used to remove detections with unlikely features. The proposed method provides GCI estimation with outstandingly high accuracy which also achieves similarly accurate GOI detection. Additionally, the algorithm makes no assumptions about the nature of the EGG signal other than the bounds on the range of glottal frequency and open quotients [26]; SIGMA may therefore have many further uses as it is also suitable for singularity detection in applications outside the field of speech processing.

This paper is organized as follows. Section II reviews the characteristics of the EGG signal and the methodology employed by some existing algorithms. Section III describes multiscale analysis, the use of the group delay function and Gaussian mixture modeling for spike detection in the multiscale product. The proposed SIGMA algorithm is compared with existing techniques and evaluated in Section IV. Conclusions are drawn in Section V.

II. INTERPRETING EGG SIGNALS

A. HQTx and TXGEN

In Section IV, the performance of SIGMA is compared with two existing algorithms: High Quality Time of excitation (HQTx) and Time of eXcitation GENerator (TXGEN) [16]. The following is a brief description of their operation.

HQTx uses two derived functions: DEGG and an estimation of instantaneous gradient. A threshold function varies dynamically with the EGG signal, whose minimum is set by periods of silence assumed to lie during the first and last 20 ms of the EGG recording. The instants of time when the DEGG and instantaneous gradient exceed this threshold are the estimated GCIs.

TXGEN uses a more straightforward approach but attempts to detect both GCIs and GOIs. After low-pass filtering the EGG signal at 3 kHz, it is differentiated to find DEGG. High and low thresholds are set by the extrema of the DEGG signal from the entire recording multiplied by a constant-valued coefficient. If DEGG passes through both thresholds within a set period of time, an estimated GCI is flagged. A GOI is the point in the EGG signal whose amplitude is equal to the amplitude at the preceding GCI.

B. Detection Errors

Both SIGMA and the algorithms described are evaluated against a large hand-labeled database. The remainder of this section describes common features of the EGG signal, those cases where interpretation of the EGG signal requires clarification and the resulting errors made by existing algorithms.

A voiced speech signal, its corresponding time-aligned EGG signal and the EGG derivative are shown in Fig. 1. Time alignment is achieved by ensuring that the lip-microphone propagation distance plus an estimate of the length of the talker’s vocal tract is a constant value, then subtracting the corresponding delay. We define a positive EGG signal to be high glottal contact area, giving positive- and negative-going transients for GCIs and GOIs, respectively, with corresponding spikes in the EGG derivative.

Errors in GCI detection can be divided into two categories [19]: False alarm errors are made when more than one GCI is detected within a reference cycle; Miss errors are made when no GCI is detected within a reference cycle (GOI errors are treated in the same manner). Errors occur when certain types of EGG signal, discussed in the following sections, cause a poor estimate of the signal thresholds described in Section II-A.

C. “False Alarm” Errors

It has been shown that, for normal “modal” voiced speech, the frequency of oscillation of the glottis and the open quotient are dependent on phoneme and voice quality [12], [14]. Studies have further revealed that, for a given talker, the difficulty of detecting glottal closure is largely independent of the sound produced but that interesting effects occur at the boundaries of voiced/unvoiced speech, noting in particular [27]:
1) “Vocal fold vibration does not stop abruptly at the end of voicing, but slowly decays as the vocal folds come to a rest position.”
2) “It is possible for vocal fold vibration to continue without the generation of any significant energy,” termed “breathy offsets” [28].
This is examined in greater detail in [28], where a third phenomenon is observed at the end of voicing.
3) “A persistence of energy in the speech waveform after the EGG waveform has dropped virtually to zero,” termed “breathy voice.”

In the case of breathy offsets, GCIs can be detected from the EGG long after the speech amplitude has significantly diminished as the EGG signal remains modal, with increasing open phases that result in a breathier sound [28]. This is demonstrated in Fig. 2, showing 14 cycles of breathy offset terminating in breathy voice when the EGG signal finally loses modality.

In the case of breathy voice, observed throughout case (3) and at the very end of case (2), the glottis is “flapping in the breeze” [29] with insufficient contact to register on the EGG waveform. As described in [30], “If the glottis does not shut


quickly enough no vocal wave is generated in the supraglottic cavity,” and is demonstrated in Fig. 3. In both cases, a number of erroneous GCIs are detected by HQTx during segments of breathy voice ( ) until its dynamic threshold is no longer exceeded. These errors also often occur at erratic intervals. For the hand-labeled reference, marked “ ,” the labeler would not mark any GCIs where there is no visible instant defining the periodicity, as would be the case with all instances of breathy voice.

Fig. 2. (a) Speech signal, (b) EGG signal, and (c) its time derivative with overlaid HQTx GCI estimation markers at the end of a voiced speech segment, /u/, exhibiting “breathy offset” (cycles 8–21) and briefly “breathy voice.” The first 22 GCIs are identified correctly (marked “ ”) but the last three (marked “ ”) are erroneous.

Fig. 3. (a) Speech signal, (b) EGG signal, and (c) its time derivative with overlaid HQTx GCI estimation markers at the end of a voiced speech segment, /I/, exhibiting “breathy voice.” The first three GCIs are identified correctly (marked “ ”) but the last four (marked “ ”) are erroneous. Negative peaks due to glottal opening are significant in (c).

Breathy voice represents a natural transition from modal voiced speech to unvoiced or silence [28]. It is further noted that this usually lasts for just a few cycles of speech but erroneous estimates by a GCI detector during these segments can cause significant problems for glottal-synchronous algorithms. For example, a pitch tracker [2] that calculates pitch on a cycle-by-cycle basis will give highly erratic results. Glottal-synchronous speech processing algorithms such as prosodic speech modification [3], speech dereverberation [4], speech synthesis [5], and voice source modeling [6] all rely upon the manipulation of individual cycles of speech. Any fricatives or plosives following segments of voiced speech will be treated as periodic, giving rise to particularly annoying artefacts [31].

An example is shown in Fig. 4 where HQTx is used to drive the PSOLA algorithm [3] to increase the duration of a speech signal by three times without affecting prosody or formant structure. Applications for increasing the duration of a speech signal include enhancing intelligibility and lip synchronization in motion video. It is achieved by repeating cycles of voiced speech and concatenating them with an estimate of the correct period as shown in the first 70 ms of Fig. 4(b). Unvoiced speech and voiced-unvoiced transitions do not exhibit such periodicity so a common approach is to leave these segments unmodified [31]. This is not the case due to the erroneous detections at the voiced-unvoiced transition from 70–150 ms, leading to strange artefacts that detract from the otherwise natural sound of the processed voiced speech segments.

Fig. 4. (a) Original speech signal with correct GCIs (marked “ ”) and false alarm errors (marked “ ”) and (b) time scale expanded by three times with the PSOLA Algorithm. Voiced cycles are copied and concatenated to increase duration; this works well for modal speech but fails when GCIs are detected in the wrong location.

Sudden changes in EGG amplitude can also cause false alarm errors in dynamic threshold-based algorithms if the threshold is too low. A further problem with dynamic thresholds arises when GCIs have slow rise times [17], causing not a spike but a spread pulse in the DEGG. In this case, we define the GCI as the center of energy of the pulse.

D. “Miss” Errors

A common feature at the end of voiced segments is a reduced EGG signal amplitude compared with normal modal voice.


TXGEN’s thresholds are proportional to the extrema of the entire signal and it is generally not prone to the false alarm errors exhibited by HQTx. It instead gives miss errors where the EGG amplitude is consistently low, particularly at the very beginning and very end of voiced speech segments. For the majority of glottal-synchronous algorithms this does not pose a significant problem. If, however, the amplitude of the EGG signal momentarily drops below the fixed threshold, TXGEN can miss a small number of isolated cycles which can be problematic for certain applications. Data-Driven Voice Source Modeling [6], for example, derives feature vectors from individual cycles of voiced speech which are then clustered to determine classes of voice source. This has been demonstrated to have applications in speech compression [6] and artificial bandwidth extension [32]. A missed GCI results in features being derived from multiple cycles of speech, causing misclassification and distorting the processed signal.

HQTx can exhibit miss errors following a sudden decrease in EGG amplitude due to smoothing of the dynamic threshold that is not employed in TXGEN.

E. False Alarm/Miss Tradeoff

In general, HQTx is prone to false alarm errors, particularly at the end of voiced segments. This is verified in Section IV; it is further shown that miss errors are far less common. In contrast, TXGEN is generally prone to miss errors with relatively few false alarms; this is also verified in Section IV.

HQTx fails largely because thresholds are estimated over too short a window and TXGEN because thresholds are based upon single global thresholds for the whole speech utterance. The constant of proportionality used to set the threshold from signal extrema can be varied in TXGEN’s function call. The default was empirically chosen to give the best tradeoff between miss and false alarm errors; a marginally lower value can result in increased false alarms and decreased misses. There is therefore a clear tradeoff between false alarms and misses caused by the thresholding approach employed by the majority of existing algorithms. The severity of this type of error is application-specific but, when used as a reference to evaluate speech-based GCI/GOI detectors, neither should be deemed acceptable. SIGMA instead employs a novel method for detecting GCIs and GOIs that does not use thresholding, circumventing the false alarm/miss tradeoff and providing accurate estimates for the entire EGG signal.

F. EGG at Glottal Opening

A glottal closure instant is usually followed by a GOI, which manifests itself as a weaker spike of opposite sign in the EGG derivative [13] whose amplitude is largely speaker-dependent. GOI detection suffers from the same problems as GCI detection but is more challenging because of the low amplitude of the opening pulses. Compare the negative halves of the EGG signals in Figs. 1 and 3. In Fig. 1, glottal opening results in spread pulses, hence the concept of an opening phase rather than opening instant is often used. However, as in the case of a spread-pulse GCI, we consistently define the GOI as its center of energy. Fig. 3 represents a speaker for whom the opening spikes in DEGG are easier to locate.

Fig. 5. Three-level dyadic signal decomposition on a signal into detail, , and approximation, , signals. (a) is the Dyadic Wavelet Transform (DWT), and (b) the Stationary Wavelet Transform (SWT), an overcomplete version of the DWT useful in the detection of discontinuities.

III. GLOTTAL ACTIVITY DETECTION WITH THE SIGMA ALGORITHM

Detection of glottal activity from an EGG signal involves isolating regions of discontinuity, sometimes referred to as singularities. A common approach is the detection of spikes in the derivative of the EGG signal, whose estimates are refined using the peak amplitude of the DEGG and a longer-term measure of the change in EGG amplitude as a cost function.

A. Multiscale Analysis

Let us consider a generalization of the HQTx approach that employs two estimates of signal gradient. The dyadic wavelet transform [33] involves iteratively decomposing a signal into decimated subbands; a three-level decomposition is shown in Fig. 5(a), where the downsampling and filtering operations split the signal into octave-wide subbands.

The detail and approximation filters have high- and low-pass characteristics, respectively. It is shown in [34] that, for singularity detection in EGG signals, each filter in the filterbank should be a first-order differentiation operator at increasing levels of smoothing. A wavelet fulfilling this criterion is described as having one vanishing moment and discontinuities in the input signal are seen as converging maxima across scales [35].

A derivative-of-Gaussian (dG) approximation with cubic spline wavelet decomposition filters is used in [36] and [34] which provides the differentiation and smoothing we require. However, an arbitrary number of filters exist which fulfil the same criteria. A number of derivations can be found in [37] but give little idea as to their use in the detection of singularities. In order to determine the relative performance, the proposed algorithm was run with five different sets of decomposition filters. Section IV-C presents a performance comparison between the chosen wavelet, whose filters are shown in Fig. 6, and the popular cubic spline dG wavelet.

The dyadic wavelet transform is dyadic in both scale and time. Only scale is of interest in singularity detection, so we do not decimate as shown in Fig. 5(b). Instead, both filters are upsampled by 2 at each iteration to implement the


change of scale to form and at scale . This overcomplete representation of a signal is discussed in detail in [35] and is given many names including: Stationary Wavelet Transform (SWT), Algorithme à Trous (Hole Algorithm), Redundant Wavelet Transform (RWT) and Undecimated Wavelet Transform (UWT). The signal’s length remains unchanged throughout the filterbank tree, allowing simple sample-by-sample multiplication of the signal at different scales to find converging maxima.

Fig. 6. (a) Approximation and (b) detail analysis filters for multiscale analysis. Iterating these filters through a dyadic filterbank constructs a biorthogonal spline wavelet with one vanishing moment.

Denote the wavelet , where . The SWT of the EGG signal at scale is

(1)

where , plus the remaining coarse scale information denoted . This is a simple linear filtering operation

(2)

where is the SWT of at scale and are the approximation coefficients at scale . The multiscale product, , is formed by

(3)

where it is assumed that the lowest scale to include is always 1. The sign of is inverted compared with a DEGG using the chosen wavelet, hence a minus sign is included to maintain the convention. The de-noising effect of at each scale in conjunction with the multiscale product means that is near-zero except at discontinuities across the first scales of as depicted in Fig. 7(b), allowing better identification of discontinuities than the DEGG. The function can be half-wave rectified to contain peaks pertaining only to GCIs, , or GOIs, , which aids the group delay function in the following step. The value of is limited by , but it is often no greater than as the region of support (RoS) of and becomes prohibitively large, demanding high processing resources and smoothing adjacent discontinuities. is deemed a good compromise [36].

Fig. 7. EGG waveform, multiscale product and group delay function for GCI detection. Candidates are marked “ ” and chosen candidates are marked “ .” The ideal slope, marked in a dashed line on the lowest plot, is the slope which would exist if the candidates were perfect impulses.

B. Group Delay Function

A group delay function (GD) [24] can be used for detection of peaks in linear prediction residuals of speech and can be applied to locate spikes in any signal if their minimum separation, , is known. Consider the multiscale product, , and an -sample windowed segment beginning at sample

(4)

The group delay of is given by [38]

(5)

where is the Fourier transform of and is the Fourier transform of at frequency . If , where is a unit impulse function, it follows from (5) that . In the presence of noise, remains constant but with a degree of additive noise, so an averaging procedure needs to be performed over ; different approaches are reviewed in [24]. The Energy-Weighted Group Delay was deemed the most appropriate [20], defined as

(6)

Manipulation yields the simplified expression

(7)


which is an efficient time-domain formulation and can be viewed as the “center of energy” of , bounded in the range . The location of the negative-going zero crossings of gives an accurate estimation of the location of a spike in a function as depicted in Fig. 7(c). Additionally, if a spike is spread in time then the group delay method will find its center of energy, which is particularly useful in the case of the “redoubled” GCI discussed in [17]. The same analysis is applied to to provide , whose negative-going zero crossings are GOI candidates.
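The time-domain "center of energy" view of the group delay can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions and not the authors' implementation: the function names are hypothetical, the window is rectangular with an odd length, and the output is indexed by window center so that a perfect impulse yields a negative unit slope passing through zero at the impulse location.

```python
def energy_weighted_group_delay(p, window_len):
    """Center of energy of p inside a sliding window, relative to the window center.

    Assumptions (not from the paper): rectangular window, odd window_len,
    output sample c corresponds to the window centered on input sample c.
    Windows with zero energy are assigned 0.0.
    """
    half = (window_len - 1) // 2
    d = [0.0] * len(p)
    for c in range(len(p)):
        num = 0.0  # sum of (position in window) * energy
        den = 0.0  # sum of energy
        for m in range(window_len):
            i = c - half + m
            if 0 <= i < len(p):
                energy = p[i] * p[i]
                num += m * energy
                den += energy
        if den > 0.0:
            d[c] = num / den - half
    return d


def negative_going_zero_crossings(d):
    """Indices where d passes from strictly positive to zero or below."""
    return [i for i in range(1, len(d)) if d[i - 1] > 0.0 and d[i] <= 0.0]


# A perfect impulse at sample 10 and a symmetric spread pulse centered on 25.
p = [0.0] * 40
p[10] = 1.0
p[24], p[25], p[26] = 0.5, 1.0, 0.5
d = energy_weighted_group_delay(p, window_len=9)
print(negative_going_zero_crossings(d))  # prints [10, 25]
```

No amplitude threshold appears anywhere in this sketch: the spread pulse is located at its center of energy in the same way as the impulse, even though its peak shape differs.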

C. Candidate Selection

The true GCIs are usually a subset of the negative-going zero crossings of , with additional false crossings during unvoiced speech, silence and occasionally between GCIs. Many existing approaches concentrate only on those areas where false candidates are unlikely to occur. The following candidate selection technique aims to remove all false candidates to provide a set of true GCIs throughout an entire segment of speech. Let the number of candidates be occurring at samples , . Three measurements construct a feature vector, , from which is derived a feature matrix, . The features are defined as follows.

1) Consistency of the group delay gradient. In the case of a Dirac pulse, is a negative unit slope, with a zero crossing at the location of the impulse and width samples, as shown in Fig. 7(c). A spread pulse or the presence of noise will cause the slope to deviate from the ideal shape, denoted . The RMS error between ideal and measured is calculated as

(8)

2) Peak value of multiscale product’s root inside group delay window. It is shown in [34] that the root of helps to give a “zooming in” on the signal, particularly at weak amplitudes (in this case ). Experimentation with this algorithm has shown that the group delay function gives best results on but that its root has better discriminative properties.

(9)

3) Area beneath multiscale product’s root inside group delay window. In the case of a spread singularity, the area beneath the multiscale product’s root can provide better discrimination of candidates.

(10)

The distributions of the feature vectors are modeled as two multivariate Gaussians using the EM algorithm [25], initialized with two random data points. Acceptance or rejection is based upon the likelihood of class , , given feature vector

(11)

Fig. 8. Typical distribution of GCI feature vectors for a segment of voiced/unvoiced/silent speech. The chosen cluster, whose members are marked “ ” is the one whose mean is furthest from the origin. Rejected candidates are marked “ .”

Fig. 8 shows a typical distribution of the feature vectors for a segment of mixed voiced/unvoiced/silent speech. It has been found empirically that the cluster whose mean is furthest from the origin is most likely to contain the chosen candidates, marked “ .” Rejected candidates are marked “ .” The chosen GCI estimates are defined as . GOIs are calculated in the same way but with reversed signs where appropriate.

D. Swallowing

The algorithm proposed thus far performs accurate singularity detection on an input signal without considering any characteristics peculiar to EGG waveforms. It is found that in natural conversational speech, singularities are often caused by swallowing and occasionally by electrical interference in the measurement apparatus and are usually single isolated impulse-like signals. Considering a maximum period all GCIs which are separated from a neighbouring GCI by more than are rejected, else they are kept providing: .

Experimentation has shown that provided the polarity of the recording is correct, swallowing only causes errors in closure detection so this technique is not applied to opening detection.

E. GOI Postfiltering

GOIs are detected from using the same approach as applied to GCI detection (with inverted signs where appropriate). However, the energy imparted by glottal opening is often significantly lower than glottal closure, which results in more erroneous GOI candidates. Assuming that a GOI always accompanies a GCI, postprocessing can be applied to use GCI estimates to improve GOIs accordingly.
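The postprocessing rules of Sections III-D and III-E can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation. The function names are hypothetical. A GCI is treated as isolated when neither neighbour lies within the maximum glottal period. The open quotient of a GOI candidate is measured from the candidate to the next GCI. The 10% and 90% bounds are the values quoted in Section IV, and the starting open quotient of 0.5 is an arbitrary placeholder. When no candidate in a cycle is acceptable, a GOI is inserted at the previous open quotient.

```python
def reject_isolated_gcis(gcis, max_period):
    """Drop GCIs with no neighbour within max_period samples (assumed isolation rule)."""
    kept = []
    for i, g in enumerate(gcis):
        near_prev = i > 0 and g - gcis[i - 1] <= max_period
        near_next = i + 1 < len(gcis) and gcis[i + 1] - g <= max_period
        if near_prev or near_next:
            kept.append(g)
    return kept


def select_gois(gcis, goi_candidates, oq_min=0.10, oq_max=0.90, initial_oq=0.5):
    """Pick one GOI per glottal cycle using open-quotient bounds.

    For each pair of consecutive GCIs, the first candidate whose open quotient
    lies in [oq_min, oq_max] is accepted; otherwise a GOI is inserted using the
    open quotient of the previous cycle (initial_oq is a hypothetical default).
    """
    candidates = sorted(goi_candidates)
    gois = []
    prev_oq = initial_oq
    for start, end in zip(gcis, gcis[1:]):
        period = end - start
        chosen = None
        for c in candidates:
            if start < c < end:
                oq = (end - c) / period  # open time as a fraction of the cycle
                if oq_min <= oq <= oq_max:
                    chosen = c
                    prev_oq = oq
                    break
        if chosen is None:
            chosen = end - round(prev_oq * period)
        gois.append(chosen)
    return gois


# Sample indices at a hypothetical 8 kHz rate; 160 samples correspond to 20 ms.
gcis = reject_isolated_gcis([100, 180, 260, 340, 900], max_period=160)
print(gcis)                                      # prints [100, 180, 260, 340]
print(select_gois(gcis, [140, 176, 225, 336]))   # prints [140, 225, 305]
```

In the example, the click at sample 900 is removed as isolated, the candidate at 176 is never reached because 140 is accepted first, and the candidate at 336 is rejected because its open quotient of 5% falls below the lower bound, so a GOI is inserted at the previous open quotient instead.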


Fig. 9. SIGMA system diagram. The EGG signal is decomposed into multiple scales from which the half-wave rectified multiscale product is derived.
Spike detection is performed on by the negative-going zero crossings of the group delay function at samples . Feature vectors derived from
the ideal group delay slope and are clustered by an unsupervised EM algorithm to obtain the GCI estimates . Similarly, GOIs are detected using
the negative half-wave of the multiscale product . Postprocessing is applied to the GCI estimates to remove isolated clicks from sources other than glottal
closure to give . GOI postprocessing removes candidates which do not lie within the range of permitted open quotients, using the GCIs as references giving .

The main cause of error in GOI postfiltering is small perturbations immediately preceding a glottal closure, which trigger a zero crossing in the group delay function. A region surrounding the closure is therefore isolated by limiting the allowed open quotient to lie between a lower and an upper bound. The first candidate which lies within these limits is accepted; if no candidate is found, then one is inserted following the current GCI at the previous open quotient.
The SIGMA system diagram is shown in Fig. 9. Symmetry can be seen between closure and opening detection up until the postprocessing stage; prior to this point the algorithm need only know the maximum frequency of the singularities to detect and so is suitable for general singularity detection.

IV. RESULTS AND DISCUSSION
The SIGMA algorithm has three parameters and these were set as follows.
• The group delay evaluation window size, and therefore the maximum frequency of singularities which can be detected. In the case of voiced speech, the maximum glottal frequency is 400 Hz, giving a window of 2.5 ms.
• The maximum glottal period, so that isolated GCI candidates separated from neighboring candidates by more than this value are removed in the GCI postfiltering step. A minimum glottal frequency of 50 Hz leads to a maximum period of 20 ms.
• The minimum and maximum open quotients for GOI postfiltering. Their purpose is to isolate a region around a GCI inside which a GOI cannot be detected. They are set at 10% and 90%, respectively.
The MATLAB implementation of the chosen biorthogonal spline decomposition filters is called bior1.5.

A. Experiment 1: Evaluation With APLAWD and SAM
The APLAWD database [39] contains speech and contemporaneous EGG recordings of five short sentences, repeated ten times by five male and five female talkers. GCIs and GOIs were hand-labeled on the first repetition of every sentence independently of the algorithms under test. A subset of the SAM database [40] contains readings of duration approximately 150 seconds by two male and two female speakers and these were labeled in the same manner. SAM recordings are considered to contain more natural speech with a greater number of swallows and present a more challenging task for a glottal activity detector. The EGG recordings were run through the HQTx (GCI only), TXGEN and SIGMA algorithms and were evaluated by finding the number of estimates per reference cycle, which were then classified as follows, as depicted in Fig. 10.
1) Hit. One estimate per true glottal cycle.
2) Miss. No estimates per true glottal cycle.
3) False Alarm (FA). More than one estimate per glottal cycle.
4) False Alarm Total (FAT). Total number of false alarms (the number of estimates which are not hits).
The measures are defined as follows.
1) .
2) .
3) .
4) .
5) .
A glottal cycle is defined as for GCIs and for GOIs. Hit accuracy and hit bias are the RMS and mean errors between all hits and the corresponding ground-truth estimates, respectively. The testing strategy is identical to that employed in [19] with the addition of the FAT measure, which counts the total number of false alarms as a proportion of total estimates and not the number of reference cycles containing more than one estimate as a proportion of true glottal cycles. The overall figure of merit provides a single-valued measure of performance by expressing the hit rate as a proportion of all reference cycles summed with the number of non-hit estimates (the FAT).
The GCI results in Tables I and III show that SIGMA performs significantly better than HQTx and TXGEN when applied to either database. Notably HQTx is prone to false alarm errors whereas TXGEN is prone to miss errors; this agrees with the qualitative analysis of HQTx's performance in Section II which showed that it is prone to false alarms at the end of segments of voiced speech. HQTx and TXGEN exhibit much greater FAT than FA, which suggests that each false alarm is usually followed by successive false alarms within a single reference cycle. SIGMA's miss, FAT and FA measures are broadly similar, which tells us that successive false alarms do not usually occur within a


given reference cycle and that misses and false alarms have similar likelihood. SIGMA's overall figures of merit are more than an order of magnitude greater than the other algorithms under test.
SIGMA's GCI hit accuracy is in the order of a few samples, which agrees with the statement in Section III-C that the true GCIs are usually a subset of the SIGMA candidate GCIs before clustering. SIGMA and HQTx hit bias are universally low but TXGEN's estimates tend to occur slightly early.
SIGMA's GOI results in Tables II and IV are also encouraging. The reliance upon the estimated GCIs results in similar hit, miss and false alarm rates, with diminished hit accuracy due to the greater difficulty of precisely locating openings. The gap in the overall figure of merit between SIGMA and TXGEN is again more than an order of magnitude.

B. Experiment 2: Variation in Group Delay Window Size
The group delay evaluation window size was set according to the physical constraints of human speech, whose minimum fundamental period is around 2.5 ms. This experiment assesses the algorithm's sensitivity to variation in the group delay window size on the APLAWD database.
The results presented in Fig. 11 show that 2.5 ms is indeed an optimal choice of window length. The reliance on GCIs to estimate GOIs means that intuitively the overall, hit, miss, and FAT rates should vary in a similar manner, which is confirmed by these results.
FAT rates increase with decreasing window sizes due to the fact that more negative zero crossings can occur in the group delay function per unit time. In this case the true candidates remain a subset of all candidates, with a number of additional false ones arising. Provided the clustering algorithm can discriminate against the false candidates, those which are true should always be detected, so false alarm rates should increase slowly with decreasing window size.
Miss rates increase with window size as neighboring singularities can occur within a single group delay window and reduce the number of negative zero crossings. It becomes impossible for the GMM to find the correct candidates as they are no longer a subset of the candidate set, hence miss rates climb rapidly with increasing window size.
GCI bias and hit accuracy are relatively immune to variations in window size, suggesting that provided one candidate occurs per true period, it is statistically the correct choice. GOI bias and hit accuracy are more sensitive, showing the most significant increase with reduced window size. Bias increases monotonically with decreasing window length.
This experiment was repeated for male- and female-only speech. The results provide similar curves to the previous experiment that employs both genders, the optimum value being shifted up to approximately 3 ms for male voices and down to approximately 2 ms for female. The experiment with mixed male/female speech shows that variation in group delay window size does not have a significant effect upon the results in the range of approximately 1.5 to 3.5 ms, hence performance is weakly dependent on gender.

TABLE I
CLOSURE PERFORMANCE ON THE APLAWD DATABASE BY HQTX, TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE II
OPENING PERFORMANCE ON THE APLAWD DATABASE BY TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE III
CLOSURE PERFORMANCE ON THE SAM DATABASE BY HQTX, TXGEN, SIGMA (dG AND bior1.5 WAVELET) ALGORITHMS

TABLE IV
OPENING PERFORMANCE ON THE SAM DATABASE BY TXGEN, SIGMA (dG WAVELET AND bior1.5 WAVELET) ALGORITHMS

Fig. 10. Testing strategy. A hit is one estimate occurring during a reference cycle. A miss is the absence of an estimate per reference cycle. If more than one estimate occurs per reference cycle, one false alarm (FA) is counted and the total number of false alarms in the cycle is added to the false alarm total (FAT). Accuracy and bias are the RMS and mean errors between hits and the corresponding reference, respectively.
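The testing strategy of Experiment 1 (Fig. 10) can be sketched in Python as follows. This is an illustrative reading of the verbal descriptions in Section IV-A rather than the authors' code. The function name, the placement of cycle boundaries midway between successive reference instants, the decision to ignore estimates that fall outside every reference cycle, and the exact form of each rate (in particular the overall figure, taken here as hits divided by the sum of reference cycles and FAT) are assumptions.

```python
import math


def evaluate(reference, estimates):
    """Classify estimates against reference instants, one reference per cycle."""
    ref = sorted(reference)
    if len(ref) < 2:
        raise ValueError("need at least two reference instants")
    # Assumed cycle boundaries: midpoints between successive references,
    # extended by half a period at either end.
    mids = [(a + b) / 2 for a, b in zip(ref, ref[1:])]
    first = ref[0] - (ref[1] - ref[0]) / 2
    last = ref[-1] + (ref[-1] - ref[-2]) / 2
    edges = [first] + mids + [last]

    hits = misses = fa = fat = 0
    errors = []
    for r, lo, hi in zip(ref, edges, edges[1:]):
        inside = [e for e in estimates if lo <= e < hi]
        if len(inside) == 1:
            hits += 1
            errors.append(inside[0] - r)
        elif not inside:
            misses += 1
        else:
            fa += 1             # one false-alarm cycle
            fat += len(inside)  # every estimate in it is a non-hit

    cycles = len(ref)
    counted = hits + fat  # estimates that fell inside some reference cycle
    nan = float("nan")
    return {
        "hit_rate": hits / cycles,
        "miss_rate": misses / cycles,
        "fa_rate": fa / cycles,
        "fat_rate": fat / counted if counted else 0.0,
        "overall": hits / (cycles + fat),
        "bias": sum(errors) / len(errors) if errors else nan,
        "accuracy": math.sqrt(sum(e * e for e in errors) / len(errors)) if errors else nan,
    }


if __name__ == "__main__":
    print(evaluate([100, 200, 300, 400], [101, 198, 205, 400]))
```

In this toy call the four reference cycles contain two hits, one miss, and one false-alarm cycle holding two estimates, so FA counts one cycle while FAT counts two estimates. This is the distinction the text draws between the two measures.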


Fig. 11. Effect of varied group delay window length on (a) overall and hit, (b) miss and FAT, and (c) bias and hit accuracy. The choice of 2.5 ms from physical reasoning is close to the optimal value.

C. Experiment 3: Comparison With Cubic Spline Wavelet
The derivative-of-Gaussian (dG) cubic spline wavelet is the wavelet of choice for multiscale analysis in [17], [18] and [36]. Experiments with other common wavelets have shown that the bior1.5 biorthogonal spline wavelet is more effective for EGG analysis with this algorithm. The results in Tables I–IV show SIGMA using the dG wavelet (labeled SIGMA-dG) as well as the proposed bior1.5 (labeled SIGMA).
The performance of SIGMA is slightly reduced with the dG wavelet, particularly with increased false alarms and increased hit error on the opening tests. Miss rates are slightly reduced but the greater increase in false alarm rate diminishes the overall performance results.

V. CONCLUSION
We have shown that robust detection of GCIs and GOIs from EGG signals is particularly challenging at the transition regions around the ending of voicing. A new method for glottal activity detection from EGG recordings has been presented which is accurate even in these challenging regions. It first detects singularities in the EGG signal by the multiscale product of three dyadic scales. It then employs a technique based upon the group delay function which detects peaks in the multiscale product. Incorrect estimates are removed by the clustering of three-dimensional feature vectors using the EM algorithm. Postprocessing removes isolated GCIs and uses GCIs to aid GOI detection.
A comparison was made between the proposed approach and two popular existing methods by evaluating their performance against 50 short and four long hand-labeled sentences. An existing testing procedure with some new enhancements was used, showing very accurate GCI and GOI detection with the proposed method, fulfilling our objective of obtaining results that are accurate enough to be used as a reference. Our method enables accurate evaluation of speech-based glottal activity detection algorithms, precise estimation of the closed phase for the estimation of glottal volume flow and could also be applied to the analysis of a number of types of pathological speech. Further, few assumptions are made about the nature of the input signal. This allows the application of the proposed algorithm to singularity detection in almost any signal provided the minimum separation of singularities is known.

REFERENCES
[1] P. Davies, G. A. Lindsey, H. Fuller, and A. J. Fourcin, “Variation of glottal open and closed phases for speakers of English,” Proc. Inst. Acoust., vol. 8, no. 7, pp. 539–546, 1986.
[2] W. Hess and H. Indefrey, “Accurate pitch determination of speech signals by means of a laryngograph,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), 1984, vol. 9, pp. 73–76.
[3] H. Valbret, E. Moulines, and J. P. Tubach, “Voice transformation using PSOLA technique,” Speech Commun., vol. 11, no. 2, pp. 175–187, Jun. 1992.
[4] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, “Multi-microphone speech dereverberation using spatio-temporal averaging,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Vienna, Austria, Sep. 2004, pp. 809–812.
[5] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, no. 5–6, pp. 453–467, Dec. 1990.
[6] M. R. P. Thomas, J. Gudnason, and P. A. Naylor, “Data-driven voice source waveform modelling,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 3965–3968.
[7] J. Deller, “Some notes on closed phase glottal inverse filtering,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP–29, no. 4, pp. 917–919, Aug. 1981.
[8] J. Gudnason and M. Brookes, “Voice Source cepstrum coefficients for speaker identification,” in Proc. IEEE Intl. Conf. Acoust., Speech, Signal Process. (ICASSP), 2008, pp. 4821–4824.
[9] R. Colton and J. Casper, Understanding Voice Problems: A Physiological Perspective for Diagnosis and Treatment. New York: Williams & Wilkins, 1996.
[10] K. Verdolini, R. Chan, I. R. Titze, M. Hess, and W. Bierhals, “Correspondence of electroglottographic closed quotient to vocal fold impact stress in excised canine larynges,” J. Voice, vol. 12, no. 4, pp. 415–423, Feb. 1998.
[11] J. Gamboa, F. J. Jiménez-Jiménez, A. Nieto, I. Cobeta, A. Vegas, M. Orti-Pareka, T. Gasalla, J. A. Molina, and E. Garcia-Albea, “Acoustic voice analysis in patients with essential tremor,” J. Voice, vol. 12, no. 4, pp. 444–452, Feb. 1998.
[12] E. R. M. Abberton, D. M. Howard, and A. J. Fourcin, “Laryngographic assessment of normal voice: A tutorial,” Clinical Linguist. Phon., vol. 3, pp. 281–296, 1989.
[13] D. G. Childers, D. M. Hooks, G. P. Moore, L. Eskenazi, and A. L. Lalwani, “Electroglottography and vocal fold physiology,” J. Speech. Hear. Res., vol. 33, no. 2, pp. 245–254, Jun. 1990.
[14] D. M. Howard, “Variation of electrolaryngographically derived closed quotient for trained and untrained adult female singers,” J. Voice, vol. 9, no. 2, pp. 121–1223, Jun. 1995.
[15] N. Henrich, C. d’Alessandro, M. Castellengo, and B. Doval, “On the use of the derivative of electroglottographic signals for characterization of nonpathological voice phonation,” J. Acoust. Soc. Amer., vol. 115, no. 3, pp. 1321–1332, Mar. 2004.


[16] M. A. Huckvale, “Speech Filing System: Tools for Speech,” Tech. Rep. Univ. College London, London, U.K., 2004 [Online]. Available: http://www.phon.ucl.ac.uk/resource/sfs
[17] A. Bouzid and N. Ellouze, “Multiscale product of electroglottogram signal for glottal closure and opening instant detection,” in Proc. IMACS MultiConf. Comput. Eng. Syst. Applicat., 2006, vol. 1, pp. 106–109.
[18] A. Bouzid and N. Ellouze, “Glottal opening instant detection from speech signal,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Vienna, Austria, Sep. 2004, pp. 729–732.
[19] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, “Estimation of glottal closure instants in voiced speech using the DYPSA algorithm,” IEEE Trans. Speech Audio Process., vol. 15, no. 1, pp. 34–43, Jan. 2007.
[20] M. R. P. Thomas, N. D. Gaubitch, and P. A. Naylor, “Multichannel DYPSA for estimation of glottal closure instants in reverberant speech,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Poznan, Poland, Sep. 2007.
[21] K. S. Rao, S. R. M. Prasanna, and B. Yegnanarayana, “Determination of instants of significant excitation in speech using Hilbert envelope and group delay function,” IEEE Signal Process. Lett., vol. 14, no. 10, pp. 762–765, Oct. 2007.
[22] K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 8, pp. 1602–1613, Nov. 2008.
[23] W. Saidi, A. Bouzid, and N. Ellouze, “Evaluation of multi-scale product method and DYPSA algorithm for glottal closure instant detection,” in Proc. 3rd Int. Conf. Inf. Commun. Technol.: From Theory to Applicat. (ICTTA), Apr. 2008, pp. 1–5.
[24] M. Brookes, P. A. Naylor, and J. Gudnason, “A quantitative assessment of group delay methods for identifying glottal closures in voiced speech,” IEEE Trans. Speech Audio Process., vol. 14, no. 2, pp. 456–466, Mar. 2006.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[26] M. B. Higgins and J. H. Saxman, “A comparison of selected phonatory behaviours of healthy aged and young adults,” J. Speech Hear. Res., vol. 34, pp. 1000–1010, Oct. 1991.
[27] A. K. Krishnamurthy and D. G. Childers, “Two-channel speech analysis,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 4, pp. 730–743, Aug. 1986.
[28] D. M. Howard and G. Lindsey, “Conditioned variability in voicing offsets,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 3, pp. 406–407, Mar. 1988.
[29] J. C. Catford, Fundamental Problems in Phonetics. Bloomington, IN: Indiana Univ. Press, 1977.
[30] Y. Lebrun and J. Hasquin-Deleval, “On the so-called ’dissociations’ between electroglottogram and phonogram,” Folia Phoniatrica, vol. 23, pp. 225–227, 1971.
[31] M. R. P. Thomas, J. Gudnason, and P. A. Naylor, “Application of the DYPSA algorithm to segmented time-scale modification of speech,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Lausanne, Switzerland, Aug. 2008.
[32] P. Jax and P. Vary, “On artificial bandwidth extension of telephone speech,” Signal Process., vol. 83, pp. 1707–1719, 2003.
[33] S. Mallat and S. Zhong, “Characterization of signals from multiscale edges,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 7, pp. 710–732, Jul. 1992.
[34] A. Bouzid and N. Ellouze, “Local regularity analysis at glottal opening and closure instants in electroglottogram signal using wavelet transform modulus maxima,” in Proc. Eurospeech, 2003, pp. 2837–2840.
[35] S. Mallat and W. L. Hwang, “Singularity detection and processing with wavelets,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 617–643, Mar. 1992.
[36] B. M. Sadler and A. Swami, “Analysis of multiscale products for step detection and estimation,” IEEE Trans. Inf. Theory, vol. 45, no. 3, pp. 1043–1051, Apr. 1999.
[37] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
[38] R. Smits and B. Yegnanarayana, “Determination of instants of significant excitation in speech using group delay function,” IEEE Trans. Speech Audio Process., vol. 5, no. 3, pp. 325–333, Sep. 1995.
[39] G. Lindsey, A. Breen, and S. Nevard, “SPAR’s Archivable Actual-Word Databases,” Tech. Rep. Univ. College London, London, U.K., 1987.
[40] D. Chan, A. Fourcin, D. Gibbon, B. Granstrom, M. Huckvale, G. Kokkinakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld, and J. Zeiliger, “EUROM – A spoken language resource for the EU,” in Proc. Eur. Conf. Speech Commun. Technol., Sep. 1995, pp. 867–870.

Mark Thomas (S’06) received the M.Eng. degree in electrical and electronic engineering from Imperial College, London, U.K., in 2006 where he is currently pursuing the Ph.D. degree.
His research interests include glottal-synchronous and multichannel speech processing, involving methods for analysis, prosodic manipulation and reverberation/noise reduction. His previous experience in industry was with the BBC R&D Department, where he worked on audio, video, and RF engineering.

Patrick Naylor (M’89–SM’07) received the B.Eng. degree in electronics and electrical engineering from the University of Sheffield, Sheffield, U.K., in 1986 and the Ph.D. degree from Imperial College, London, U.K., in 1990.
Since 1989, he has been a Member of Academic Staff in the Communications and Signal Processing Research Group, Imperial College London, where he is also Director of Postgraduate Studies. His research interests are in the areas of speech and audio signal processing and he has worked in particular on adaptive signal processing for acoustic echo control, speaker identification, multichannel speech enhancement, and speech production modeling. In addition to his academic research, he enjoys several fruitful links with industry in the U.K., U.S., and in mainland Europe.
Dr. Naylor is an Associate Editor of IEEE SIGNAL PROCESSING LETTERS and a member of the IEEE Signal Processing Society Technical Committee on Audio and Electroacoustics.
