
Article
Mapping Phonation Types by Clustering of Multiple Metrics
Huanchen Cai * and Sten Ternström

Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science,
KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
* Correspondence: huanchen@kth.se

Featured Application: Categorical voice assessment based on a classification algorithm that can
assist clinicians by visualizing phonation types on a 2-D voice map.

Abstract: For voice analysis, much work has been undertaken with a multitude of acoustic and
electroglottographic metrics. However, few of these have proven to be robustly correlated with
physical and physiological phenomena. In particular, all metrics are affected by the fundamental
frequency and sound level, making voice assessment sensitive to the recording protocol. It was
investigated whether combinations of metrics, acquired over voice maps rather than with individual
sustained vowels, can offer a more functional and comprehensive interpretation. For this descriptive,
retrospective study, 13 men, 13 women, and 22 children were instructed to phonate on /a/ over
their full voice range. Six acoustic and EGG signal features were obtained for every phonatory cycle.
An unsupervised voice classification model created feature clusters, which were then displayed on
voice maps. It was found that the feature clusters may be readily interpreted in terms of phonation
types. For example, the typical intense voice has a high peak EGG derivative, a relatively high contact
quotient, low EGG cycle-rate entropy, and a high cepstral peak prominence in the voice signal, all
represented by one cluster centroid that is mapped to a given color. In a transition region between
the non-contacting and contacting of the vocal folds, the combination of metrics shows a low contact
quotient and relatively high entropy, which can be mapped to a different color. Based on this data set,
male phonation types could be clustered into up to six categories and female and child types into
four. Combining acoustic and EGG metrics resolved more categories than either kind on their own.
The inter- and intra-participant distributional features are discussed.

Keywords: voice map; phonation type; EGG; classification
1. Introduction
By ear, humans can usually distinguish between different types of phonation. A voice
can be perceived as ‘breathy,’ ‘tense,’ ‘hoarse,’ and so on. This is known in voice science
as the phonation type. In some languages, phonation types are used as suprasegments to
distinguish phonological meanings [1–3]. Phonation type is also used to express paralinguistic
information, such as the creaky voice to express irritation in Vietnamese [4]. Such phonation
types are manifested by the shape of the transglottal airflow, as modulated by the activation
of the laryngeal muscles and by respiratory effort. Phonation types can be seen as distributed
along one continuum from voiceless (no vocal fold contacting) to glottal closure (full vocal
fold contacting) [5], and along at least one other continuum of the regularity of vibration.
Sundberg [6] defined ‘phonation modes’ by the waveform and amplitude of vocal fold (VF)
vibration.

In the practice of voice professionals, the subjectively perceived voice quality is still
a reference standard for assessing outcomes. Quantitative metrics in isolation are weak
at assessing changes in vocal status. This both encourages researchers to improve the
metrics and suggests that the characterization of one phonation mode might become more
specific if several metrics are collated. What we are trying to do here is to offer to voice
professionals, e.g., speech clinicians or voice teachers, a representation of the quantitative
and objective measurements of phonation type in a 2D and interactive way that they can
relate to their practice.

1.1. Classifying Phonation Types


To classify phonation into various types, many metrics and methods have been tested.
These time-domain or frequency-domain metrics reflect attributes of the airborne voice signal.
Glottal inverse filtering (GIF) [7] estimates the glottal flow waveform; H1–H2, proposed by
Hillenbrand [8], is the amplitude difference between the first and second harmonics; Gowda
et al. [9] used the low-frequency spectral density (LFSD) to distinguish breathy, modal, and
pressed phonation; and so on.
Electroglottography (EGG) is a relatively low-cost and non-invasive method for
representing the contacting of the vocal folds, as compared to endoscopic imaging. Since
the EGG signal can represent the changing contact area of the oscillating vocal folds, while
neither impeding nor being much affected by articulation, the EGG is often used to
characterize phonatory behavior, sometimes as an ancillary diagnostic tool in clinical
practice. A few EGG metrics have been widely used, such as the open quotient (OQ),
the contact quotient (CQ), and the speed quotient (SQ). Additionally, some researchers
have used other signals, such as neck-surface acceleration [10]. However, the physiological
interpretation of these metrics [11], that is, in what manner these artificially selected and
defined metrics are correlated with physiological structures or movements, is still
incompletely described.
Some approaches have successfully classified phonation types using multiple metrics
combined with signal processing methods [12] and algorithms such as Gaussian mixture
models, random forests, and deep neural networks (DNNs) [7], or using classification on
the EGG waveform itself [7,13]. These models have shown better discrimination than
attempts that use one particular metric for detecting phonation types. However, the
interpretation of some EGG metrics remains to be elaborated upon.

1.2. Definitions and Metrics


The basic concepts of voice mapping are presented by Ternström and Pabon [14].
Here, we briefly recapitulate an example of visually representing multiple metrics across
the 2D voice field. Its dimensions are the fundamental frequency fo (horizontal axis, in
semitones) and the signal level in dB (vertical axis, SPL re 20 µPa at 0.3 m), which serve as
the two independent variables. Note that both axes are on a logarithmic scale. Each metric
is represented as a scalar field or curved surface above the plane of SPL × fo and has the
role of a dependent variable. In this study, all metrics were analyzed for voiced segments
only, the criterion for voicing being that the autocorrelation of the acoustic signal at one
cycle delay must exceed a threshold of 0.96.
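For illustration, this voicing criterion can be sketched in a few lines of code. Here and in the sketches below, we use Python purely for illustration; the study itself used FonaDyn and MATLAB (Section 2.3). The function name and the assumption of a known cycle length are ours:

```python
import numpy as np

def is_voiced(frame: np.ndarray, period_samples: int, threshold: float = 0.96) -> bool:
    """Voicing criterion as stated above: the normalized autocorrelation of the
    acoustic signal at a delay of one cycle must exceed a threshold (0.96).
    `frame` should contain at least two phonatory cycles."""
    x = frame - np.mean(frame)
    a, b = x[:-period_samples], x[period_samples:]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0.0:
        return False
    return np.dot(a, b) / denom > threshold
```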
The metric is sampled at integer values of semitones and decibels, i.e., in the ‘cells’ of
the map. Each cell contains the average of the metric for all visits to that cell (Figure 1).
Each metric is visualized on its own ‘layer’ in a multi-layer map. The six metrics selected
for this example and for the present study are the following. The first three are metrics of
the EGG signal, and the last three pertain to the acoustic signal.
Figure 1. Six layers of a voice map, each layer showing values of one metric, averaged per cell.
In the top row are three EGG metrics, and in the bottom are three acoustic metrics. An amateur choir
baritone sang messa di voce on the vowel /a:/, with increasing and decreasing SPL on tones spanning
just over one octave. Data from Selamtzis and Ternström [15].
The quotient of contact by integration (Qci) is the area under the EGG pulse that has
been normalized to 1 in both time and amplitude. The Qci was computed according to [16].
Like other metrics of the contact quotient, Qci can vary from nearly zero for a narrow EGG
pulse to nearly 1 for a broad EGG pulse. The Qci thus represents the relative amount of
contacting during a cycle. The exception is when the vocal folds are oscillating without
contacting, in which case the EGG becomes a low-amplitude sine wave, and so the area
under the normalized pulse (Qci) approaches 0.5. This commonly happens in a soft and/or
breathy voice.

The normalized peak derivative (Q∆) is the maximum derivative of an EGG signal; it
represents the maximum rate of contacting during closure. Q∆ was computed as in [16]:

Q∆ ≈ 2δmax / (Ap−p · sin(2π/T))    (1)

where T is the period length in integer samples, Ap−p is the peak-to-peak amplitude, and
δmax is the largest positive difference observed over the period T between two consecutive
sample points in the discretized EGG signal. In phonation without vocal fold collision, the
EGG becomes a low-amplitude sine wave. Hence, the minimum value that Q∆ can assume
is the peak derivative of a normalized sine wave, which is 1.

The cycle-rate sample entropy (CSE) metric is low for a regular, self-similar signal and
high when a signal is transient, erratic, or noisy. A threshold parameter is set to keep

its value at zero when the phonation is stable, even with pitch changing. When something
‘abnormal’ happens, such as a voice break between falsetto and modal voice, the CSE peaks.
CSE represents the cycle-to-cycle instability of the EGG waveform. The CSE was computed
as described in [17].
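To make the contacting metrics concrete, the two quotients can be sketched as below. This is our own minimal reading of the definitions above and of Equation (1), not the implementation of [16] or FonaDyn; the CSE is omitted here, since sample entropy requires more machinery (see [17]):

```python
import numpy as np

def qci(egg_cycle: np.ndarray) -> float:
    """Quotient of contact by integration: the area under one EGG cycle after it
    is normalized to 1 in both amplitude and duration. For a non-contacting
    (near-sinusoidal) cycle this tends toward 0.5, as noted above."""
    y = egg_cycle - egg_cycle.min()
    if y.max() == 0.0:
        return 0.0
    y = y / y.max()               # amplitude normalization to 0..1
    return float(np.mean(y))      # mean == area under the time-normalized pulse

def q_delta(egg_cycle: np.ndarray) -> float:
    """Normalized peak derivative, Equation (1):
    Q_delta ~= 2*delta_max / (A_pp * sin(2*pi/T)), where delta_max is the largest
    positive difference between consecutive samples over the cycle. For a pure
    sine wave the result is ~1, which is the minimum value."""
    T = len(egg_cycle)
    a_pp = float(egg_cycle.max() - egg_cycle.min())
    if a_pp == 0.0 or T < 3:
        return 1.0
    delta_max = float(np.max(np.diff(egg_cycle)))
    return 2.0 * delta_max / (a_pp * np.sin(2.0 * np.pi / T))
```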
The audio crest factor is computed as the ratio of the peak amplitude to the RMS
amplitude for every phonatory cycle. It is a simple indicator of the relative high-frequency
energy of the voice signal [18]. It increases with the abruptness of glottal excitation and
is indirectly related to the maximum first derivative of glottal flow (the maximum flow
declination rate, or MFDR). The crest factor is highest in a strong or creaky voice, with
sparse transients in the glottal excitation and high damping of the vocal tract resonances.
This is typically the case at low fo and high SPL on open vowels, such as /a/. At all but the
lowest signal-to-noise ratios, the crest factor is insensitive to noise. The crest factor is not
meaningful in very resonant conditions, when a vocal tract resonance frequency is closely
matched by that of a harmonic.
The spectrum balance (SB) is here defined as the difference in the acoustic power level
(dB) above 2 kHz and below 1.5 kHz:

SB = 10·log10(W>2kHz / W<1.5kHz)  (dB)    (2)

The choice of 1.5 kHz was motivated by the desire to classify formants 1 and 2 of
the vowel /a/ as part of the low band, even with female and child voices. The SB is
indirectly related to the maximum second derivative of glottal flow [19]. The spectrum
balance usually increases (becomes less negative) as the voice power output increases.
It also increases when the relative amount of noise in the signal increases, regardless of
whether that noise originates in the voice or in the audio chain.
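Both acoustic metrics are straightforward to sketch (our own Python illustration of the definitions above and of Equation (2); the use of a Welch spectrum estimate is our choice, not the paper's):

```python
import numpy as np
from scipy.signal import welch

def crest_factor_db(cycle: np.ndarray) -> float:
    """Crest factor: peak amplitude over RMS amplitude for one phonatory cycle,
    expressed here in dB (Table 1 lists a typical range of 3..12 dB).
    Assumes a non-silent cycle."""
    rms = np.sqrt(np.mean(cycle ** 2))
    return 20.0 * np.log10(np.max(np.abs(cycle)) / rms)

def spectrum_balance_db(frame: np.ndarray, fs: float) -> float:
    """Spectrum balance, Equation (2): 10*log10 of the acoustic power above
    2 kHz over the power below 1.5 kHz."""
    f, psd = welch(frame, fs=fs, nperseg=min(1024, len(frame)))
    w_high = np.trapz(psd[f > 2000.0], f[f > 2000.0])
    w_low = np.trapz(psd[f < 1500.0], f[f < 1500.0])
    return 10.0 * np.log10(w_high / w_low)
```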
The cepstral peak prominence smoothed (CPPs) is a measure of regular periodicity in the
acoustic signal. The higher the CPPs, the more regularity and the less noise in the audio
signal. The actual value of the CPPs is highly dependent on the setting of the smoothing
parameters, and on the accurate selection of voiced non-fricated speech sounds. Here, the
calculation of smoothed Cepstral Peak Prominence (CPPs) follows Awan et al. [20] (seven
frames in time, 11 bins in quefrency), which gives lower values of the CPPs than in many
other studies.
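For orientation only, a much-simplified CPPs computation might look as follows. This is a sketch of the general scheme (log spectra, cepstra, smoothing over 7 frames and 11 quefrency bins, peak height above a regression line); real implementations, including that of Awan et al. [20], differ in windowing, smoothing, and peak-picking details, and every parameter below is our assumption:

```python
import numpy as np

def cpps_sketch(frames: np.ndarray, fs: float, fo_min=60.0, fo_max=880.0) -> float:
    """Simplified smoothed cepstral peak prominence. `frames` has shape
    (n_frames, frame_len); assumes n_frames >= 7 and frame_len > fs/fo_min."""
    n = frames.shape[1]
    spec = np.abs(np.fft.fft(frames * np.hanning(n), axis=1))
    log_spec = 20.0 * np.log10(np.maximum(spec, 1e-12))
    ceps = np.abs(np.fft.fft(log_spec, axis=1)) / n     # quefrency bin k = k/fs s
    for axis, width in ((0, 7), (1, 11)):               # smooth: 7 frames, 11 bins
        kernel = np.ones(width) / width
        ceps = np.apply_along_axis(np.convolve, axis, ceps, kernel, mode='same')
    avg = ceps.mean(axis=0)
    q = np.arange(n) / fs                               # quefrency axis in seconds
    lo, hi = int(fs / fo_max), int(fs / fo_min)         # search band for the peak
    cep_db = 20.0 * np.log10(np.maximum(avg, 1e-12))
    slope, intercept = np.polyfit(q[lo:hi], cep_db[lo:hi], 1)
    k_peak = lo + int(np.argmax(cep_db[lo:hi]))
    return cep_db[k_peak] - (slope * q[k_peak] + intercept)
```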
In summary, for both the acoustic and EGG signals, we have chosen two metrics, where
the second is related to the time derivative of the first, and a third metric that represents
the degree of the (ir-)regularity of the phonation.

1.3. Basis for Clustering


To explain our rationale for clustering, we first visualize how one of the metrics varies
with SPL only. In the example of Figure 1, the ubiquitous upward trend with increasing
fo happens to be close to linear (although this is not always the case). To simplify the
visualization, we now factor out this dependency, so as to consider only the variation
with SPL, or rather, with the new ‘fo-detrended’ SPL. Figure 2 shows this operation for
the Q∆ metric (the normalized peak dEGG). It can be seen that in soft phonation (<65 dB
adjusted SPL), the Q∆ metric is close to its minimum value of 1 (green), and as vocal fold
collision sets in, it increases abruptly. If we were to map other metrics, they too are aligned
intrinsically with the fo and SPL, but each metric varies differently in different regions of
the voice map.
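The detrending itself amounts to a shear of the map, as in the following minimal sketch. It assumes that the trend is linear, as it happens to be in this example; the slope (in dB per semitone) would be estimated from the map itself, and all names here are our own:

```python
import numpy as np

def adjust_spl(fo_st: np.ndarray, spl_db: np.ndarray,
               slope_db_per_st: float, fo_ref: float = 60.0) -> np.ndarray:
    """'fo-detrending' as in Figure 2(b): shift each observation's SPL so that
    the (roughly linear) trend line in the fo-SPL plane becomes horizontal."""
    return spl_db - slope_db_per_st * (fo_st - fo_ref)
```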
Figure 2. (a) A map of an exemplary metric Q∆ (the per-cycle-normalized peak derivative of the
EGG signal), typically skewed with increasing fo. The black line shows the average of Q∆ taken
horizontally over all semitones. Its abscissa is the vertical SPL axis; its ordinate is on the horizontal
color bar. (b) The same map with the dependency on fo removed by adjusting the SPLs so as to
make the dashed trend line horizontal. The increase in Q∆ is now more sharply focused, on the scale
of adjusted SPL.
We now create a graph similar to the black curve in Figure 2 but of all the six metrics
that are objects of the present study (Figure 3). Here, the six metrics have been brought to
a suitable ratio scale or log scale and scaled to 0%…100% of their maximum expected
range of variation in anticipation of statistical clustering. The metrics Qci and log10 (Q∆) are
already in the range 0…1 and thus need no pre-scaling. For the following example, each
metric has also been smoothed, with a 5 dB running average on the SPL axis.
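As an illustration, the pre-scaling and smoothing can be sketched as follows. The expected ranges are those later listed in Table 1 (Section 2.3); the simple windowing implementation is our own:

```python
import numpy as np

# Expected metric ranges (cf. Table 1 in Section 2.3), used to scale each
# metric to 0..100% of its anticipated variation before clustering.
RANGES = {"Qci": (0.0, 1.0), "log10(Qdelta)": (0.0, 1.0), "CSE": (0.0, 10.0),
          "crest [dB]": (3.0, 12.0), "SB [dB]": (-40.0, 0.0), "CPPs [dB]": (0.0, 15.0)}

def scale_percent(x: np.ndarray, metric: str) -> np.ndarray:
    """Linear mapping of one metric onto 0..100% of its expected range."""
    lo, hi = RANGES[metric]
    return 100.0 * (np.asarray(x) - lo) / (hi - lo)

def smooth_over_spl(values: np.ndarray, spl: np.ndarray, width_db: float = 5.0):
    """5 dB running average along the (adjusted) SPL axis, as used for Figure 3."""
    return np.array([values[np.abs(spl - s) <= width_db / 2].mean() for s in spl])
```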
In Figure 3, five subranges can be qualitatively discerned on the axis of adjusted SPL,
following a rationale that we will now discuss in detail. Consider first range A (50…60
dB), of very soft and breathy phonation. Here, the EGG cycle-rate sample entropy (CSE)
is high, meaning that the EGG wave shape is unstable. The quotient of contact by integra-
tion (Qci) is 0.5 (50%) because there is no vocal fold contact; therefore, the EGG signal is of
very low amplitude and the average EGG wave shape is sinusoidal, and thus the area
under the normalized wave shape is 0.5. The spectrum balance (SB) is relatively high but
descending, as the low-frequency band energy rises above some noise floor (which could
be due to aspiration noise, low-level system noise, or both). The cepstral peak prominence
smoothed (CPPs) is low, due to irregular vibration, noise, or both, but rises as the perio-
dicity of the acoustic waveform gradually increases. The crest factor is low, indicating a
lack of transients in the acoustic waveform. The normalized peak EGG derivative (Q∆) is
at its minimum because there is no vocal fold contact.

Figure 3. Variation of six metrics over the SPL range shown in Figure 2. The broken lines refer to
acoustic metrics and the solid lines to EGG metrics. The adjusted SPL axis can be visually divided
into five subranges based on the metric values. Range A: soft phonation without vocal fold contact;
B: transition zone where vocal fold contact sets in; C: ‘loose’ phonation but not entirely stable; D: ‘firm’
phonation with high stability; E: ‘hard’ phonation, approaching saturation in most metrics.
Range B (60 . . . 70 dB) is a transition zone where vocal fold contact is setting in. This is
most evident from the sharp rise in Q∆ and a fairly steep drop in CSE. Qci now drops from its
uncontacted 50% state into a range with just a little contact in each cycle, about 30%. The SB
reaches its minimum as the spectrum becomes more dominated by the fundamental partial.
The CPPs continues to climb as phonation becomes more stable and less breathy. The crest
factor is starting to increase.
Range C (70 . . . 79 dB) might be termed ‘loose’ phonation: the CSE is approaching
zero, meaning that only a small instability remains in the EGG waveform; conversely, the
CPPs is now picking up rapidly, indicating increased stability in the acoustic waveform.
Qci is turning around and starts to rise as adduction increases. The (log of) Q∆ , or the peak
rate of contacting, has almost reached its maximum and will increase only slowly from here.
The harmonic content of the acoustic signal is now increasing, as evidenced by a steady
growth both in the SB and in the crest factor.
In range D (79 . . . 93 dB), phonation is ‘firm’ and stable. The CSE is zero, and the CPPs
is near its maximum. The Qci continues to increase steadily, together with the SB and the
crest factor, while the Q∆ has settled at a high value.
Finally, in range E (>93 dB), the acoustic metrics level out, or even reverse, in a manner
that is reminiscent of spectral saturation [21]. Only Qci still manages to increase a little,
implying a slight increase in adduction. This range might perhaps be called ‘hard’ phonation.
This example is intended to show how we might infer with some confidence the character
of the phonation, even though we have access only to external signals and not to endoscopic
imaging or internal aerodynamics. The changes with increasing SPL appear to comply
consistently in all six metrics with the proposed phonatory scenarios, A to E. This particular
example was chosen because, in this task and with this subject, the fo dependency of most
metrics was simple enough to be factored out by a linear transform. This is not generally
the case: in most participants, when phonating over the full voice range, the metrics
exhibit considerable non-linear variation also with fo, which is challenging to
visualize effectively. This is where clustering becomes useful, as we hope to demonstrate.
Clearly, the above observations suggest the possibility of classifying phonation types
by clustering combinations of metrics. From A through E, those combinations have pro-
vided essential phonetic information. Though they will require independent physiological
evidence for their validation, the different phonatory characters, when played back selec-
tively, are also (informally) audible to the authors. The clustering of these six (and/or any
other) metrics in combination will allow us to plot the outcomes not just along the SPL axis
but across the 2D voice field, so as to also visualize dependencies on f o .

1.4. Problem Formulation


We now ask whether unsupervised learning can succeed at a partitioning of the voice
range that can be motivated in terms of phonatory function, as we demonstrated here. We
first examine the outcomes from the clustering algorithm for different values of the number
k of clusters. Then, we compare the discriminatory power of the acoustic and EGG metrics
separately and in combination. Lastly, we try to find an optimal cluster number k for the
three groups of voices, i.e., men, women, and children in the present data set. A perceptual
validation will be a necessary later step, but it is not part of the present study.

2. Methods
2.1. Data Acquisition
This retrospective study used data previously collected by Patel and Ternström [22],
who describe the data acquisition in greater detail. In total, 22 pre-pubertal children (9 boys,
13 girls) aged 4–8 years (mean 6.3, i.e., before mutation), and 26 vocally healthy adults (13 men,
13 women) aged 22–45 years (mean 28.7) were recruited (data from one female were removed
due to background noise). None of these participants reported vocal disorders or speech
complaints. Each participant gave informed consent before the recording.
Simultaneous recordings were made of the EGG and airborne voice (44.1 kHz/16 bits)
in a quiet recording studio. The SPL was calibrated in dB, at 0.3 m relative to 20 µPa.
A bespoke voice analysis software, FonaDyn v1.5 [23], was used for real-time monitoring,
visual feedback to the participant, recording, and data pre-processing.
Habitual speech for the speech range profile (SRP) was elicited for the adult participants
by three readings of the Rainbow Passage, while the children spoke spontaneously of a drawing
of a playground scene. The participants were then instructed to phonate on /a/ over their full
voice range (the voice range profile, VRP) until as large an area of cells (SPL × f o ) as possible
was populated.
The VRP was elicited following a protocol [24,25] by recording sustained phonation
and glissandi first in a soft voice at the lowest, then at a comfortable, and finally at the
highest fo . The participants first filled the map in the lower vocal intensity regions, then
the upper regions, and so on, until a connected contour in this map was obtained. After
completing the main parts of the map, the participants were asked to expand the upper
and lower contours, with a live voice map in FonaDyn as visual feedback, by varying vocal
loudness and fundamental frequency.

2.2. The Choice of Observation Sets


In this original data set, the i-th observation Ci is a list of several parameters, denoted as:

Ci = (fo, SPL, mi1, mi2, ..., min)    (3)

where n is the number of metrics (EGG and/or acoustic metrics, denoted as m). The grand
total number of cycles is 3,732,124. The first two entries, fo and SPL, are the indices for
each observation on the voice map. fo is in semitones on the MIDI scale (60 = 261 Hz) and
ranges from 30 to 90. SPL is in the range of 40–120 dB. We define a cell on the voice map as
1 semitone × 1 dB.
We then define two kinds of observations: (a) the cycle-by-cycle observations of metrics
m1, ..., mn extracted directly from each phonated cycle are denoted as Ci^cycle, defined as:

Ci^cycle = [mi1, mi2, ..., min],    (4)

and (b) the cell-by-cell averaged observations Cj^cell on the integer grid of fo × SPL,

Cj^cell = [m̄j1, m̄j2, ..., m̄jn],    (5)

m̄jn = (1/p) · Σ(a=1..p) man,    (6)

where p is the count of observations that lie inside each grid unit. The grand total number
of cells is 43,235. Each cell typically contains a mean of some 100 cycles.
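A minimal sketch of this binning, following Equations (5) and (6); the function name and data layout are our own assumptions:

```python
import numpy as np

def bin_cycles_to_cells(fo_st: np.ndarray, spl_db: np.ndarray, metrics: np.ndarray):
    """Cell-by-cell averaging, Equations (5)-(6): cycles are binned into
    1 semitone x 1 dB cells on the fo x SPL grid, and each cell holds the mean
    of each metric over the p cycles that visited it.
    `metrics` has shape (n_cycles, n_metrics)."""
    cells = {}
    for i in range(len(fo_st)):
        key = (int(round(fo_st[i])), int(round(spl_db[i])))  # integer grid cell
        cells.setdefault(key, []).append(metrics[i])
    return {key: np.mean(np.vstack(rows), axis=0) for key, rows in cells.items()}
```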
In (a), each recording provides on the order of 100,000 observations of phonatory
cycles per participant, while in (b), we have one observation per cell. On a map of the
full voice range, there are on the order of 1000 cells, where each cell contains the metric
averages for a number of cycles, which is on the order of 100. The latter method is much
faster, but the per-cell averaging causes some loss of information. The tradeoff in quantity
and computational time was assessed by comparing the classification results from these
two types of observations.
One set of K-means centroids was trained on the cycle-by-cycle observations, assigning
a cluster number to each cycle. On the subsequent binning into cells, however, a given
cell can receive points from more than one cluster, because the same f o and SPL can be
produced with different kinds of phonation. In this case, with cycle-by-cycle clustering, the
cell is displayed as belonging to the cluster that attracted the majority of cycles in that cell.
For the faster method of clustering, each cell is directly assigned a cluster number on the
basis of the metric means that it contains.
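The majority rule for the cycle-by-cycle case can be sketched as follows (our own illustration; the FonaDyn and MATLAB specifics differ):

```python
import numpy as np

def dominant_cluster_per_cell(cell_keys, cycle_labels):
    """For cycle-by-cycle clustering, a cell may collect cycles from several
    clusters; the cell is then displayed as the cluster that attracted the
    majority of its cycles. `cell_keys[i]` is the (fo, SPL) cell of cycle i,
    and `cycle_labels[i]` its cluster number."""
    votes = {}
    for key, label in zip(cell_keys, cycle_labels):
        votes.setdefault(key, []).append(label)
    return {key: int(np.bincount(labels).argmax()) for key, labels in votes.items()}
```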
The agreement of the two methods for assigning cells to clusters was 94% (±2%, p < 0.05).
This degree of agreement is acceptable because, in the default voice map, the f o ranges from
30 to 90 in semitones, and the SPL ranges from 40 to 120 in dB, i.e., a map of 60 × 80 units.
An error rate of 6% on the sporadically distributed map is not easily discernible by the naked
eye. Additionally, we are less concerned with the cluster to which any particular SPL × fo cell
is assigned; rather, the focus should be on the actual distributions and on changes in the
distributions and centroids, so as to assess the stability of the K-means algorithm.
Computations were much faster when using the voice map cell-by-cell representation
rather than the data from the individual cycles. Even though K-means is known for its high-
speed calculation, it is less efficient when the total number of observations reaches millions,
which it does here when clustering groups of participants cycle-by-cycle. The chosen voice
map representation instead holds one observation per cell, resulting in about a thousand
observations per participant for a full voice range.

2.3. Computation of the Primary Metrics


For the computation of the individual metrics, the software FonaDyn v2.4.0 was used
in two ways: (1) for computing data from every phonatory cycle on a timeline, and (2) for
mapping the metric averages onto the voice field. FonaDyn computes numerous acoustic and
EGG metrics and also clusters or classifies them in real time—but not faster. For this study,
the subsequent clustering and classification were performed offline in MATLAB for speed
and because the real-time clustering of incoming data will drift over the course of a recording.
Each voice metric is a dependent variable that defines a scalar field over the 2D voice
field. Each metric is registered in its own ‘layer’ in a multi-layered voice map [26]. Table 1
shows the input metrics for the voice map.

Table 1. The chosen EGG and acoustic metrics as inputs and their typical ranges.

                   Symbol         Definition                                          Range
EGG metrics        Qci            quotient of contact by integration                  0 … 1
                   Q∆             normalized peak derivative                          1 … 10
                   CSE            cycle-rate sample entropy                           0 … 10
Acoustic metrics   crest factor   ratio of the peak amplitude to the RMS amplitude    3 … 12 dB
                   SB             spectrum balance                                    −40 … 0 dB
                   CPPs           cepstral peak prominence smoothed                   0 … 15 dB

For presentation, the metrics are averaged in each SPL × f o cell, cycle-by-cycle.
The minimum number of cycles per cell presented in the voice map is set to 5, suppressing
outliers and reducing the confidence interval for the mean in each cell.
In addition to the six metrics mentioned above, FonaDyn also produces descriptive
metrics (total number of cycles) and the Fourier descriptors (magnitude and phase) of each
EGG cycle for the ten lowest harmonics. All of these were subjected to principal component
analysis (PCA).
It was found that the six chosen metrics together explained 91.5% of the variance
in total and that each of them exhibited an explained variance ratio of more than 1%. In
contrast, other metrics such as first harmonic, second harmonic, and so on, and also their
differences (H1–H2, etc.) [8] did not perform as well in this data set, giving an explained
variance ratio of less than 0.1% per metric.
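The screening step could be reproduced along the following lines (a hedged sketch using scikit-learn's PCA; the matrix shape and placeholder data are ours, and relating components back to individual input metrics would go through the loadings):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: one row per observation, one column per candidate metric (e.g., the six
# chosen metrics plus the EGG Fourier descriptors). Placeholder data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 26))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize before PCA
pca = PCA().fit(Xs)
print(pca.explained_variance_ratio_)           # variance explained per component;
# pca.components_ (the loadings) then indicates which input metrics carry the
# variance, so that negligible contributors (< 1% here) can be dropped.
```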

2.4. Choice of Classification Method


The criteria for choosing which algorithm to use were: (1) it should be compatible with
the data structures and features of the present study; hence, algorithms including density-
based and hierarchy-based models are not appropriate for this data set since the metrics are
not linearly distributed; (2) it should give interpretable and structured distributions in the
voice map. The Gaussian mixture model and the DNN model showed good predictability
but have not yet demonstrated well-structured voice maps in our pilot experiments; (3) it
must be possible to run in real-time and be computationally inexpensive, so as to afford
visual feedback with negligible delays. Of two cluster voice maps, we take the more
structured and less sparse one to have been made with the better algorithm.
We applied the K-means++ algorithm following the routine described by Arthur and
Vassilvitskii [27]. We choose an initial center c1 at random from the data set, then select the
next centroid with probability

d²(xm, c1) / Σ(j=1..n) d²(xj, c1),    (7)

where the distance between c1 and the observation xm is denoted as d(xm, c1). Then, the
Euclidean distances from one observation to each centroid are computed until the algorithm
converges to an optimum.
A fuzzy clustering algorithm based on distributions would be a possible alternative.
However, in this study, the data points in each cell have a probability distribution because
each cell can contain many cycles, not all of which will belong to the same cluster. In that
sense, this K-means can be regarded as an alternative implementation of fuzzy clustering.
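A sketch of the seeding rule follows (our Python rendering of [27]; note that in the full K-means++ algorithm the distance in Equation (7) is taken to the nearest of all centroids chosen so far, not only to c1):

```python
import numpy as np

def kmeanspp_seed(X: np.ndarray, k: int, rng=np.random.default_rng()):
    """K-means++ seeding [27]: the first centroid is drawn uniformly at random;
    each further centroid is drawn with probability proportional to the squared
    distance to the nearest centroid already chosen (Equation (7))."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```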

3. Results
3.1. The Optimal Number of Clusters
Now the question arises as to how many clusters k are optimal for describing the
variation in the data. Intuitively, we expect k to be limited to a small range that, to
some extent, corresponds to the phonation categories perceived by listeners.
A frequently used method for determining the optimum number of clusters is the
Bayesian Information Criterion (BIC). The BIC approximates the large-sample limit of the
Bayes estimator, which leads to the selection of the model that is the most probable. It
assumes that the data observations are independent and identically distributed. The BIC
provides no more information than the model itself, but selects among several competing
models by penalizing their complexity. The BIC algorithm was improved by Teklehaymanot
et al., whose variant shows better performance and goodness of fit on a large data set with
many data points per cluster.
Here, the higher the BIC value, the better the tradeoff between the number of clusters
and the performance, thus avoiding overfitting.
While keeping the configuration of the K-means model constant, we iterate the cluster
number k from 1 to 10. Figure 4 shows the BIC tendency across females, males, and children
with all six metrics. The BIC approaches its maximum at k = 6 clusters, then remains constant
for males, while for females and children, the same occurs for k = 4. Notice that the absolute
values of BIC across the three groups do not imply an actual difference in information because
the amount of data points is not normalized across groups. Here, the BIC suggests the optimal
value of the cluster number for the model using six chosen metrics to be 4 to 6. A larger k will
be penalized by the BIC algorithm, in order to avoid overfitting.
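To make the selection procedure concrete, the following sketch sweeps k from 1 to 10 with a generic spherical-Gaussian BIC for K-means. It is a simplified stand-in for the Teklehaymanot et al. variant actually used in this study, and the placeholder data and all names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bic(X: np.ndarray, k: int) -> float:
    """BIC of a K-means solution under a spherical-Gaussian model.
    Higher is better; the last term penalizes model complexity."""
    n, d = X.shape
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    counts = np.bincount(km.labels_, minlength=k).astype(float)
    sigma2 = km.inertia_ / (d * max(n - k, 1))       # pooled per-dim variance
    log_lik = (np.sum(counts * np.log(np.maximum(counts, 1) / n))
               - 0.5 * n * d * np.log(2 * np.pi * sigma2)
               - 0.5 * km.inertia_ / sigma2)
    n_params = k * (d + 1)                           # centroids + mixing weights
    return log_lik - 0.5 * n_params * np.log(n)

# Sweep k = 1..10 and take the argmax, as in Figure 4 (placeholder data):
X = np.random.default_rng(1).normal(size=(2_000, 6))
bic = [kmeans_bic(X, k) for k in range(1, 11)]
print(int(np.argmax(bic)) + 1)
```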

Figure 4. BIC value, the maximum of which indicates the optimum number of clusters when using
all six voice metrics. The circle marker represents the optimum for male voices of 6 clusters. For both
female (star marker) and child voices (square marker), the optimum is 4 clusters.
Compared to the models with all six metrics, models with only one category of metrics
(acoustic or EGG) show a different tendency in Figure 5. The models with only acoustic
metrics reach a local maximum at the beginning of the BIC curves: k = 2 for children,
k = 3 for females, and k = 4 for males. The models with only EGG metrics, on the other
hand, reach their maximum at k = 4 for all three groups. This can be taken to mean that,
with the present data set and with the chosen metrics, the EGG signal can resolve a larger
number of phonatory conditions than can the acoustic signal. In other words, for this data
set, the models with only EGG metrics can yield higher distinguishability of phonation
types while retaining a lower risk of overfitting.

Figure 5. BIC values for models with (a) acoustic metrics only and (b) EGG metrics only.

3.2. Description of the Classification Voice Maps

We first consider some basic examples of clustering to better understand the voice
map representation. In Figure 6, a male’s (a), a female’s (b), and a child’s (c) voice maps
with the same number of clusters k = 5 are displayed. The participant number is aligned
with the original order in the data set. Each group was trained separately, i.e., the male
group was trained with the voice and EGG signals of the 13 male participants. In this map,
each cell in the map represents a minimum of five phonatory cycles.

Figure 6. Examples of using k = 5 clusters, on individual full voice range voice maps (VRPs) on the
/a:/ vowel. (a) VRP of male participant M01; (b) VRP of female participant F02; (c) VRP of child (boy)
participant B02. The color plotted in any given cell is that of the most frequently occurring phonation
type in the cell, i.e., the dominant cluster. Each cell may contain cycles from other clusters as well,
but such overlap is not visualized here. The x-axis is fo in semitones (on the MIDI scale, 48 = 110 Hz),
and the y-axis is SPL in dB(C) at 0.3 m.
Inspecting broadly the overall contours, even though they are from individuals, it
can be seen that the male participant has a lower pitch than the female and the child; the child
has a smaller pitch and SPL range than the adults, fully as might be expected. The adults
exhibit a greater diversity of phonation types than the child. One can assume that every
participant uses his/her strategy so that the phonation type distribution would inevitably
vary. People will vary their phonation type in order to be able to phonate without undue
effort at different fo and SPLs. In the map of the male voice (Figure 6a), it can be seen that
there are three main regions in the higher, middle, and lower parts, which could be named,
from top to bottom, as loud, transition, and soft regions, together with some subgroups
around these regions. The map (Figure 6b) of the female voice shows similar partitions,
with an extra split in the soft region. Except for the child (Figure 6c), the phonation type as
classified by clustering is fairly consistent throughout the whole vocal range. Notice that
the outliers near the contours are also classified as a cluster. In the softest regions, relatively
breathy voices appear in the male and female participants.

3.2.1. Intra-Participant Distribution


Now, in Table 2 we may examine the distribution of phonation types in greater detail.
The change of the distribution is given with an increasing cluster number k. The coordinates
in the clustering space of the cluster centroids are given in polar plots in Appendix A.
Even though the cross-age comparison does not control well for the age variable and
the cross-age groups are divided into an adult (male and female) group and a child (before
mutation) group, the current plots still show clearly the large differences and variability
due to age (or, in some sense, the impact of voice mutation).
We now consider how the distribution of detected phonation types over the voice field
changes as the number k of clusters is varied. As k increases, the phonation
type map progressively splits into more regions. For the male participant, it goes from the
opposition of ‘loud’ and ‘soft’ for k = 2 to a third ’transition’ zone in between for k = 3.
Table 2. Male participant M01, female participant F06, and child participant B09, with their voice
maps by cluster numbers. Notice that the colors across k indicate only the relative position; the same
color does not necessarily denote the same cluster. The last row represents the speech range profiles,
marked as grids, overlapping on the density map, where the darker the region, the more VF cycles
are contained. This enables a comparison of the whole voice range with the habitual speech range.
The x-axis is fo in semitones (on the MIDI scale, 48 = 110 Hz), and the y-axis is SPL in dB(C) at 0.3 m.

[Table 2 consists of voice-map images in three columns (Male M01, Female F06, Child B09) and six
rows (k = 2, 3, 4, 5, 6, and the speech range profile).]

The breathy soft zone in the bottom is characterized by the maximal values of CSE
and Qci and very low CPPs. Together, the Q∆ and the Crest factor indicate great phonatory
instability in this area, as the centroid polar plot shows in Table A1. The firm and loud
regions at the top of the map are where the phonation is the most stable, described by the
CPPs, SB, Crest factor, Qci , and Q∆ at their maximum, and a minimal CSE. Those regions
are the loudest part of the voice map. Those metrics show a high-functioning phonation
type. The transition zone starts at the lower left part of the voice map and ends in the
upper right, with a slope of about +1.3 dB/semitone. The transition region is characterized
by a Qci slightly below its uncontacted 50% state. However, the values of the other metrics
stay in the range between the top and bottom regions.
With the change from three to four clusters, an interesting split occurs on the left side of
the loud region, in a cream color (cluster 2) next to the dark blue (cluster 4). It represents the low-pitch modal
phonation type and forms the center of the habitual speech range (when the participant
read the Rainbow passage thrice) that is denoted by the grids in the bottom row. Its centroid
exhibits a very high Q∆ and Crest factor and a Qci of about 60%, while the voice metrics
CSE, CPPs, and SB keep the same level as in the transition zone. This split corresponds to
the habitual speech range.
When it comes to k = 5, the classifier picks up the breathiest regions as a new cluster,
characterized by zero CPPs, high SB, CSE, and Qci, and a moderate Crest factor and Q∆.
These illustrate the significant instability in the very soft voice. Finally, with k = 6, the
transition zone in the male splits into a lower and an upper section, with the latter largely
corresponding to the falsetto voice (color 5).
In the map of the female voice, the progression of splitting is structurally different—see also Table A2 for more details. When the cluster number k equals 3, a similar ‘loud-transition-soft’ distribution is manifested, with a set of centroids close to those of the male voice. However, when k increases to 4, it is instead the soft region that starts to split up, delineating the breathy region, which is characterized by zero CPPs and large CSE, SB, and Qci as adduction increases. The fifth cluster again acts on the soft or breathy regions, evolving into a cluster with lower Q∆ and higher CPPs than the fourth cluster, owing to a slightly more intense mode of phonation. The sixth cluster adds little to the overall picture. In this case, we may take the cluster number k = 4 for the females to be consistent with the BIC curve because, in the soft or breathy region, there is a difference only in breathiness.
The child’s voice map shows similarities with that of the adult female—see Table A3.
When the cluster number k is increased beyond 4, the general distribution does not
change much.
The values of the metrics for all centroid locations are listed in Table A4.

3.2.2. Union Maps of the Classification


The inter-participant variability in voice maps is considerable, but there are methods
for deriving a map that describes a group of participants. In Table 3, we present ‘union’
cluster maps where the categorical cluster of one f o -SPL cell is determined by the majority
of all objects in the same group, i.e., the male, female, or child’s group. Constructing such
a union map can visualize some structural similarities within the groups. However, it is
worth noting that this data set has a limited number of participants. In addition to the
apparent physiological differences within groups, the participants also exhibit different
strategies of phonation.
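By way of illustration, this per-cell majority rule could be sketched in Python as follows. This is a minimal sketch with hypothetical array shapes, not the actual FonaDyn implementation; each participant's map is assumed to be a 2-D integer array over the same fo-SPL grid, holding a cluster label per cell or 0 where there is no data.

    import numpy as np

    def union_map(maps: list[np.ndarray], n_clusters: int) -> np.ndarray:
        """Majority-vote 'union' map over several participants' cluster maps.

        Each input map covers the same (SPL, fo) grid and holds a cluster
        label in 1..n_clusters, or 0 for cells without data.
        """
        stack = np.stack(maps)                      # (participants, n_spl, n_fo)
        out = np.zeros(stack.shape[1:], dtype=int)
        for idx in np.ndindex(*stack.shape[1:]):
            labels = stack[(slice(None),) + idx]
            labels = labels[labels > 0]             # ignore cells with no data
            if labels.size:
                out[idx] = np.bincount(labels, minlength=n_clusters + 1).argmax()
        return out

Ties are resolved here toward the lowest cluster label; any other tie-breaking rule could equally be used.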
Table 3. ‘Union’ maps of phonation types that display the majority count of all the cluster data in one group. The columns represent the groups: 13 males, 13 females, and 22 children. The rows represent the cluster number k from 2 to 6. The x-axis is fo in semitones (on the MIDI scale, 48 = 110 Hz), and the y-axis is SPL in dB(C) at 0.3 m.

[Table 3 consists of fifteen voice-map images: one per group (13 Males, 13 Females, 22 Children) for each of k = 2, 3, 4, 5, and 6.]

Overall, male voices generally have a larger pitch and volume range than females and children. In contrast, the female and child voices are more concentrated. All three groups produce similar distributional trends with k = 2 and k = 3. Soft and loud regions partition the map, and then a transition zone slanted upward appears in between these categories, though arrayed in slightly different manners in each group. When it comes to k = 4, the process of the split has changed. The fourth cluster occurs in the male's loud region, the female's soft region, and the children's outer circle. This could mean that the male's voice is more stable in the soft region, and the female's voice is more stable in the loud region, as the order of split may indicate the likelihood of changes. The children's voice group was stationary for this stage.

3.2.3. Acoustic Versus EGG Metrics


Instead of showing just one metric across the voice range, the voice map of clusters offers a way of picking up information simultaneously from several metrics of both the acoustic and EGG signals. For example, Figure 7 shows the map layers of all metrics for comparison with k = 4 clusters in another male, M06.

Figure 7. Voice map layers of the phonation types and the underlying metrics, from a male speaker, M06, when k = 4. Clockwise from top left: four phonation type clusters, the Crest factor, the spectrum balance (SB), the cepstral peak prominence smoothed (CPPs), the EGG cycle-rate sample entropy (CSE), the normalized peak dEGG (Q∆), and the EGG quotient of contact by integration (Qci). The x-axis is fo in semitones (on the MIDI scale, 48 = 110 Hz), and the y-axis is SPL in dB(C) at 0.3 m.
The Crest factor contributes the most to the cluster corresponding to the speech range on the left of this voice map, colored by the pale yellow in the phonation type layer (and the redder region on the left of the Crest factor layer). At the same time, in the EGG metrics, the Qci and Q∆ seemingly imply the tendency from the lower left to the upper right. The CPPs and CSE, as stability indicators for the voice signal and the EGG signal, each independently expose the lower breathy and soft regions. Though the CPPs and CSE show similar impacts on distinguishing the loud and soft regions, a slightly different portrait in the transitional region affects the boundaries of the loud and the transition voices.
Are all these metrics needed? To address this question, the classification was performed with EGG only (three metrics), acoustic only (three metrics), and EGG + acoustic (six metrics), as can be seen for one participant in Figure 8. We see a better portrait of parallel layering in (Figure 8b) with acoustic metrics, but a better approximation of the modal speech range (the grey on the left around 50 semitones and 80 dB) in (Figure 8c) with EGG metrics. With combined acoustic and EGG features, we obtain a more structured voice map.

(a) (b) (c)


Figure 8. A classification that uses (a) all six metrics combined, (b) acoustic metrics only, and (c) EGG metrics only. Here, the acoustic metrics are the crest factor, CPPs, and SB, and the EGG metrics are the Qci, Q∆, and CSE. The x-axis is fo in semitones (on the MIDI scale, 48 = 110 Hz), and the y-axis is SPL in dB(C) at 0.3 m.
SPL in dB(C) at 0.3 m.

This confirms that acoustic metrics and EGG metrics manifest different aspects of vocal
function. Combining the metrics of the airborne signal and the EGG signal resolves the
phonation types in greater detail. However, if conditions permit only one set of metrics, that set can still be interpreted on its own; in this case, the acoustic metrics are better at revealing the structural features of the voice map. As a complement, the EGG metrics are more suited to finding the transition zone at the onset of vocal fold collision.
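As an illustration, the three classifications compared in Figure 8 could be reproduced along the following lines. This is a hedged sketch with stand-in random data; the column order and array names are hypothetical, and this is not the authors' FonaDyn code.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Stand-in per-cycle feature matrix; hypothetical column order:
    # [Crest factor, SB, CPPs, CSE, Qdelta, Qci]
    rng = np.random.default_rng(0)
    features = StandardScaler().fit_transform(rng.normal(size=(5000, 6)))

    subsets = {
        "all six (cf. Figure 8a)": [0, 1, 2, 3, 4, 5],
        "acoustic only (cf. 8b)": [0, 1, 2],
        "EGG only (cf. 8c)": [3, 4, 5],
    }
    for name, cols in subsets.items():
        km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
        labels = km.fit_predict(features[:, cols])  # one label per cycle
        # the labels would then be mapped back onto the (fo, SPL) grid per panel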

4. Discussion
As we can see above, the unsupervised learning method gives interpretable results that
are relatively consistent with our prior knowledge of phonation. Thus, it is acceptable to stay
with the current K-means++ method because of its computational rapidity and efficiency.
Our present choice of metrics was in part determined by the current availability
of metrics in the bespoke software that we are using as an acquisition front-end for the
clustering. There are of course several other candidates for metrics, especially if concurrent
automatic inverse filtering proves to be possible. Of particular interest would be metrics
that represent the power in the rest of the spectrum relative to that of the fundamental partial, such as the ‘harmonic richness factor’ (HRF) [28] for the inverse-filtered glottal flow signal; Lrest/H1 [29] for the radiated sound; or the maximum flow declination rate (MFDR).
This measurement paradigm strongly depends on the assumption that there should
exist distinct regions in the f o /SPL voice map. We expect to see connected regions represent-
ing phonation types instead of a speckled field of scattered dots. This assumption, to some
extent, serves as a validation criterion. Spatial coherence is what we expect the voice maps
to show, since our experience tells us that different kinds of phonation tend to be used in
different parts of the voice range. A further perceptual assessment performed by voice pro-
fessionals and clinicians will be required to establish the classified phonation modes either
physiologically or perceptually. This perceptual assessment would be purpose-oriented
and empirical. Different investigators may have different criteria for categorization into
phonation types. For example, a trained singer could require a more specialized set of
phonation types than a singing amateur. A phonetician would call for more information on
the vibratory status; thus, more information on the vocal folds is needed. For pathological
voices, the complexity might increase further; the optimum number of clusters could be
dependent on the type of voice disorder and the treatment involved. In this case, the
ground truth must also be task-oriented and requires professional or non-professional
knowledge to confirm the validity of the classification. To assist in such work, FonaDyn
includes a built-in audio player, which allows users to interactively listen to or annotate
what a region indicated in the voice map sounded like.
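One simple way to quantify this expectation of spatial coherence is the fraction of adjacent map cells that share a label. This is our own illustrative measure, not one used in the paper, assuming a 2-D integer cluster map with 0 marking empty cells:

    import numpy as np

    def label_coherence(cluster_map: np.ndarray) -> float:
        """Fraction of horizontally/vertically adjacent cell pairs that share a
        cluster label: near 1 for connected regions, lower for a speckled
        field of scattered dots. Cells labeled 0 (no data) are ignored."""
        same = total = 0
        for a, b in ((cluster_map[:, :-1], cluster_map[:, 1:]),
                     (cluster_map[:-1, :], cluster_map[1:, :])):
            valid = (a > 0) & (b > 0)
            total += int(valid.sum())
            same += int(((a == b) & valid).sum())
        return same / total if total else 0.0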

Many readers will be familiar with the MDVP (Multidimensional Voice Program), shown in Figure 9. The MDVP displays several objective voice quality parameters in a polar ‘radar’ plot, including Base Frequency (fo), Jitter, Shimmer, NHR (Noise to Harmonic Ratio), VTI (Voice Turbulence Index), and ATRI (Amplitude Tremor Intensity Index). MDVP is well established in clinical practice and may provide an objective yet non-invasive and comfortable alternative for assessing voice quality—and to some extent, diagnosing voice-related abnormalities. However, a limitation is that the MDVP is designed for use with individual sustained vowels, while most metrics are greatly affected by the fo and SPL.

[Figure 9: a polar ‘radar’ plot of the MDVP parameters Jitt, RAP, PPQ, vFo, ShdB, Shim, APQ, vAm, NHR, VTI, SPI, FTRI, DVB, DSH, and DUV.]

Figure 9. Example of an MDVP graph, adapted from [30].


With clustering, we can combine the advantage of applying multiple metrics with the conceptual and graphical convenience of a flat 2D voice map. Such a representation collates the systematic changes across the voice range, thus providing a more complete basis for classification. This could be helpful for the clinical diagnosis and evaluation of vocal status, for example, before and after medical intervention. The clinical application relies on the plausible assumption that different phonation types correspond to changes in the associated physiological structures and mechanisms, and/or that vocal intervention (such as therapy and surgery) has effects on different regions of one's voice range. In other words, we expect that the effects of interventions can be observed in the voice maps.
Even if voice maps cannot be used in certain application scenarios, we would encourage improving recording protocols, by controlling the fo and SPL well and avoiding excessive averaging of the metrics. The covariation between phonation type and pitch or SPL [1,31] will otherwise result in sources of variation that remain unaccounted for.
It is not quite safe to conclude that each cluster found represents a distinct type of phonation. The BIC analysis suggests a useful maximum number of clusters, and increasing k further will tend simply to stratify the data into more ‘clusters’, even when they are evenly distributed and not attributable to distinct modes of phonation. Applying the BIC criterion did not give a very clear outcome in this data set since the maxima in the BIC curves were not very prominent, except perhaps when all six metrics were clustered. Resolving this issue will require a deeper analysis of how the centroids were distributed.
The set of metrics chosen here has not been tested on some common phonation types, such as the creaky voice, nor have we yet tried to characterize pathological voices in this way. Other data sets will have to be collected, for which participants would be requested to produce a greater variety of phonation types. This problem becomes more prominent in a social or sociolinguistic context. “One person’s voice disorder might be another person’s phoneme” [32]. One phonation type considered ‘abnormal’ or ‘noisy’ could be another one’s feature, making only the comparison between pre- and post-intervention meaningful to some extent. A larger data set and a set of better-controlled parameters or algorithms are required to restrict these disadvantages.
Here, we have used the metric values themselves as the clustering features. It can be
seen in Figure 2 that piecewise linear segments can reasonably approximate the variation
with the SPL of the metrics, so it is possible that phonatory regions would be better
characterized by instead clustering the gradients of the metrics [26] (pp. 40–70). However,
computing the gradients (the metric derivatives with respect to SPL and f o ) increases noise,
and can be performed only when an entire map is completed, i.e., not in real-time. We
intend to explore this further in future investigations.
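In outline, such gradient features might be computed per completed map as follows. This is a sketch under the assumption that `layer` is a hypothetical 2-D array holding one metric averaged per fo-SPL cell, with NaN in empty cells:

    import numpy as np

    def gradient_features(layer: np.ndarray) -> np.ndarray:
        """d(metric)/dSPL and d(metric)/dfo for one metric layer of a finished
        voice map. NaN cells propagate into their neighbors' gradients, which
        is one reason this is noisier than the raw metric values."""
        d_spl, d_fo = np.gradient(layer)  # needs the whole map, hence offline only
        return np.stack([d_spl, d_fo], axis=-1)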

5. Conclusions
By classifying the phonation types through K-means++ and visualizing their distribution across the voice range on the voice map, we found possible differences across participants, genders, and ages. It appears that the males exhibited
a greater variety of phonation types than the females and the children. The optimal number
of clusters for males in this data set is six, while the optimal number of clusters for females
and children is four. This would be consistent with the notion of larger vocal folds being
able to enter into more different modes of vibration.
A systematic phonation type splitting with increasing k manifested itself differently in
the male, female, and child groups. The clinically relevant correlations of these distributions
remain to be validated by relating the centroid locations to perceptual and physiological
assessments. The EGG metrics contribute noticeably to the classification of phonation types
in terms of vocal fold information, thus complementing the acoustic metrics. It appears
that defining the phonation type by clustering features derived from the acoustic and EGG
signals and visualizing them over the voice field is a promising technique.

Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/app122312092/s1.
Author Contributions: Conceptualization and methodology, H.C. and S.T.; Software, H.C. and S.T.;
Data Curation, H.C.; Writing—original draft preparation H.C.; Writing—review and editing, S.T.;
Supervision, S.T. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the China Scholarship Council, a funding institution under
the Ministry of Education of China, grant number 202006010113 within the KTH-CSC programme.
The data collection was carried out in an earlier project [22] that was supported by a travel grant
for Rita Patel from the Wenner-Gren Foundations, number GFOh2018-0014. The FonaDyn software
development is supported by in-kind KTH faculty grants for the time of author S.T.
Institutional Review Board Statement: The data collection for the earlier study [22] was approved
by the Swedish Ethical Review Authority, Registration No. 2019–01283.
Informed Consent Statement: Informed consent was obtained from all participants involved in the
earlier study for which the data collection was made.
Data Availability Statement: Complete voice maps for all individual participants are provided in
the Supplementary File for the present article.
Acknowledgments: We are grateful to Rita R. Patel for initiating and collaborating on the preceding
prospective study, for which the data collection was originally carried out, and to the participants.
Andreas Selamtzis originally proposed the clustering of multiple metrics, during his doctoral studies
in 2017. Olov Engwall co-supervised H.C. and kindly offered helpful comments on the manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
The voice maps and polar plots of the metrics of the participants discussed above are
listed. The centroid polar plot is normalized to a range of 0…1 and represents the metric values of the centroids. The color scheme in the centroid polar plot is the same as in the phonation type map, thus indicating the same cluster. The right columns contain the individual metric
maps for the given participant. Notice that in Table 3, the ‘Union’ maps of the phonation
types share the same sets of centroids by groups.
Table A1. Male participant M01. Columns: cluster number; phonation type map; centroid polar plot (range 0…1); metric name; single metric voice map. Rows: k = 2 (Crest factor), k = 3 (Spectrum Balance), k = 4 (CPPs), k = 5 (Qci), k = 6 (Q∆), and the speech range profile (CSE).

[Table A1 consists of voice-map and polar-plot images for participant M01.]

Table A2. Female participant F06. The layout is the same as in Table A1: rows k = 2 (Crest factor), k = 3 (Spectrum Balance), k = 4 (CPPs), k = 5 (Qci), k = 6 (Q∆), and the speech range profile (CSE).

[Table A2 consists of voice-map and polar-plot images for participant F06.]

Table A3. Child participant B09. The layout is the same as in Table A1: rows k = 2 (Crest factor), k = 3 (Spectrum Balance), k = 4 (CPPs), k = 5 (Qci), k = 6 (Q∆), and the speech range profile (CSE).

[Table A3 consists of voice-map and polar-plot images for participant B09.]

Table A4. Centroid values for k = 2 . . . 6.

k   Centroid   Group   Crest Factor (dB)   SB (dB)   CPP (dB)   CSE   Q∆   Qci   Inferred Phonation Type
Male 1.829 −31.716 5.410 3.709 1.707 0.468 soft
1 Female 1.789 −26.473 5.114 4.328 1.533 0.489 soft
Children 1.953 −18.478 4.932 4.514 2.885 0.482 soft
2
Male 2.310 −24.547 9.762 0.387 3.702 0.424 loud
2 Female 2.083 −19.495 8.659 0.579 3.636 0.403 loud
Children 2.329 −12.529 8.436 1.231 5.260 0.395 loud
Male 1.820 −31.693 5.269 4.294 1.449 0.493 soft
1 Female 1.799 −26.559 5.062 4.442 1.470 0.497 soft
Children 1.955 −18.389 4.796 4.733 2.792 0.493 soft
Male 2.003 −27.684 7.291 0.957 3.125 0.370 transition
3 2 Female 1.843 −21.998 6.958 1.154 3.199 0.360 transition
Children 2.064 −17.004 6.847 1.810 4.275 0.382 transition
Male 2.532 −22.843 11.490 0.154 4.130 0.475 loud
3 Female 2.283 −17.535 10.083 0.216 4.003 0.443 loud
Children 2.530 −9.117 9.588 1.036 6.052 0.414 loud
Male 1.820 −31.700 5.267 4.296 1.444 0.493 soft
1 Female 1.798 −26.608 5.057 4.446 1.467 0.497 soft
Children 1.891 −22.639 4.302 4.759 3.112 0.490 soft
Male 2.975 −26.402 10.936 0.215 4.167 0.428 low modal
2 Female 2.487 −21.014 10.494 0.158 4.124 0.425 low modal
Children 2.077 −17.439 6.856 1.688 4.435 0.378 low modal
4
Male 1.973 −27.858 7.230 0.994 3.112 0.366 transition
3 Female 1.837 −22.417 6.940 1.169 3.201 0.356 transition
Children 2.111 −10.015 5.991 4.385 2.450 0.487 transition
Male 2.224 −20.864 11.427 0.164 3.994 0.499 loud
4 Female 2.063 −14.266 9.378 0.368 3.785 0.459 loud
Children 2.519 −9.527 9.786 0.829 6.296 0.410 loud
Male 2.01 −22.74 3.42 4.74 4.36 0.50 soft
1 Female 2.22 −18.41 3.29 4.30 5.84 0.53 soft
Children 3.068 −13.786 4.448 4.381 6.722 0.482 soft
Male 1.80 −33.64 4.74 4.09 2.42 0.48 transition
2 Female 1.70 −31.61 3.90 4.40 3.43 0.49 transition
Children 1.873 −22.840 4.283 4.758 3.037 0.488 transition
Male 3.30 −28.58 7.41 1.48 8.67 0.42 low modal
5 3 Female 1.87 −23.19 5.46 4.30 2.02 0.49 low modal
Children 2.060 −17.634 6.887 1.629 4.370 0.377 low modal
Male 2.04 −28.88 6.94 1.53 4.69 0.37 falsetto
4 Female 1.91 −22.52 7.15 1.38 4.15 0.36 falsetto
Children 1.995 −10.352 6.221 4.311 2.293 0.484 falsetto
Male 2.44 −21.87 11.27 0.27 6.87 0.48 loud
5 Female 2.35 −18.48 10.17 0.35 6.62 0.44 loud
Children 2.427 −9.160 10.077 0.642 6.033 0.406 loud

Table A4. Cont.

k   Centroid   Group   Crest Factor (dB)   SB (dB)   CPP (dB)   CSE   Q∆   Qci   Inferred Phonation Type
Male 2.020 −24.113 7.870 0.811 3.446 0.309 soft
1 Female 1.708 −31.765 4.632 4.123 1.390 0.497 soft
Children 2.092 −17.451 3.604 4.943 5.341 0.535 soft
Male 1.814 −32.139 5.199 4.536 1.350 0.497 transition
2 Female 1.890 −21.419 5.467 4.761 1.553 0.496 transition
Children 3.352 −13.193 4.898 4.073 6.815 0.447 transition
Male 1.909 −22.562 7.087 1.206 2.655 0.439 low modal
3 Female 2.510 −21.150 10.551 0.151 4.137 0.427 low modal
Children 1.849 −23.378 4.573 4.660 2.640 0.478 low modal
6
Male 2.990 −26.321 10.941 0.207 4.183 0.428 high modal
4 Female 1.755 −25.489 6.862 1.103 3.073 0.390 high modal
Children 2.064 −17.656 6.906 1.613 4.388 0.376 high modal
Male 2.000 −35.203 6.699 1.251 3.093 0.388 falsetto
5 Female 1.964 −18.399 7.246 1.102 3.414 0.322 falsetto
Children 2.009 −9.954 6.366 4.241 2.239 0.477 falsetto
Male 2.261 −21.097 11.878 0.110 4.128 0.503 loud
6 Female 2.076 −14.320 9.609 0.282 3.867 0.463 loud
Children 2.424 −9.169 10.097 0.625 6.039 0.406 loud

References
1. Kuang, J.; Keating, P. Vocal fold vibratory patterns in tense versus lax phonation contrasts. J. Acoust. Soc. Am. 2014, 136, 2784–2797.
[CrossRef] [PubMed]
2. Yu, K.; Lam, H. The role of creaky voice in Cantonese tone perception. J. Acoust. Soc. Am. 2014, 136, 1320–1333. [CrossRef]
[PubMed]
3. Wang, C.; Tang, C. The Falsetto Tones of the Dialects in Hubei Province. In Proceedings of the 6th International Conference on
Speech Prosody, SP 2012, Shanghai, China, 22–25 May 2012; Volume 1.
4. Davidson, L. The versatility of creaky phonation: Segmental, prosodic, and sociolinguistic uses in the world’s languages.
Wiley Interdiscip. Rev. Cogn. Sci. 2021, 12, e1547. [CrossRef] [PubMed]
5. Gordon, M.; Ladefoged, P. Phonation types: A cross-linguistic overview. J. Phon. 2001, 29, 383–406. [CrossRef]
6. Sundberg, J. Vocal fold vibration patterns and modes of phonation. Folia Phoniatr. Logop. 1995, 47, 218–228. [CrossRef]
7. Borsky, M.; Mehta, D.D.; Van Stan, J.H.; Gudnason, J. Modal and non-modal voice quality classification using acoustic and
electroglottographic features. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2281–2291. [CrossRef]
8. Hillenbrand, J.; Cleveland, R.A.; Erickson, R.L. Acoustic correlates of breathy vocal quality. J. Speech Hear. Res. 1994, 37, 769–778.
[CrossRef] [PubMed]
9. Gowda, D.; Kurimo, M. Analysis of breathy, modal and pressed phonation based on low frequency spectral density. In Proceedings
of the Interspeech, Lyon, France, 25–29 August 2013; pp. 3206–3210. [CrossRef]
10. Kadiri, S.R.; Alku, P. Glottal features for classification of phonation type from speech and neck surface accelerometer signals.
Comput. Speech Lang. 2021, 70, 101232. [CrossRef]
11. Hsu, T.Y.; Ryherd, E.E.; Ackerman, J.; Persson Waye, K. Psychoacoustic measures and their relationship to patient physiology in
an intensive care unit. J. Acoust. Soc. Am. 2011, 129, 2635. [CrossRef]
12. Kadiri, S.R.; Alku, P.; Yegnanarayana, B. Analysis and classification of phonation types in speech and singing voice.
Speech Commun. 2020, 118, 33–47. [CrossRef]
13. Selamtzis, A.; Ternström, S. Analysis of vibratory states in phonation using spectral features of the electroglottographic signal.
J. Acoust. Soc. Am. 2014, 136, 2773–2783. [CrossRef] [PubMed]
14. Ternström, S.; Pabon, P. Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice. Appl. Sci. 2022,
12, 11353. [CrossRef]
15. Selamtzis, A.; Ternström, S. Investigation of the relationship between electroglottogram waveform, fundamental frequency, and
sound pressure level using clustering. J. Voice 2017, 31, 393–400. [CrossRef] [PubMed]
16. Ternström, S. Normalized time-domain parameters for electroglottographic waveforms. J. Acoust. Soc. Am. 2019, 146, EL65.
[CrossRef]
Appl. Sci. 2022, 12, 12092 27 of 27

17. Johansson, D. Real-Time Analysis, in SuperCollider, of Spectral Features of Electroglottographic Signals. Master’s Thesis,
KTH Royal Institute of Technology, Stockholm, Sweden, 2016. Available online: https://kth.diva-portal.org/smash/get/diva2:
945805/FULLTEXT01.pdf (accessed on 24 November 2022).
18. Pabon, J.P.H. Objective acoustic voice-quality parameters in the computer phonetogram. J. Voice 1991, 5, 203–216. [CrossRef]
19. Fant, G. The LF-model revisited. Transformations and frequency domain analysis. Speech Trans. Lab. Q. Rep. Royal Inst. Tech.
Stockholm 1995, 36, 119–156.
20. Awan, S.N.; Solomon, N.P.; Helou, L.B.; Stojadinovic, A. Spectral-cepstral estimation of dysphonia severity: External validation.
Ann. Otol. Rhinol. Laryngol. 2013, 122, 40–48. [CrossRef]
21. Ternström, S.; Bohman, M.; Södersten, M. Loud speech over noise: Some spectral attributes, with gender differences. J. Acoust.
Soc. Am. 2006, 119, 1648–1665. [CrossRef]
22. Patel, R.R.; Ternström, S. Quantitative and Qualitative Electroglottographic Wave Shape Differences in Children and Adults
Using Voice Map-Based Analysis. J. Speech Lang. Hear. Res. 2021, 64, 2977–2995. [CrossRef]
23. Ternström, S.; Johansson, D.; Selamtzis, A. FonaDyn—A system for real-time analysis of the electroglottogram, over the voice
range. SoftwareX 2018, 7, 74–80. [CrossRef]
24. Damsté, P.H. The phonetogram. Pract. Otorhinolaryngol. 1970, 32, 185–187.
25. Schutte, H.K.; Seidner, W. Recommendation by the Union of European Phoniatricians (UEP): Standardizing voice area measure-
ment/phonetography. Folia. Phoniatr. 1983, 35, 286–288. [CrossRef] [PubMed]
26. Pabon, P. Mapping Individual Voice Quality over the Voice Range: The Measurement Paradigm of the Voice Range Profile.
Comprehensive Summary. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2018.
27. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM
Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
28. Childers, D.G.; Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Am. 1991, 90, 2394–2410.
[CrossRef] [PubMed]
29. Pabon, P.; Ternström, S. Feature Maps of the Acoustic Spectrum of the Voice. J. Voice 2020, 34, 161.e1–161.e26. [CrossRef] [PubMed]
30. Stomeo, F.; Rispoli, V.; Sensi, M.; Pastore, A.; Malagutti, N.; Pelucchi, S. Subtotal arytenoidectomy for the treatment of laryngeal
stridor in multiple system atrophy: Phonatory and swallowing results. Braz. J. Otorhinolaryngol. 2016, 82, 116–120. [CrossRef]
[PubMed]
31. Kuang, J. Covariation between voice quality and pitch: Revisiting the case of Mandarin creaky voice. J. Acoust. Soc. Am. 2017,
142, 1693. [CrossRef] [PubMed]
32. Ladefoged, P. Cross-Linguistic Studies of Speech Production. In The Production of Speech; MacNeilage, P.F., Ed.; Springer:
New York, NY, USA, 1983; pp. 177–188.
