Abstract
The presented paper aims to establish a relationship between the identification of spoken
syllables and statistical parameters obtained from the running autocorrelation function (r-ACF).
To accomplish this, six different syllables were recorded by twenty female and twenty male
voices. These recordings were normalized, and the r-ACF parameters were calculated for each
file in order to analyse them statistically. Final results show a significant relationship between
the r-ACF parameters and the frequency spectrum of the vowels, which can lead to a
classification of consonants according to pitch and loudness characteristics, regardless of the
speaker.
1 Introduction
Speech recognition is a methodology used to identify a message transmitted by spoken words
in a certain language. The goal of this method is to extract the information contained in the
speech communication regardless of the identity of the speaker, just as happens in a dialogue
between people.
Ando et al. developed a theory of auditory signal processing based on two internal auditory
representations: the monaural autocorrelation function (ACF) and the binaural interaural
cross-correlation function (IACF) [1]. Features of the IACF correspond to spatial perceptual
attributes such as sound direction, apparent source width and subjective diffuseness. The
correlational features for primary speech recognition, on the other hand, lie in the ACF
parameters, which correspond to perceptual qualities related to pitch, timbre, duration and
loudness.
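To make these ACF factors concrete, the snippet below is an illustrative sketch, not the authors' implementation: it computes the normalized ACF of a single signal frame and extracts Φ(0) (the energy at zero delay, a loudness correlate) together with τ1 and its amplitude φ1 (the delay and amplitude of the first major peak, pitch correlates). The FFT-based ACF computation and the 2 ms lower bound on the peak search are assumptions made for this example.

```python
import numpy as np

def acf_features(frame, fs):
    """Normalized ACF of one frame, returning Phi(0) (zero-delay energy,
    a loudness correlate), tau_1 (delay of the first major peak, a pitch
    correlate) and its amplitude phi_1. Illustrative sketch only."""
    n = len(frame)
    # Linear autocorrelation via zero-padded FFT, non-negative lags only
    spec = np.fft.rfft(frame, 2 * n)
    acf = np.fft.irfft(spec * np.conj(spec))[:n]
    phi0 = acf[0]                      # energy at zero delay
    nacf = acf / phi0                  # normalize so nacf[0] == 1
    # Skip delays below 2 ms to avoid the zero-lag lobe (an assumption)
    start = int(0.002 * fs)
    k = start + int(np.argmax(nacf[start:]))
    return phi0, k / fs, nacf[k]       # Phi(0), tau_1 [s], phi_1

# A 200 Hz tone should yield tau_1 near its period, 1/200 = 5 ms
fs = 16000
t = np.arange(int(0.05 * fs)) / fs
phi0, tau1, phi1 = acf_features(np.sin(2 * np.pi * 200 * t), fs)
```

For a periodic sound, 1/τ1 approximates the perceived pitch, and φ1 indicates pitch strength.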
Other studies have demonstrated that timbral distinctions can be produced both by changes in
the spectra of stationary sounds (i.e. vowels) and by transient fluctuations in amplitude,
frequency and phase (i.e. consonants) [2].
The objective of this research is to study statistically the temporal characteristics of the running
autocorrelation function (r-ACF) of single spoken vowels and consonant-vowel (CV) syllables
from different speakers, in order to establish relationships between changes in the orthogonal
r-ACF parameters and changes in the pitch and timbre of the spoken vowels and CV syllables.
22nd International Congress on Acoustics, ICA 2016
Buenos Aires – 5 to 9 September, 2016
Figures 1 to 6 show the spectrogram of one sample of each syllable used in this work. It can be
seen that the sound energy is similarly distributed within the same bandwidth, except for the
first part, which corresponds to the attack generated while pronouncing the consonant of the
syllable. In the case of “sa”, the pronunciation of the “s” produces sound energy over a wide
frequency bandwidth with no significant tonal components.
3 Results
All statistical values of the r-ACF parameters calculated from the vowel “a” were compared to
the values obtained from the CV syllables in order to evaluate significant differences in their
global magnitudes and in their temporal behaviour. As shown in Table 1, the median values of
the r-ACF parameters showed no significant differences between the vowel “a” and the CV
syllables, with the exception of Φ(0) and τe. In these cases, the differences are due to changes
in the amplitude level of the spoken syllables, since Φ(0) and τe are related to the perception of
loudness. The results indicate that “ta” is the syllable with the highest loudness, while “na” is
the syllable with the lowest.
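For orientation, the effective duration τe defined in [3] is the delay at which the envelope of the normalized ACF decays to 10% of its zero-delay value (-10 dB), commonly estimated by a straight-line fit to the initial decay on a dB scale. The sketch below illustrates that idea only; the simple peak-picking and the fit range down to -10 dB are simplifying assumptions, not the exact procedure of [3].

```python
import numpy as np

def effective_duration(nacf, fs):
    """Estimate tau_e: the delay at which the envelope of the normalized
    ACF decays to -10 dB (10 percent), via a straight-line fit to the
    envelope peaks on a dB scale. Simplified sketch after [3]."""
    db = 10 * np.log10(np.maximum(np.abs(nacf), 1e-12))
    peaks = [(0, db[0])]               # zero-delay value is 0 dB
    for k in range(1, len(db) - 1):
        if db[k] > db[k - 1] and db[k] >= db[k + 1]:
            peaks.append((k, db[k]))   # local maximum of the envelope
            if db[k] < -10:            # fit only the initial decay
                break
    lags = np.array([p[0] for p in peaks], dtype=float)
    vals = np.array([p[1] for p in peaks])
    slope, intercept = np.polyfit(lags, vals, 1)   # dB per sample
    return (-10 - intercept) / slope / fs          # seconds to -10 dB

# A decaying cosine with envelope exp(-k/50) reaches -10 dB at
# k = 50*ln(10), about 115 samples, i.e. about 0.115 s at fs = 1000
fs = 1000
k = np.arange(200)
tau_e = effective_duration(np.exp(-k / 50.0) * np.cos(0.2 * np.pi * k), fs)
```

A short τe indicates a rapidly decaying, noise-like ACF envelope; a long τe indicates a sustained, repetitive signal.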
On the other hand, Figure 7 shows the temporal variations of the parameter τ1 obtained from
the r-ACF of the vowel “a” and the syllables “la”, “ma” and “na”. It can be seen that in all cases
τ1 shows a similar stationary pattern as time increases. As mentioned in the previous section,
τ1 is closely related to the perception of pitch. Therefore, the formant frequencies of the vowel
spectra can be identified with the stationary behaviour of τ1. The tonal components generated
while pronouncing the consonants of the CVs “la”, “ma” and “na” also show a stationary
behaviour of τ1.
Figure 7: Temporal waveforms of τ1 corresponding to syllables “a”, “la”, “ma” and “na”.
Figures 8 and 9 show the temporal representation of τ1 obtained from the r-ACF of the CVs
“sa” and “ta”, respectively. It can be seen that the two waveforms show opposite behaviours as
time increases. In the case of “sa”, the initial low values of τ1 are related to the pronunciation of
the letter “s”, which involves a wide frequency bandwidth with no significant tonal components.
The syllable “ta” shows an opposite temporal waveform of τ1, which may be generated by the
impulsive component of the pronunciation of the letter “t”.
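The temporal patterns discussed above can be reproduced with a running ACF: a short window slides along the recording and τ1 is extracted per frame. The following is an illustrative sketch (the window and hop sizes are assumptions, not the settings used in this work); a steady vowel gives a flat τ1 trajectory, while a noise-like onset such as “s” gives unstable initial values.

```python
import numpy as np

def tau1_trajectory(signal, fs, win=0.03, hop=0.01):
    """Running-ACF track of tau_1: slide a short window along the
    signal and keep the delay of the first major ACF peak per frame.
    Window and hop sizes are illustrative assumptions."""
    n, h = int(win * fs), int(hop * fs)
    start = int(0.002 * fs)            # skip the zero-lag lobe (< 2 ms)
    times, taus = [], []
    for i in range(0, len(signal) - n, h):
        frame = signal[i:i + n]
        # Linear autocorrelation of the frame via zero-padded FFT
        spec = np.fft.rfft(frame, 2 * n)
        acf = np.fft.irfft(spec * np.conj(spec))[:n]
        nacf = acf / acf[0] if acf[0] > 0 else acf
        k = start + int(np.argmax(nacf[start:]))
        times.append(i / fs)           # frame start time [s]
        taus.append(k / fs)            # tau_1 for this frame [s]
    return np.array(times), np.array(taus)

# A steady 200 Hz tone should give a flat trajectory near 5 ms
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
times, taus = tau1_trajectory(np.sin(2 * np.pi * 200 * t), fs)
```

Plotting `taus` against `times` gives a curve of the kind shown in Figures 7 to 9.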
In all cases, the waveforms were plotted from the averaged values of τ1 obtained from each
recorded signal of “a”, “la”, “ma” and “na”.
4 Conclusions
The results point out that it is possible to identify characteristic temporal behaviours of τ1 for
different CV syllables that include the same single vowel, regardless of the speaker. It was also
possible to relate the temporal patterns of τ1 to spectral characteristics of the consonant
involved in the CV syllable. This can lead to a primary classification of consonants according to
the temporal response of τ1, which is related to the perception of pitch in the pronunciation of
the syllable. Furthermore, the parameters Φ(0) and τe also showed significant differences
between the syllables studied in this work, enabling a possible classification of consonants
according to the different loudness characteristics generated during the pronunciation of the
syllables.
Future work may include further investigation of the r-ACF parameters for the remaining single
vowels, for CV syllables with different consonants, and for syllables that feature more than one
vowel.
References
[1] Ando, Y.; Cariani, P.; Auditory and Visual Sensations; Springer, 2009.
[2] Ando, Y.; Autocorrelation-based features for speech representation; Journal of the Acoustical
Society of America; 133-5-2, 2013, pp. 1-26.
[3] Sato, S.; Wu, S.; Definition of the effective duration (τe) of the running autocorrelation function of
music signals; Acta Acustica united with Acustica; vol. 97, 2010, pp. 432-440.
[4] Walpole; Probabilidad y estadística para ciencias e ingeniería; Pearson Education, México, 2007.