Buenos Aires – 5 to 9 September, 2016
Acoustics for the 21st Century…
PROCEEDINGS of the 22nd International Congress on Acoustics
Speech Communication: Paper ICA2016-847
Speech recognition through the analysis of spoken syllables using autocorrelation function parameters
Alan Rubellin(a), Andrés Sabater(a)(b)

(a) Universidad Nacional de Tres de Febrero, Argentina, aurubellin@gmail.com
(b) BRUIT Engineering, Argentina, asabater@bruit-ing.com.ar
Abstract
This paper aims to establish a relationship between the identification of spoken syllables and
statistical parameters obtained from the running autocorrelation function (r-ACF). To accomplish
this, six different syllables were recorded by twenty female voices and twenty male voices. These
recordings were normalized, and the r-ACF parameters were calculated for each file in order to
analyse them statistically. Final results show a significant relationship between r-ACF parameters
and the frequency spectrum of the vowels, which can lead to a classification of consonants
according to pitch and loudness characteristics, regardless of the speaker.
Keywords: ACF, speech recognition, speech communication, acoustics

1 Introduction
Speech recognition is a methodology used to identify a message transmitted by spoken words in a
certain language. The goal of this method is to extract the information contained in the speech
communication regardless of the identity of the speakers, in the same way that it happens in a
dialogue between people.
Ando et al. developed a theory of auditory signal processing based on two auditory internal
representations, the monaural autocorrelation function (ACF), and the binaural interaural
correlation function (IACF) [1]. Features of the IACF correspond to spatial perceptual attributes
such as sound direction, apparent source width and subjective diffuseness. On the other hand,
the correlational features for primary speech recognition lie in the ACF parameters which
correspond to perceptual qualities related to pitch, timbre, duration and loudness.
Other studies have demonstrated that timbral distinctions can be produced both by changes in
the spectra of stationary sounds (e.g. vowels) and by transient fluxes in amplitude,
frequency and phase (e.g. consonants) [2].
The objective of this research is to statistically study the temporal characteristics of the running
autocorrelation function (r-ACF) of single spoken vowels and consonant-vowel (CV) syllables
from different speakers, in order to establish relationships between changes in the r-ACF
orthogonal parameters and changes in the pitch and timbre of the spoken vowels and CV syllables.
2 Materials and methods

2.1 Voice recording
Twenty male and twenty female speakers were recorded; each of them pronounced, in Spanish, the
vowel “a” and the CV syllables “la”, “ma”, “na”, “sa” and “ta”. The speakers cooperated with
good diction and pronunciation. The recording session took place in a room with low background
noise and a short reverberation time, avoiding adverse conditions such as a low signal-to-noise
ratio and distortion in the recording path. The signals were recorded at 16 bits with a sample
rate of 44.1 kHz, using the digital audio workstation Adobe Audition.
The recording equipment was composed of:
• Microphone Shure SM57, serial number: 5278LE
• Sound Card Focusrite Scarlett 8i6, serial number: UF6212413111
• Notebook Dell Inspiron 1545, serial number: 5D5S4N1
• Headphones Behringer 1545, serial number: G1437088223.
Figures 1 to 6 show the spectrogram of one sample of each syllable used in this work. It can be
seen that the sound energy is similarly distributed within the same bandwidth, except for the
first part, which corresponds to the attack generated while pronouncing the consonant of the
syllable. In the case of “sa”, the pronunciation of the “s” produces sound energy over a wide
frequency band with no significant tonal components.
Figure 1: Spectrogram of one sample of the syllable “a”.
Figure 2: Spectrogram of one sample of the syllable “la”.
Figure 3: Spectrogram of one sample of the syllable “ma”.
Figure 4: Spectrogram of one sample of the syllable “na”.
Figure 5: Spectrogram of one sample of the syllable “sa”.
Figure 6: Spectrogram of one sample of the syllable “ta”.
2.2 Data processing

All signals were passed through an A-weighted digital filter in order to approximate the effects of
sound transmission through the human ear. The recorded files were then processed with Matlab
software in order to obtain the r-ACF parameters according to the method of Sato and Wu [3].
The software setup was: integration interval 0.1 s, maximum time lag 0.2 s, running step 0.01 s.
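The running analysis described above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the Matlab software used in this work: the function name `running_acf`, the normalization by the zero-lag value of each frame, and the 200 Hz test tone are assumptions made for the example. The default arguments follow the setup stated in the text (integration interval 0.1 s, maximum lag 0.2 s, running step 0.01 s).

```python
import numpy as np

def running_acf(x, fs, integration=0.1, max_lag=0.2, step=0.01):
    """Return (frame_times, lags, phi, energy) for a mono signal x sampled at fs.

    phi[i, k] is the ACF of frame i at lag k, normalized by its zero-lag value
    (a simplifying assumption); energy[i] is the unnormalized zero-lag value.
    """
    x = np.asarray(x, dtype=float)
    n_int = int(round(integration * fs))   # samples in the integration interval
    n_lag = int(round(max_lag * fs))       # samples in the maximum lag
    n_step = int(round(step * fs))         # samples between successive frames
    starts = np.arange(0, len(x) - n_int - n_lag + 1, n_step)
    phi = np.empty((len(starts), n_lag + 1))
    energy = np.empty(len(starts))
    for i, n0 in enumerate(starts):
        window = x[n0:n0 + n_int]
        extended = x[n0:n0 + n_int + n_lag]
        # acf[k] = (1/N) * sum_n window[n] * x[n0 + n + k], for k = 0..n_lag
        acf = np.correlate(extended, window, mode="valid") / n_int
        energy[i] = acf[0]
        phi[i] = acf / acf[0] if acf[0] > 0 else 0.0
    return starts / fs, np.arange(n_lag + 1) / fs, phi, energy

# Hypothetical test signal: 0.5 s of a 200 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(int(0.5 * fs)) / fs
times, lags, phi, energy = running_acf(np.sin(2 * np.pi * 200 * t), fs)
```

For the test tone, every frame has a normalized ACF equal to 1 at a lag of 5 ms (40 samples), which is the period of the tone. At 44.1 kHz the same code applies, although an FFT-based correlation would be faster.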
After computing the r-ACF, five parameters were extracted for each file: Wφ(0), Φ(0), φ1, τe and
τ1, where Wφ(0) is related to the perception of timbre, Φ(0) and τe to loudness, φ1 to pitch
strength, and τ1 to pitch and timbre.
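As an illustration of how these five parameters can be read from a single frame of the normalized ACF, the sketch below uses simplified definitions that are assumptions of this example, not the procedure of Sato and Wu [3]: Wφ(0) is taken as twice the first delay at which the normalized ACF falls to 0.5, τ1 and φ1 as the delay and amplitude of the largest peak at non-zero delay, and τe as the delay at which a straight line fitted to the initial peak levels (in dB) reaches -10 dB. The function name `acf_parameters`, the reference energy `ref` and the synthetic ACF are hypothetical.

```python
import numpy as np

def acf_parameters(lags, phi, energy, ref=1.0):
    """Return (Phi0_dB, tau1, phi1, tau_e, W_phi0) for one normalized ACF frame.

    lags: delays in seconds starting at 0; phi: normalized ACF with phi[0] == 1;
    energy: unnormalized ACF at zero lag; ref: assumed reference for the dB scale.
    """
    lags = np.asarray(lags, dtype=float)
    phi = np.asarray(phi, dtype=float)
    phi0_db = 10.0 * np.log10(energy / ref)

    # W_phi(0): twice the first delay at which phi falls to 0.5 (linear interpolation).
    below = np.flatnonzero(phi < 0.5)
    if below.size == 0:
        raise ValueError("phi never falls below 0.5")
    k = below[0]
    frac = (phi[k - 1] - 0.5) / (phi[k - 1] - phi[k])
    w_phi0 = 2.0 * (lags[k - 1] + frac * (lags[k] - lags[k - 1]))

    # Local maxima of phi at non-zero delay.
    i = np.arange(1, len(phi) - 1)
    peaks = i[(phi[i] > phi[i - 1]) & (phi[i] >= phi[i + 1])]
    if peaks.size == 0:
        raise ValueError("no ACF peak found")

    # tau1, phi1: delay and amplitude of the largest peak (assumption of this sketch).
    p1 = peaks[np.argmax(phi[peaks])]
    tau1, phi1 = lags[p1], phi[p1]

    # tau_e: line fitted to the peak levels above -5 dB (plus the origin),
    # extrapolated to -10 dB.
    initial = peaks[phi[peaks] > 10 ** (-0.5)]
    x = np.concatenate(([0.0], lags[initial]))
    y = np.concatenate(([0.0], 10.0 * np.log10(phi[initial])))
    if x.size < 2:
        tau_e = float("nan")
    else:
        slope, intercept = np.polyfit(x, y, 1)
        tau_e = (-10.0 - intercept) / slope
    return phi0_db, tau1, phi1, tau_e, w_phi0

# Synthetic ACF: 200 Hz periodicity with an envelope that decays to 0.1 at 60 ms.
fs = 44100
lags = np.arange(int(0.2 * fs) + 1) / fs
phi = 10 ** (-lags / 0.060) * np.cos(2 * np.pi * 200 * lags)
phi0_db, tau1, phi1, tau_e, w_phi0 = acf_parameters(lags, phi, energy=10.0)
```

For this synthetic frame the sketch recovers τ1 close to 5 ms and τe close to 60 ms, the values used to build it.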
Statistical parameters were then calculated for each r-ACF parameter of each sound signal
using Microsoft Excel in order to obtain global values and to detect minimum and maximum
deviations. The statistical parameters used in this research were: median (MEDIAN), harmonic
mean (HARMEAN), average of absolute deviations (AVEDEV), sum of squares of deviations
(DEVSQ), and population standard deviation (STDEV.P) [4].
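For readers who do not use Excel, the five statistics have direct NumPy equivalents. The sketch below is an illustration only; the function name `summarize` and the sample values are hypothetical. As with HARMEAN in Excel, the harmonic mean is only meaningful for strictly positive values.

```python
import numpy as np

def summarize(values):
    """Return the five summary statistics for one r-ACF parameter track."""
    v = np.asarray(values, dtype=float)
    dev = v - v.mean()                                 # deviations from the arithmetic mean
    return {
        "MEDIAN": float(np.median(v)),
        "HARMEAN": float(len(v) / np.sum(1.0 / v)),    # requires strictly positive values
        "AVEDEV": float(np.mean(np.abs(dev))),         # average of absolute deviations
        "DEVSQ": float(np.sum(dev ** 2)),              # sum of squared deviations
        "STDEV.P": float(np.std(v)),                   # population standard deviation (ddof=0)
    }

# Hypothetical values, chosen only to make the arithmetic easy to check by hand.
stats = summarize([1.0, 2.0, 4.0])
```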
3 Results
All statistical values of the r-ACF parameters calculated from the vowel “a” were compared to
the values obtained from the CV syllables in order to evaluate significant differences in their
global magnitudes and in their temporal behaviour. As shown in Table 1, the median values of
the r-ACF parameters showed no significant differences between the vowel “a” and the CV
syllables, with the exception of Φ(0) and τe. In these cases, the differences are due to changes
in the amplitude level of the spoken syllables, since Φ(0) and τe are related to the perception
of loudness. The results indicate that “ta” is the syllable with the highest loudness, while “na”
is the syllable with the lowest.
Table 1: Median values of r-ACF parameters

Syllable   Φ(0) [dB]   τ1 [ms]   φ1      τe [ms]   Wφ(0) [ms]
“a”        9.567       5.501     0.716   59.651    0.371
“la”       8.857       6.504     0.755   69.658    0.477
“ma”       2.684       6.878     0.737   77.634    0.656
“na”       2.272       6.646     0.731   81.644    0.485
“sa”       7.086       4.518     0.729   37.589    0.179
“ta”       15.105      5.562     0.768   59.53     0.404
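The comparison against the vowel “a” can be reproduced directly from the medians in Table 1. The short sketch below is illustrative: the variable names are hypothetical and the values are copied from the table. It ranks the syllables by Φ(0) and lists the difference of each median with respect to “a”.

```python
# Median values from Table 1, in the column order Phi(0), tau1, phi1, tau_e, W_phi(0).
table1 = {
    "a":  (9.567, 5.501, 0.716, 59.651, 0.371),
    "la": (8.857, 6.504, 0.755, 69.658, 0.477),
    "ma": (2.684, 6.878, 0.737, 77.634, 0.656),
    "na": (2.272, 6.646, 0.731, 81.644, 0.485),
    "sa": (7.086, 4.518, 0.729, 37.589, 0.179),
    "ta": (15.105, 5.562, 0.768, 59.53, 0.404),
}

# Syllables ordered from lowest to highest median Phi(0).
ranking = sorted(table1, key=lambda syllable: table1[syllable][0])

# Difference of each median with respect to the vowel "a".
delta = {
    syllable: tuple(value - base for value, base in zip(row, table1["a"]))
    for syllable, row in table1.items()
}

print(ranking)
for syllable, row in delta.items():
    print(syllable, [round(d, 3) for d in row])
```

The ranking starts with “na” and ends with “ta”, in agreement with the text. The largest differences in τe with respect to “a” appear for “sa” (shorter) and “na” (longer).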
On the other hand, Figure 7 shows the temporal variations of the parameter τ1 obtained from the
r-ACF of the vowel “a” and the syllables “la”, “ma” and “na”. It can be seen that in all cases τ1
shows a similar stationary pattern over time. As mentioned in the previous section, τ1 is highly
related to the perception of pitch. Therefore, the formant frequencies of the vowel spectra can
be identified with a stationary behaviour of τ1. The tonal components generated while
pronouncing the consonants of the CV syllables “la”, “ma” and “na” also show a stationary
behaviour of τ1.
Figure 7: Temporal waveforms of τ1 corresponding to syllables “a”, “la”, “ma” and “na”.
Figures 8 and 9 show the temporal representation of τ1 obtained from the r-ACF of the CV
syllables “sa” and “ta”, respectively. It can be seen that the two waveforms show opposite
behaviours as time increases. In the case of “sa”, the initial low values of τ1 are related to the
pronunciation of the letter “s”, which involves a wide frequency band with no significant tonal
components. The syllable “ta” shows the opposite temporal waveform of τ1, which may be
generated by the impulsive component of the pronunciation of the letter “t”.
Figure 8: Temporal waveform of τ1 corresponding to syllable “sa”.
Figure 9: Temporal waveform of τ1 corresponding to syllable “ta”.
In all cases, the waveforms were plotted from averaged values of τ1 obtained from each of the
recorded signals of “a”, “la”, “ma” and “na”.
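One simple way to obtain such an averaged waveform is to average the τ1 tracks frame by frame. The sketch below assumes, for illustration only, that tracks of different lengths are truncated to the shortest one and that no time alignment is applied; the text does not state how the recordings were aligned. The function name `average_tracks` and the sample values are hypothetical.

```python
import numpy as np

def average_tracks(tracks):
    """Average several tau1 tracks frame by frame, truncated to the shortest track."""
    n = min(len(track) for track in tracks)
    stacked = np.vstack([np.asarray(track[:n], dtype=float) for track in tracks])
    return stacked.mean(axis=0)

# Hypothetical tau1 tracks in ms from two recordings of the same syllable.
mean_tau1 = average_tracks([[5.4, 5.5, 5.6], [5.6, 5.5]])
```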
4 Conclusions
The results indicate that it is possible to associate characteristic temporal behaviours of τ1
with different CV syllables that include the same single vowel, regardless of the speaker. It was
also possible to relate the temporal patterns of τ1 to the spectral characteristics of the
consonant involved in the CV syllable. This can lead to a primary classification of consonants
according to the temporal response of τ1, which is related to the pitch perceived in the
pronunciation of the syllable. Furthermore, the parameters Φ(0) and τe also showed significant
differences between the syllables studied in this work, which suggests a possible classification
of consonants according to the different loudness characteristics generated during the
pronunciation of syllables.
Future work may include further investigation of r-ACF parameters for the remaining single
vowels, for CV syllables with different consonants, and for syllables that feature more than one
vowel.
References
[1] Ando, Y.; Cariani, P.; Auditory and Visual Sensations; Springer, 2009.
[2] Ando, Y.; Autocorrelation-based features for speech representation; Journal of the Acoustical
Society of America; 133-5-2, 2013, pp 1-26.
[3] Sato, S.; Wu, S.; Definition of the effective duration (τe) of the running autocorrelation function of
music signals; Acta Acustica united with Acustica; vol. 97, 2010, pp 432-440.
[4] Walpole; Probabilidad y estadística para ciencias e ingeniería; Pearson Education, México, 2007.
