Lec9-10 Speech Processing

Lecture: 9-10
Acoustic Phonetics & Time Domain Features
Dr. Shikha Tripathi, PES, Blr
¡  Model of Vocal Tract
§  LTI Model for speech production
§  Uniform lossless tube model
§  Concatenated tube model
§  Electric Circuit Model
§  Lip radiation
¡  Spectrograms
§  Narrowband
§  Wideband
¡  Acoustic Phonetics
30/08/2019 Dr.Shikha Tripathi@PESU Blr 2
¡  Acoustic Phonetics
¡  Auditory Perception
¡  Time domain methods of speech processing
§  ST Energy
§  ST Average Magnitude Function


In English, the combinations of features give 40 phonemes as follows
Phonemes in American English.

¡  Phoneme classifications: Vowels, Diphthongs, Semivowels
and Consonants
¡  Vowels (voiced segments)
§  Voiced component of sound (except when whispered)
§  The phonemes with the greatest intensity; they range in duration
from 50 to 400 ms in normal speech
§  Vowel energy is primarily concentrated below 1 kHz and falls off
at about -6 dB/octave with frequency
§  The excitation is periodic, characterized by the fundamental frequency
of the vocal cords, and the sound is modulated (filtered) as it passes
through the vocal tract (VT)
§  If these vowels are recorded and the power spectral density (PSD) is
plotted, the spectral peaks give the formant frequencies
§  As different words containing vowels are uttered, the positions of the
formants change depending on the shape of the VT and mouth cavity
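
The periodic excitation described above has a fundamental period that can be read off the short-time autocorrelation of a voiced frame. The following is a minimal sketch, not the lecture's own algorithm: the synthetic frame (125 Hz fundamental plus two weaker harmonics), the 8 kHz sampling rate and the 60 to 400 Hz search range are all assumptions chosen for illustration.

```python
import math

def autocorrelation(frame, lag):
    """Short-time autocorrelation of one frame at a single lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def estimate_pitch_hz(frame, fs, f_min=60.0, f_max=400.0):
    """Pick the lag with the largest autocorrelation inside the allowed
    pitch range and convert that lag (the pitch period) to Hz."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return fs / best_lag

fs = 8000          # assumed sampling rate (Hz)
f0 = 125.0         # assumed fundamental frequency (Hz)
# Synthetic voiced frame, 40 ms long: fundamental plus two weaker harmonics.
frame = [math.sin(2 * math.pi * f0 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
         + 0.25 * math.sin(2 * math.pi * 3 * f0 * n / fs)
         for n in range(320)]

pitch_hz = estimate_pitch_hz(frame, fs)
print(pitch_hz)    # pitch period of 64 samples -> 125.0 Hz
```

Real speech usually needs extra care (for example centre clipping or a normalized autocorrelation) before the peak is reliable; those steps are left out here.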

Vowels:
Vocal tract profiles for vowels in American English.

Waveform, wideband spectrogram, and spectral slice of
narrowband spectrogram for two vowels: (a) /i/ as in “eve”;
(b) /a/ as in “father.”
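
A spectral slice like the one in the figure can be imitated on a synthetic vowel. The sketch below assumes a simple source-filter setup: an impulse train at 100 Hz passed through two cascaded two-pole resonators. The formant centres (700 Hz and 1200 Hz) and bandwidths (100 Hz) are illustrative assumptions, not measurements of any real vowel. The harmonic amplitudes are then searched for local maxima, which fall at the assumed formants.

```python
import cmath
import math

def resonator(x, fc, bw, fs):
    """Two-pole digital resonator modelling one formant
    (centre frequency fc and bandwidth bw, both in Hz)."""
    r = math.exp(-math.pi * bw / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def dft_magnitude(x, freq, fs):
    """Magnitude of the DFT of x evaluated at one frequency (Hz)."""
    w = -2j * math.pi * freq / fs
    return abs(sum(xn * cmath.exp(w * n) for n, xn in enumerate(x)))

fs, f0 = 8000, 100                      # assumed sampling rate and pitch (Hz)
period = fs // f0
# Periodic glottal excitation idealized as an impulse train.
excitation = [1.0 if n % period == 0 else 0.0 for n in range(1600)]
speech = resonator(resonator(excitation, 700, 100, fs), 1200, 100, fs)
frame = speech[800:]                    # drop the start-up transient; 10 full periods remain

# Narrowband view: evaluate the spectrum at each pitch harmonic.
harmonics = list(range(f0, fs // 2, f0))
mags = [dft_magnitude(frame, f, fs) for f in harmonics]
formant_peaks = [harmonics[i] for i in range(1, len(mags) - 1)
                 if mags[i - 1] < mags[i] > mags[i + 1]]
print(formant_peaks)                    # [700, 1200]
```

Because the analysed frame holds an exact number of pitch periods, a rectangular window is enough here; on recorded speech a tapered window would normally be applied first.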

Diphthongs:
¡  Two sounds or two tones: two adjacent vowel
sounds occurring within the same syllable
¡  Also known as a gliding vowel
¡  For example: eye, hay, low, cow; the tongue moves
during the sound, and these words are said to contain diphthongs
¡  American English: 6 diphthongs
¡  Characterized by a time-varying VT area function
that varies between 2 vowel configurations

Semi-Vowel:
¡  A sound like /w/, /r/, /l/ or /y/ (phonetically
similar to a vowel sound but often functioning as a
syllable boundary). Called a semivowel because of
its vowel-like structure.
Consonants:
¡  Speech segment articulated using partial or
complete closure of the VT. For example: /p/ (lips
closed), /t/ (front part of tongue closed), /k/
(closure at the back of the tongue), /h/ (produced
in the throat)

¡  The second large phoneme grouping is that of
consonants containing number of subgroups:
¡  Consonants:
1. Nasals
2. Stops/Plosives (voiced or unvoiced)
3. Fricatives (voiced or unvoiced)
4. Whispers
5. Affricates

1. Nasals (nasal stops or nasal continuants):
§  Produced with a lowered velum
§  The lowered velum allows air to flow freely and escape through the nose
▪  /m/ (constriction at the lips)
▪  /n/ (constriction just behind the teeth)

Nasals:
Vocal tract configurations for nasal consonants.
Wideband spectrograms of nasal consonants (a) /n/ in “no” and (b) /m/ in
“me.”

2. Stops:
§  Produced by stopping airflow in the vocal tract
§  Voiced stops: /b/, /d/, /g/ (properties depend on
the vowel that follows them)
§  Unvoiced stops: /p/, /t/, /k/ (during closure the
vocal cords do NOT vibrate)
§  Plosives: stops with airflow released out of the mouth
▪  The term sometimes also includes nasal stops (airflow stopped in
the mouth and released through the nose)
Plosives:
Vocal tract configurations for unvoiced and voiced plosive pairs.

A schematic representation of (a) unvoiced and (b) voiced plosives.
The Voice Onset Time is denoted by VOT.

Waveform, wideband spectrogram, and narrowband spectral slice of
voiced and unvoiced plosive pair: (a) /g/ as in “go” (voice bar); (b) /k/ as
in “key” (aspiration)
3. Fricatives:
§  Special class of consonants produced by forcing air
through a narrow channel formed by placing
two articulators close together, for example the lower
lip and upper teeth as in /f/
§  The turbulent airflow is called frication (e.g. /s/, /z/, /f/)
§  Unvoiced: /f/, /s/, /sh/ (produced by turbulence alone)
§  Voiced: /v/, /z/, /zh/. Two excitation sources are involved:
one is periodic, due to vibration of the vocal cords (VC), and the other
is the air turbulence generated at the constriction

Fricatives:
Vocal tract configurations for pairs of voiced and unvoiced fricatives.

Waveform, wideband spectrogram, and narrowband spectral
slice of voiced and unvoiced fricative pair: (a) /v/ as in “vote”;
(b) /f/ as in “for.”
4. Whispers:
§  Special case of speech. There is no fundamental
frequency because the vocal cords do not vibrate; the listener
perceives the first formant frequency instead.
5. Affricates (combination of a stop and a fricative):
§  Consonants such as /pf/ and /kx/. They begin as a stop
like /t/ or /d/ and release as a fricative
like /s/ or /z/

Prosody: Melody of speech
§  Long-time variations (changes extending over more
than one phoneme), i.e. in pitch (intonation),
amplitude (loudness) and timing (articulation rate or
rhythm), follow the rules of prosody of a language
(speaker dependent)

¡  Speech segment:
§  Silence: The waveform looks like grass; little or no
energy, equivalent to low-level noise
§  Unvoiced: Somewhat higher amplitude than
the silence part
§  Voiced: Higher energy than unvoiced
E_V > E_U > E_S (voiced > unvoiced > silence)
¡  Generally the log of the energy is computed over frames of about
20 ms
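
The ordering of the three segment energies can be checked with short-time energy and the short-time average magnitude function computed over 20 ms frames. This is a minimal sketch on synthetic data: the amplitudes of the "silence", "unvoiced" and "voiced" segments, the 8 kHz sampling rate and the use of non-overlapping rectangular frames are assumptions made for illustration.

```python
import math
import random

def split_frames(x, frame_len):
    """Consecutive non-overlapping frames (rectangular window)."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def short_time_avg_magnitude(frame):
    """Sum of absolute sample values; less sensitive to large peaks than energy."""
    return sum(abs(s) for s in frame)

def log_energy_db(energy, floor=1e-12):
    """Log energy in dB, floored to avoid log(0) on digital silence."""
    return 10.0 * math.log10(max(energy, floor))

fs = 8000
frame_len = fs * 20 // 1000            # 20 ms -> 160 samples
n = 5 * frame_len                      # 100 ms per segment
rng = random.Random(0)

silence = [0.001 * rng.gauss(0.0, 1.0) for _ in range(n)]   # low-level noise
unvoiced = [0.05 * rng.gauss(0.0, 1.0) for _ in range(n)]   # noise-like, louder
voiced = [0.5 * math.sin(2 * math.pi * 120 * k / fs) for k in range(n)]

def mean(values):
    return sum(values) / len(values)

e_s = mean([short_time_energy(f) for f in split_frames(silence, frame_len)])
e_u = mean([short_time_energy(f) for f in split_frames(unvoiced, frame_len)])
e_v = mean([short_time_energy(f) for f in split_frames(voiced, frame_len)])
m_s = mean([short_time_avg_magnitude(f) for f in split_frames(silence, frame_len)])
m_u = mean([short_time_avg_magnitude(f) for f in split_frames(unvoiced, frame_len)])
m_v = mean([short_time_avg_magnitude(f) for f in split_frames(voiced, frame_len)])

for name, e in (("silence", e_s), ("unvoiced", e_u), ("voiced", e_v)):
    print(name, round(log_energy_db(e), 1), "dB")
```

Taking the log compresses the large spread between the three energies, which is one reason it is convenient for thresholding.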

¡  Acoustic properties of speech sounds essential
for phoneme discrimination by auditory system
§  Study of Acoustic aspects of speech sound that act as
perceptual cues
¡  Goal : preserve such properties in speech
processing

¡  Acoustic Cues: Acoustic components of
speech used by listener to correctly perceive
the phoneme

§  Vowels
§  Consonants

¡  Vowels: Characterized by Formant frequencies
§  Vowel identification is mapped to F1 and F2
§  Higher formants also contribute
¡  As formant frequencies scale with vocal tract length,
formant locations are normalized in phoneme
recognition
§  Or relative formant spacing is used as the essential
feature
¡  Nasalization: Cued primarily by bandwidth
increase of F1 and introduction of zeros.
¡  Consonant: Identification depends on:
§  Formant of consonant
§  Formant transition into formants of following
vowel
§  Voicing (or lack of voicing) of the vocal folds during or near
consonant production
§  Relative timing of the consonant and the onset of
following vowel
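
The remark that formant frequencies scale with tract length can be illustrated with the uniform lossless tube model, whose resonances (tube closed at the glottis, open at the lips) are F_n = (2n - 1)c / (4L). The sketch below is an assumption-laden illustration: the speed of sound (350 m/s) and the two tract lengths (17.5 cm and 14 cm) are example values, not figures from the lecture. The absolute formants shift with length while the ratios stay fixed, which is why relative spacing is a useful feature.

```python
def tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances of a uniform lossless tube, closed at one end and open
    at the other: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m)
            for n in range(1, n_formants + 1)]

long_tract = tube_formants(0.175)    # assumed longer vocal tract
short_tract = tube_formants(0.14)    # assumed shorter vocal tract

print([round(f) for f in long_tract])    # [500, 1500, 2500]
print([round(f) for f in short_tract])   # [625, 1875, 3125]

# Relative spacing is independent of length: F2/F1 = 3 and F3/F1 = 5.
ratios_long = [f / long_tract[0] for f in long_tract]
ratios_short = [f / short_tract[0] for f in short_tract]
print([round(r, 6) for r in ratios_long], [round(r, 6) for r in ratios_short])
```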


¡  Consider the voiced plosive consonants /b/, /d/
and /g/ and their unvoiced counterparts /p/,
/t/ and /k/
¡  Each plosive is characterized by
§  Formant locus: spectrum determined by vocal
tract configuration in front of closure
§  Formant transition: Movement from formant locus
spectrum to spectrum of following vowel
configuration
¡  Perception of consonants depends on
formant locus and formant transitions

¡  Consider discrimination between /b/ and /d/
followed by /a/ as in “ba” and “da”
¡  Two perceptual cues are
§  F1 of the formant locus is lower in /b/ than in /d/
§  F2 transitions are upward from /b/ to the following
vowel and downward from /d/ to the following
vowel
¡  As with plosive consonants, for fricative
consonants the speech spectrum during the
frication noise and the transition into the following
vowel are perceptual cues
¡  The time between the initial burst and the vowel onset is used to
discriminate between voiced and unvoiced
consonants
¡  For both plosive and fricative consonants, the
presence or lack of voicing also acts as a perceptual
cue for voiced vs. unvoiced consonants
¡  In addition to the direction of the formant transition,
the rate of transition also matters
§  Consider /b/ in “be”. If the transition duration increases
beyond 30 ms, “be” is transformed into “we” (semivowel)
¡  Physiological correlation of acoustic
attributes
§  Mechanism in brain that detects the acoustic
features ultimately leading to meaning of the
message
¡  Motor theory Model
§  Acoustic features map to articulatory features
§  We create in our brains a picture of the basic
articulatory movements responsible for acoustic
feature

¡  Articulatory variability suggests that speakers
and listeners rely on an acoustic representation
rather than an articulatory representation
¡  A combination of acoustic features and
articulatory mapping is better in speech
perception
¡  Feature detection is carried out by auditory
centers of brain: formant energy, bursts,
voice onset times, formant transitions and
presence (or lack) of voicing
¡ Need to consider how changes in
temporal and spectral properties
of speech waveform affect
perception

¡  Time domain processing
¡  Short Time zero crossing rate
¡  Time domain methods of speech processing
§  Speech and Silence Discrimination
▪  Algorithm to find start and end of word
§  Pitch Period estimation
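
A rough idea of how short-time energy and zero crossing rate support speech/silence discrimination is sketched below. It is a simplified illustration, not the algorithm presented in the lecture: the threshold rule (10 times the mean energy of the first five frames, which are assumed to be silence), the synthetic "word" (a 150 Hz tone in low-level noise) and all names are assumptions.

```python
import math
import random

def short_time_energy(frame):
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def find_endpoints(x, frame_len, n_noise_frames=5, factor=10.0):
    """Return (first, last) index of the frames whose energy exceeds
    factor times the noise energy estimated from the leading frames,
    or None if no frame exceeds the threshold."""
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, frame_len)]
    energy = [short_time_energy(f) for f in frames]
    threshold = factor * sum(energy[:n_noise_frames]) / n_noise_frames
    active = [i for i, e in enumerate(energy) if e > threshold]
    return (active[0], active[-1]) if active else None

fs = 8000
frame_len = fs * 20 // 1000                     # 20 ms frames
rng = random.Random(0)

# 0.2 s silence, 0.3 s voiced "word", 0.2 s silence, all in low-level noise.
signal = [0.001 * rng.gauss(0.0, 1.0) for _ in range(5600)]
for n in range(1600, 4000):
    signal[n] += 0.5 * math.sin(2 * math.pi * 150 * (n - 1600) / fs)

start, end = find_endpoints(signal, frame_len)
print("word spans frames", start, "to", end)    # frames 10 to 24
print("start time (s):", start * frame_len / fs)
print("end time (s):", (end + 1) * frame_len / fs)

zcr_noise = zero_crossing_rate(signal[:frame_len])
zcr_voiced = zero_crossing_rate(signal[15 * frame_len:16 * frame_len])
print("ZCR noise:", round(zcr_noise, 3), "ZCR voiced:", round(zcr_voiced, 3))
```

Voiced speech shows a low zero crossing rate while noise-like sounds show a high one, so the two measures are often combined: energy locates the strong voiced core and the zero crossing rate helps decide whether weak, noise-like frames at the edges belong to the word.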


