
# Tamil Computational Linguistics Workshop

## R. Rani, Technical Officer, CAS in Linguistics, Annamalai University

Parameters to be discussed:

- Periodic signal, aperiodic signal, complex signal
- Fourier analysis
- Speech waveform
- Formants
- Spectrum
- Spectrogram
- Fundamental frequency (F0)
- Intensity
- LPC

A sound wave is a cycle of pressure fluctuations (compression and rarefaction) brought about by an oscillating (sound) source.

The frequency (f) of a sound wave, which is perceived as its tonal height, or pitch, is the number of regular pressure fluctuations that occur within a given time interval. It is measured in hertz (Hz), i.e. cycles per second (cps), where every cycle corresponds to the oscillation pattern between two consecutive maxima (or between two minima). Frequency can be calculated either by counting how many periods occur per time unit (1 second) or by dividing the standard time unit (1 s) by the duration of the period T (measured in seconds):

f = 1/T

In other words, the frequency of a sound signal is the number of repetitive patterns of pressure oscillations (periods) in the sound wave per second, and it determines our perception of the tonal height of the sound. Frequency and period are inversely proportional: the shorter the period, the higher the frequency.
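The relation f = 1/T can be checked with a few lines of Python (a trivial sketch, not tied to any particular speech tool):

```python
# Frequency from period: f = 1/T.
def frequency_from_period(period_s):
    """Return frequency in Hz, given the period T in seconds."""
    return 1.0 / period_s

# A 10 ms period corresponds to a 100 Hz tone; a 5 ms period to 200 Hz.
print(frequency_from_period(0.01))   # 100.0
print(frequency_from_period(0.005))  # 200.0
```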

Periodic signals show a repeating pattern in their waveform (visible, for instance, in an oscillogram): the period (T). Besides sine tones, the so-called complex signals can also be periodic. Aperiodic signals, such as noise and impulses, have waveforms with no repeating pattern. Complex signals result from the addition of several signals: as mentioned when introducing the concept of phase, sound signals can be combined, and the resulting signal is obtained by adding the amplitudes, frequencies and phases of the component waves. The duration of one cycle of pressure fluctuations (considering the time dimension, in seconds) is called the period (T) of the sound wave; it is inversely proportional to the frequency value.

Fourier analysis

Fourier analysis (named after Jean-Baptiste Joseph Fourier) decomposes any periodic complex (non-sinusoidal) signal into its sine components (or Fourier components), each having a given frequency, amplitude and phase.
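This decomposition can be illustrated numerically. The sketch below (assuming NumPy; the sampling rate, frequencies and amplitudes are made up) builds a complex signal from two sines and recovers their frequencies with a fast Fourier transform:

```python
import numpy as np

# Build a complex signal from two sines (100 Hz and 250 Hz), then
# recover the component frequencies with an FFT.
fs = 8000                       # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs          # 1 second of samples
signal = 1.0 * np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 250 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest spectral peaks sit at the component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))            # [100.0, 250.0]
```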

The term fundamental frequency (F0) denotes the lowest frequency component of a signal. In the case of a sine tone (not a complex signal such as the ones we are dealing with now), just one frequency component is present, so it is not necessary to label it F0. In a complex signal there are several frequency components; the lowest of them is called the fundamental frequency (F0), and it is the greatest common divisor of the frequency components. For instance, the F0 of a complex signal composed of sine tones at 50, 100 and 150 Hz is 50 Hz.
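Numerically, the F0 of a set of integer-valued component frequencies is their greatest common divisor, as in this small sketch:

```python
from math import gcd
from functools import reduce

# F0 of a complex signal = greatest common divisor of its component
# frequencies (integer Hz assumed for this sketch).
def fundamental(components_hz):
    return reduce(gcd, components_hz)

print(fundamental([50, 100, 150]))   # 50, matching the example above
print(fundamental([300, 450, 600]))  # 150: F0 need not itself be a component
```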

Two sound waves whose frequency is, respectively, of 200 and 400 Hz

The amplitude (A) of a sound wave is the degree of change in air pressure (due to compression and rarefaction of the air molecules), which is measured in pascals (Pa). The greater the pressure changes caused by the wave, the higher the amplitude, and the louder we perceive the sound.

HARMONICS

When our vocal folds vibrate, the result is a complex wave, consisting of the fundamental frequency (which you just measured) plus other, higher frequencies called HARMONICS. In a complex signal, harmonics are frequency components that are either the fundamental frequency itself or integer multiples of F0; the term overtones, instead, refers only to the F0 multiples. To see these, we need to look at a NARROW-BAND spectrogram, which is more precise along the frequency domain than the default WIDE-BAND spectrogram.

## Figure: Determining the harmonics (H1 = F0, H2, …) and the formants (F1, F2, F3) by spectral slice

Spectrum

The term spectrum is often used to refer to a property of the signal (for instance, by saying that a signal has a steeply descending spectrum). In a spectrum we get information about the frequencies of the components that make up the signal (as we usually deal with complex signals) and about the energy carried by those frequency components (their intensity).

Formant

Within the spectrum of a sound signal, a formant is a concentration of energy over a given range of frequency components. Formants correspond to resonances in the vocal tract. We are able to distinguish among (voiced) phones, in particular among vowels, by exploiting the formant pattern of the acoustic signal. Formants are brought about by the filtering effect that the articulatory organs exert on the source signal: the articulators act as filters by dampening some of its frequency components and by strengthening other ones (in terms of energy). Every vowel has its peculiar formant pattern as a result of this phenomenon.

Which parameters characterise a formant? What is the difference between formants and filters? Like filters, a formant is characterised by information in terms of frequency and amplitude. Filters and formants are both described in terms of their frequency bandwidth (or of their cutoff frequencies) and of the steepness of their slope (dB per octave). However, these terms are certainly not synonyms: in the analysis of speech, formants are the outcome of the filtering effect of the articulators on the source signal, while filters are instruments able to modify the spectrum of a sound signal.

## What is the relation between formants and harmonics?

Harmonics are multiples of the fundamental frequency, and are thus relative to the source signal. Formants, instead, are a consequence of the filtering phenomenon acting on the source signal. There is no relation between them: a formant may accidentally happen to fall in correspondence with a harmonic, but this is just by chance.

In a spectral representation providing information about the harmonics, how can we calculate the fundamental frequency of the signal? There are two ways of calculating the F0 in this kind of spectrum: either by taking the frequency of the first harmonic, or by dividing the frequency range of a given interval by the number of harmonics present in that interval.
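Both readings can be sketched in a few lines (the harmonic frequencies below are hypothetical values, not measurements):

```python
# Two ways to read F0 off a harmonic spectrum.
harmonics_hz = [120, 240, 360, 480, 600]   # hypothetical harmonic peaks

# (1) F0 equals the frequency of the first harmonic.
f0_first = harmonics_hz[0]

# (2) F0 equals a frequency interval divided by the number of
#     harmonics in it: 5 harmonics span 0-600 Hz, so F0 = 600 / 5.
f0_interval = (harmonics_hz[-1] - 0) / len(harmonics_hz)

print(f0_first, f0_interval)   # 120 120.0
```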

A spectral envelope shows the formant contour, and provides information about frequency and energy. However, it does not provide information about the harmonics of a signal (those would be represented by vertical lines), and thus not about the fundamental frequency.

Does the formant pattern change if the fundamental frequency of the signal changes? The fundamental frequency does not have any influence on the formant pattern. F0 is a property of the source signal, while the formant pattern depends on the filtering process of the vocal cavities and of the articulators; they are thus virtually independent from one another. However, a difference in F0 does determine the accuracy of the estimated spectral envelope: the spectral envelope of a signal with a higher F0 (fewer harmonics in the spectrum) is less accurate than that of a signal with a lower F0 (more harmonics).

Are formants (and their influence over timbre) better defined in low-pitched sounds or in high-pitched sounds?

In order to identify phones, listeners exploit both formant and harmonic (F0-multiple) information. The lower the F0, the more harmonics are present; conversely, the higher the F0, the fewer harmonics are available to the listener's perception. Moreover, in low-pitched sounds (lower F0), formants are more likely to coincide with harmonics, and the influence of the formants in terms of timbre is more evident.

Linear Predictive Coding

LPC (Linear Predictive Coding) is a method to represent and analyse human speech. The idea of coding human speech is to change the representation of the speech: with LPC, the representation consists of LPC coefficients and an error signal, instead of the original speech signal. The LPC coefficients are found by LPC analysis, which describes the inverse transfer function of the human vocal tract, and the error signal is found by LPC estimation based on the LPC coefficients from LPC analysis.

Range of human speech F0:

| Speaker | F0 min (Hz) | F0 max (Hz) |
|---------|-------------|-------------|
| Men     | 80          | 200         |
| Women   | 150         | 350         |

## Working with Spectra

Sometimes, you need specific details about the frequencies and individual harmonics in a sound at a given moment in time, and examining a narrowband spectrogram alone does not provide sufficient information.

## In these cases, you'll need to take a spectral slice for analysis.

Spectral slices (also referred to as FFTs or spectra) are the result of a fast Fourier transform performed on a very small portion of the sound, providing you with very specific information about the frequencies present in the sound and their relative amplitudes.

Spectral slices are useful for a variety of measures of F0, nasality, creak, breathiness, and spectral tilt, and are a crucial part of many measurement workflows.

The LPC technique is used for encoding (through LPC analysis), transmitting and decoding (LPC synthesis) a digital signal by reducing redundant information. This is achieved through an estimation of the signal.

An LPC analysis window is usually 0.025 s long, as this allows at least 2 periods to fit in it. The period lengths in male, female and infant speakers are, respectively, 0.005-0.01 s, 0.003-0.005 s and 0.002-0.003 s.

Linear prediction

LPC analysis is based on linear prediction, which allows us to estimate the amplitude values of the signal samples within the analysis window. The digitised signal is made up of a series of samples (characterised by given frequencies and amplitude values). The amplitude prediction is performed by exploiting the values of some samples (the preceding ones) in order to compute those of the subsequent ones. In this way it is not necessary to assess all sample values within the window: from the values of some of them, the rest can be computed. The main focus, however, does not lie in the prediction itself, but rather in the collection of parameters that allow the prediction to be performed.
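A minimal sketch of this idea, assuming NumPy and using a plain least-squares fit (real LPC tools usually use the autocorrelation/Levinson-Durbin method; the test signal is illustrative):

```python
import numpy as np

# Linear prediction: estimate coefficients a_k so that
#   s[n] ~= a_1*s[n-1] + ... + a_p*s[n-p]
# by least squares over the analysis window.
def lpc_coefficients(signal, order):
    rows = [signal[n - order:n][::-1] for n in range(order, len(signal))]
    A = np.array(rows)
    b = signal[order:]
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

# A pure sinusoid is perfectly predictable from 2 past samples.
t = np.arange(200)
s = np.sin(2 * np.pi * 0.05 * t)
a = lpc_coefficients(s, order=2)
predicted = a[0] * s[1:-1] + a[1] * s[:-2]
print(np.allclose(predicted, s[2:]))   # True
```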

How are the formants distributed over the speech spectrum (in female and male speakers)?

In the spectrum of female speech a formant is to be found about every 1100 Hz, while in male speech about every 1000 Hz. The first 5 formants are the most important ones (and also the clearest in the spectrum), which is why the default frequency range of speech analysis programs is usually 0-5500 Hz (1100 Hz × 5). However, in children's speech the formants are higher than in adult individuals, and the frequency ceiling should be raised accordingly.

What is a coefficient in LPC analysis? How many coefficients describe a formant? A coefficient is a numerical value computed by the LPC analysis in order to predict the amplitude and frequency values of the signal's filter. This value is obtained by regression from the original signal. Through the coefficients, the frequency and bandwidth of the signal's formants are predicted; 2 coefficients describe one formant. The number of coefficients determines the number of formants to be identified by LPC: it should be at least twice the number of expected formants, as each formant is described by 2 coefficients. If the number of coefficients is set too high, non-existing formants will be identified in the LPC analysis; if it is set too low, not all of the formants will be recognised.
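One common way to turn LPC coefficients into formant estimates (a sketch of the general technique, not necessarily what any particular program does) is to find the roots of the prediction polynomial: each formant corresponds to a complex-conjugate pole pair, hence 2 coefficients per formant. Below, a hand-built damped sinusoid with a single resonance at 500 Hz is analysed with 2 coefficients; all parameter values are illustrative:

```python
import numpy as np

# One resonance at 500 Hz: a damped sinusoid (impulse response of a
# two-pole resonator), sampled at 8 kHz.
fs = 8000
freq, decay = 500.0, 0.99
n = np.arange(400)
s = decay**n * np.cos(2 * np.pi * freq / fs * n)

# Order-2 linear prediction by least squares.
A = np.column_stack([s[1:-1], s[:-2]])
a, *_ = np.linalg.lstsq(A, s[2:], rcond=None)

# Poles are roots of 1 - a1*z^-1 - a2*z^-2, i.e. of z^2 - a1*z - a2;
# the pole angle maps to the formant frequency.
poles = np.roots([1.0, -a[0], -a[1]])
formant_hz = abs(np.angle(poles[0])) * fs / (2 * np.pi)
print(round(formant_hz))   # 500
```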

LPC analysis decomposes speech sounds into two parts:

1. a filter function, consisting of the LPC coefficients: based on the assumption that the source has been filtered through a single variable-cross-section tube (like the vocal tract);
2. a source function, which can be either:
   a. the original signal inverse-filtered through the filter function, or
   b. a stylised version of (a), consisting of white noise (for unvoiced speech) or a pulse train (a buzz at the appropriate pitch, for voiced speech).

Vowel distinction

voiced/unvoiced speech
Speech can be divided into two classes: voiced and unvoiced. The difference between the two lies in the use of the vocal cords and the vocal tract (mouth and lips). When voiced sounds are pronounced, both the vocal cords and the vocal tract are used; because of the vocal cord vibration, it is possible to find the fundamental frequency of the speech. In contrast, the vocal cords are not used when pronouncing unvoiced sounds, so it is not possible to find a fundamental frequency in unvoiced speech. In general, all vowels are voiced sounds; examples of unvoiced sounds are /sh/, /s/ and /p/. There are different ways to detect whether speech is voiced or unvoiced: the fundamental frequency can be used to detect the voiced and unvoiced parts of speech, or the energy in the signal (signal frame) can be calculated, since there is more energy in a voiced sound than in an unvoiced one.
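The energy criterion can be sketched as follows (the threshold and the two toy signals are illustrative, not calibrated values; real detectors also use F0 and zero-crossing information):

```python
import numpy as np

# Toy voiced/unvoiced decision by frame energy: a frame whose mean
# squared amplitude exceeds a threshold is labelled voiced.
def is_voiced(frame, threshold=0.01):
    return float(np.mean(np.square(frame))) > threshold

fs = 8000
t = np.arange(int(0.02 * fs)) / fs                   # one 20 ms frame
voiced_frame = 0.5 * np.sin(2 * np.pi * 120 * t)     # vowel-like: strong, periodic
rng = np.random.default_rng(0)
unvoiced_frame = 0.01 * rng.standard_normal(len(t))  # /s/-like: weak noise

print(is_voiced(voiced_frame), is_voiced(unvoiced_frame))   # True False
```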

## Figure: waveform, spectrogram (pitch, intensity, formants) and LPC spectra of a and aa (male speaker)

## Figure: waveform, spectrogram and LPC spectra of i and ii (male speaker)

## Figure: waveform, spectrogram (pitch, intensity, formants) and LPC spectra of u and uu (male speaker)

## Figure: waveform, spectrogram (pitch, intensity, formants) and LPC spectra of a and aa (female speaker)

## Figure: waveform, spectrogram (intensity, pitch, formants) and LPC spectra of i and ii (female speaker)

## Figure: spectrogram and LPC spectra of u and uu (female speaker)

Male data

| Vowel | Duration (s) | Intensity (dB) | Pitch F0 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|-------|--------------|----------------|---------------|---------|---------|---------|
| a  | 0.42  | 67.7 | 119.2 | 1210.2 | 2764.9  | 3570.4  |
| aa | 0.54  | 69.8 | 121.9 | 1157.8 | 2866.8  | 3452.5  |
| i  | 0.416 | 69.5 | 127.4 | 2249.4 | 2999.6  | 3799.5  |
| ii | 0.633 | 69.7 | 126.8 | 2337.5 | 3412.8  | 3810.5  |
| u  | 0.412 | 69.6 | 125.9 | 830.9  | 2779.09 | 3888.8  |
| uu | 0.62  | 73.3 | 132.5 | 858.39 | 2716.6  | 3817.08 |

Female data

| Vowel | Duration (s) | Intensity (dB) | Pitch F0 (Hz) | F1 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|-------|--------------|----------------|---------------|---------|---------|---------|---------|
| a  | – | – | 236   | 875.37 | 1367.0 | 3354.3 | 4165.6 |
| aa | – | – | 230.7 | 850.2  | 1231.7 | 3402.0 | 3837.3 |
| i  | – | – | 278.4 | 292.16 | 2528.6 | 3361.1 | 4031.9 |
| ii | – | – | 256.8 | 276.35 | 2850.4 | 3338.7 | 3998.2 |
| u  | – | – | 292.9 | 344.4  | 806.1  | 2953.1 | 4301.8 |
| uu | – | – | 256.8 | 404.9  | 762.2  | 3024.6 | 4113.0 |

Word contrast

## Figure: waveform and spectrogram of alai; LPC spectrum of the initial a (lowest peak near 651.3 Hz)

## Figure: waveform and formant tracks of ilai; LPC spectrum of the initial i (lowest peak near 368.38 Hz)
## Figure: waveform, spectrogram and LPC spectrum of the initial u in ulai

| Word | Duration (s) | Intensity (dB) | Pitch F0 (Hz) | F1 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|------|--------------|----------------|---------------|---------|---------|---------|---------|
| alai (initial a) | – | – | – | 581.1 | 1363.6 | 2821.5 | 3683.3 |
| ilai (initial i) | – | – | – | 374.7 | 2027.8 | 3144.3 | 3789.5 |
| ulai (initial u) | – | – | – | 338.1 | 884.5  | 2600.2 | 3701.0 |

## Figure: waveform (with VOT) and spectrogram of pala; LPC spectrum of the medial a (peaks near 689.9, 1300.8, 2728.3 and 3725 Hz)

## Figure: waveform (with VOT) and spectrogram of kala; LPC spectrum of the medial a (peaks near 696.3, 1410, 2811.8 and 3718.5 Hz)

| Word | VOT (s) | Medial vowel duration (s) | Intensity (dB) | Pitch (Hz) | F1 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|------|---------|---------------------------|----------------|------------|---------|---------|---------|---------|
| pala | 0.020 | 0.95  | 78.2 | 125.6 | 648.0 | –      | –      | –      |
| kala | 0.033 | 0.094 | 76.9 | 124.1 | 659.4 | 1423.9 | 2783.1 | 3664.0 |
| puli | 0.054 | 0.074 | 61.6 | 136.1 | 381.1 | 1036.5 | 2433.1 | 3573.9 |

| Word | Duration (s) | Intensity (dB) | Pitch (Hz) | F1 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|------|--------------|----------------|------------|---------|---------|---------|---------|
| pala | 0.21  | 80.8 | 159.9 | 690.0 | 1650.5 | 3070.7  | 3750.9 |
| kala | 0.197 | 80.3 | 160.4 | 686.7 | 1570.0 | 2969.69 | 4051.8 |

## Figure: waveform (with VOT) and spectrogram of pa, pi, pu; LPC spectra of the three vowels with F1-F4 marked

Words: vowel after a voiceless stop

| Word | VOT (s) | F4 (Hz) |
|------|---------|---------|
| pa | 0.04  | 3652.4 |
| pi | 0.05  | 3665.6 |
| pu | 0.039 | 3555.6 |

## Figure: waveforms, spectrograms and formant tracks of icai and ticai

| Word | F1 (Hz) | F2 (Hz) | F3 (Hz) | F4 (Hz) |
|------|---------|---------|---------|---------|
| icai  | 345.4 | 2587.2 | 3255.93 | 4497.3 |
| ticai | 319.2 | 2531.9 | 3168.2  | 4426.3 |

## Figure: waveform and spectrogram of puli; LPC spectra of its u and i

## Figure: segmented waveform and spectrogram (segment durations: a 0.2381 s, glide 0.02 s, aa 0.3421 s)

LPC analysis
LPC analysis is used to construct the LPC coefficients for the inverse transfer function of the vocal tract. The standard methods for LPC coefficient estimation assume that the input signal is stationary. A quasi-stationary signal is obtained by framing the input signal, often into frames 20 ms in length. A more stationary signal results in a better LPC analysis, because the signal is better described by the LPC coefficients, which minimises the residual signal (also called the error signal).
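The framing step can be sketched as follows (non-overlapping 20 ms frames; practical systems usually overlap frames and apply a window, which is omitted here):

```python
# Cut a signal into non-overlapping 20 ms frames so that each frame
# is quasi-stationary for LPC analysis.
def frame_signal(samples, fs, frame_ms=20):
    frame_len = int(fs * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

fs = 8000
one_second = list(range(fs))          # stand-in for 1 s of audio samples
frames = frame_signal(one_second, fs)
print(len(frames), len(frames[0]))    # 50 160
```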

LPC estimation
LPC estimation calculates an error signal from the LPC coefficients produced by LPC analysis. This error signal, called the residual signal, is the part that could not be modelled by the LPC analysis. It is calculated by filtering the original signal with the inverse transfer function from LPC analysis. If the inverse transfer function from LPC analysis is equal to the vocal tract transfer function, then the residual signal from the LPC estimation is equal to the excitation that was put into the vocal tract; in that case the residual signal equals the impulses or noise of human speech production.
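The inverse filtering can be sketched as follows (assuming NumPy; the damped-sinusoid test signal and its coefficients are constructed by hand, so the residual is numerically zero by design):

```python
import numpy as np

# Residual = original signal passed through the inverse filter
# A(z) = 1 - sum_k a_k * z^-k. For a perfectly modelled signal the
# residual is (numerically) zero.
def lpc_residual(s, a):
    order = len(a)
    e = s[order:].copy()
    for k, ak in enumerate(a, start=1):
        e -= ak * s[order - k:len(s) - k]
    return e

# s[n] = 0.9**n * cos(0.3*n) obeys
# s[n] = 2*0.9*cos(0.3)*s[n-1] - 0.81*s[n-2] exactly.
n = np.arange(100)
s = 0.9**n * np.cos(0.3 * n)
a = np.array([2 * 0.9 * np.cos(0.3), -0.81])
print(np.allclose(lpc_residual(s, a), 0.0))   # True
```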

LPC synthesis
LPC synthesis is used to reconstruct a signal from the residual signal and the transfer function of the vocal tract. Because the vocal tract transfer function is estimated by LPC analysis, it can be combined with the residual/error signal from LPC estimation to reconstruct the original signal. The reconstruction is done by filtering the error signal with the vocal tract transfer function.
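The reconstruction can be sketched by running the residual through the all-pole synthesis filter (a hand-built damped-sinusoid example, assuming NumPy; illustrative only):

```python
import numpy as np

# LPC synthesis: s[n] = e[n] + sum_k a_k * s[n-k], i.e. the residual
# filtered through the all-pole filter 1/A(z).
def lpc_synthesize(residual, a, initial):
    out = list(initial)                    # first `order` samples of the signal
    for e_n in residual:
        s_n = e_n + sum(ak * out[-k] for k, ak in enumerate(a, start=1))
        out.append(s_n)
    return np.array(out)

n = np.arange(100)
s = 0.9**n * np.cos(0.3 * n)               # hand-built test signal
a = [2 * 0.9 * np.cos(0.3), -0.81]         # its exact order-2 coefficients
residual = s[2:] - a[0] * s[1:-1] - a[1] * s[:-2]   # inverse filtering
rebuilt = lpc_synthesize(residual, a, initial=s[:2])
print(np.allclose(rebuilt, s))             # True
```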

Cepstral signal analysis is one of several methods that enable us to find out whether a signal contains periodic elements. The method can also be used to determine the pitch of a signal. The cepstral coefficients describe the periodicity of the spectrum: a peak in the cepstrum denotes that the signal is a linear combination of multiples of the pitch frequency, and the pitch period can be found as the index of the coefficient where the peak occurs.
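A sketch of cepstral pitch detection (assuming NumPy; the signal and search range are illustrative): the real cepstrum is the inverse FFT of the log magnitude spectrum, and a peak at quefrency q means the pitch period is q samples, i.e. the pitch is fs/q.

```python
import numpy as np

fs = 8000
t = np.arange(2048) / fs
f0 = 250.0                                  # pitch period = fs/f0 = 32 samples
# Harmonic signal: 8 harmonics of 250 Hz with 1/k amplitudes.
s = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 9))

log_spec = np.log(np.abs(np.fft.fft(s)) + 1e-12)
cepstrum = np.fft.ifft(log_spec).real

# Search for the cepstral peak in a plausible pitch range (150-400 Hz).
lo, hi = int(fs / 400), int(fs / 150)
q_peak = lo + int(np.argmax(cepstrum[lo:hi]))
print(fs / q_peak)                          # 250.0
```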

Equally spaced spikes in the spectrum show that the signal is periodic. However, it can be difficult to determine the pitch frequency from the spectrum directly, since the individual harmonics are not necessarily all present in it.

## The cepstrum consists of two elements

- an element from the excitation sequence (a pulse train, for voiced speech), located in the higher quefrencies;
- an element originating from the vocal tract impulse response, located in the lower quefrencies.

What is the Nyquist frequency? The Nyquist frequency is half the sampling rate. To prevent frequency components that are not present in the original signal from arising in the digitised signal, besides choosing an appropriate sampling rate (at least twice the highest frequency component in the original signal), the original signal is filtered beforehand in order to get rid of the frequencies that are higher than the Nyquist frequency (half the sampling rate).
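A small sketch of the Nyquist limit and of the frequency an out-of-band tone folds (aliases) to; the helper names are our own:

```python
# The Nyquist frequency is half the sampling rate; components above
# it alias (fold back) into the representable band.
def nyquist(fs):
    return fs / 2

def alias_of(f, fs):
    """Frequency that a pure tone at f appears as after sampling at fs."""
    f = f % fs
    return fs - f if f > fs / 2 else f

print(nyquist(16000))         # 8000.0
print(alias_of(7000, 16000))  # 7000: below Nyquist, unchanged
print(alias_of(9000, 16000))  # 7000: a 9000 Hz tone folds down to 7000 Hz
```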

What is quantization noise? How can it be avoided? Quantization noise is the noise that arises from quantizing the amplitude of a signal during digitisation. It is a consequence of the jumps between discrete amplitude levels. The signal-to-noise ratio can be improved (the quantization noise made relatively quieter) by using more bits for the quantization.
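The effect of the bit depth can be sketched with the standard rule of thumb of roughly 6 dB of signal-to-noise ratio per bit (20·log10(2) ≈ 6.02 dB; a back-of-envelope figure for a full-scale signal, not a measurement):

```python
from math import log10

# Rule of thumb: each quantization bit adds about 6 dB of SNR.
def snr_db(bits):
    return 20 * log10(2) * bits

print(round(snr_db(8)))    # 48  (telephone-style 8-bit)
print(round(snr_db(16)))   # 96  (CD-style 16-bit)
```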

Source-filter model
According to the shape, size, morphological configuration and material of the vocal cavities, these cavities together act as a filter on the source sound. The acoustic filter changes the source's spectrum by letting some frequency components pass and by attenuating other ones (in terms of amplitude, i.e. energy). The so-called source-filter model (first developed by Gunnar Fant) illustrates how the speech signal (as it is spread by the lips) is the combination of the source sound and the filtering effect that the resonance chambers and articulators have on it. In the figure below we see, from left to right, the (estimated) source signal (spectrum and oscillogram), the filtering effect of the vocal resonance cavity, and the resulting speech wave as it is spread by the lips (spectrum and oscillogram).

The source-filter model: from the left, source signal (spectrum and oscillogram) + resonance filter = final signal spread at the lips (spectrum and oscillogram).

While the spectrum of the source signal has a descending slope of 12 dB per octave, the spectrum of the final signal is not that steep. When the signal is radiated at the lips it undergoes an amplitude increase of 6 dB per octave, as higher-frequency tones are more easily radiated than lower-frequency ones. The 12 dB/octave descending spectrum of the source signal is thus combined with the 6 dB/octave increase at radiation from the lips, and the resulting spectrum falls off by 6 dB per octave.

Several factors are involved in the resonance process and influence the resulting sound signal, among them the size and the shape of the cavities, as well as their material. As a matter of fact, we can create all the different vowels by displacing the jaw, the tongue and the lips, thus changing the shape of the oral cavity. While the sound source of vowels is the larynx (the glottal folds) and that of voiceless consonants is produced in the oral cavity (by different articulators), voiced consonants (such as [v] and [z]) are brought about by two source signals: the one produced by the vocal folds' oscillation, on the one hand, and the noisy friction produced in the mouth, on the other.