
Int. J. Bioinformatics Research and Applications, Vol. 11, No. 5, 2015, pp.417–432

Acoustic analysis of speech under stress

Savita Sondhi*
Department of Electrical, Electronics & Communication Engineering,
ITM University,
Gurgaon 122017, Haryana, India
Email: savita_sondhi@yahoo.com
Email: savitasondhi@itmindia.edu
*Corresponding author

Munna Khan
Department of Electrical Engineering,
Jamia Millia Islamia (Central University),
New Delhi 110025, India
Email: mkhan4@jmi.ac.in

Ritu Vijay
Department of Electronics,
Banasthali University,
Banasthali 304022, Rajasthan, India
Email: rituvijay1975@yahoo.co.in

Ashok K. Salhan and Satish Chouhan


Biomedical Instrumentation Division,
Defence Institute of Physiology and Allied
Sciences (DIPAS), DRDO, Timarpur,
New Delhi 110054, India
Email: ashoksalhan@yahoo.com
Email: satish020684@gmail.com

Abstract: When a person is emotionally charged, stress can be discerned in his or her voice. This paper presents a simplified, non-invasive approach to detecting psycho-physiological stress by monitoring the acoustic modifications that occur during a stressful conversation. The voice database consists of audio clips from eight different popular FM broadcasts in which the host of the show vexes subjects who are unaware of the charade; the clips therefore capture real-life stressful conversations rather than simulated emotions. Analysis is performed with the PRAAT software to evaluate the mean fundamental frequency (F0) and the formant frequencies (F1, F2, F3, F4) in both the neutral and the stressed state. The results suggest that F0 increases with stress, whereas the formant frequencies decrease. Comparison of the Fourier and chirp spectra of short vowel segments shows that for relaxed speech the two spectra are similar, whereas for stressed speech they differ in the high-frequency range owing to increased pitch modulation.

Copyright © 2015 Inderscience Enterprises Ltd.



Keywords: voice stress analysis; stressed speech; fundamental frequency; formant frequency; FFT; chirp transform; bioinformatics.

Reference to this paper should be made as follows: Sondhi, S., Khan, M.,
Vijay, R., Salhan, A.K. and Chouhan, S. (2015) ‘Acoustic analysis of speech
under stress’, Int. J. Bioinformatics Research and Applications, Vol. 11, No. 5,
pp.417–432.

Biographical notes: Savita Sondhi is working as an Assistant Professor in the Department of Electrical, Electronics & Communication Engineering at ITM University, Gurgaon, India. She is pursuing her PhD at Banasthali University, Banasthali, Rajasthan, India. Her field of specialisation is signal processing. She has many papers in national and international conferences and journals to her credit.

Munna Khan has been working as a Professor in the Department of Electrical Engineering, JMI Central University, New Delhi. He has worked as Director of Mewat Engineering College (Wakf), Haryana, as Assistant Professor at IIT Guwahati, and as Visiting Professor at Wright State University, Dayton, USA. He received his PhD in Biomedical Engineering from IIT Delhi, India, in 2002. He has published 90 research papers in international/national journals and conferences. He is an Associate Fellow of the Aerospace Medical Association, USA, and a life member of ISAM, BMES and ISTE, India. He is a recipient of the Dogra Endowment Gold Medal Award (2002) at IIT Delhi, India.

Ritu Vijay is Associate Professor and Head of the Electronics Department, Banasthali University, Banasthali, India. She holds a doctorate in Electronics Engineering; her fields of specialisation are digital signal processing and embedded system design. She has many papers in national and international journals and conferences and is guiding many research scholars.

Ashok K. Salhan is a former Additional Director at the Defence Institute of Physiology and Allied Sciences (DIPAS), DRDO, New Delhi, India. He holds an MBBS degree and has extensive knowledge and experience in the field of biomedical instrumentation and analysis. He has many papers in national and international journals to his credit.

Satish Chouhan received his BE in Biomedical Engineering from SGSITS, Indore, India, in 2006. He is currently working with the Defence Institute of Physiology and Allied Sciences (DIPAS), DRDO, New Delhi, India, as a Scientist 'C'. His research interests are in the fields of biomedical instrumentation, high-altitude engineering and cognitive neuroscience.

1 Introduction

Acoustic indications of psychological stress have long been a focus of research in the field of speech synthesis. When a person experiences stress, there are predictable changes in his or her voice parameters. Thus, the psychological state of the subject, i.e. whether the subject is under stress or relaxed, can be inferred by voice stress analysis. Stress in the voice can be detected by carefully monitoring and measuring variations in the acoustic parameters. Previous studies have shown that detailed analysis of human acoustic cues, i.e. fundamental frequency (F0), intensity, duration, vocal tract spectrum
and spectral tilt, can detect emotions such as fear, happiness, anger, anxiety, fatigue, depression and deception. The fundamental frequency is an important feature that characterises the emotional state of the speaker; it is the frequency of vibration (opening and closing) of the vocal folds (vocal cords) per second (Titze, 1989). Formants represent the spectral structure as a function of time. Each formant corresponds to a natural resonance of the vocal tract; when these resonances are reflected in the sound spectrum, the resulting spectral peaks are called formants. They essentially represent the concentration of acoustic energy around a particular frequency in the voice, and they occur at roughly 1000 Hz intervals, i.e. there is one in each 1000 Hz band. Research to date suggests that under the influence of any psycho-physiological stress the fundamental frequency of the human voice deviates from its baseline. Thus, the mean fundamental frequency (also referred to as the mean pitch) and the formant structure of speech are considered to be reliable indicators of stress (Tolkmitt and Scherer, 1986; Protopapas and Lieberman, 1995; Scherer et al., 2002; Sigmund, 2007; Ling et al., 2011; Salhan et al., 2012; Sondhi et al., 2012). Scherer et al. (2002) reported that highly stressed subjects also show an increase in the mean value of amplitude and a decrease in the duration of spoken words.
Streeter et al. (1983) analysed tape recordings of the telephone conversation between a system operator (SO) and his immediate superior (CSO) during the hour prior to the 1977 New York blackout. With increasing situational stress, the CSO's vocal pitch increased, whereas the pitch of the SO (who was responsible for monitoring and controlling the network) decreased over the hour of conversation; the analysis therefore generated correct results for the CSO but not for the SO. Ruiz et al. (1990) proposed that when the speaker is under psychological stress there is a shift in the fundamental frequency (F0) of the voice. Protopapas and Lieberman (1995) investigated the effect of variations in F0 on perceived emotional stress. For this study, the authors compared the voice of a male helicopter pilot during routine communication with the control tower with voice samples obtained shortly after the pilot lost control of the helicopter; the mean and maximum F0 were reported to correlate highly with stress. Ruiz et al. (1996) analysed acoustic features such as pitch and spectral data of speech in two different stressful situations: a laboratory-conducted Stroop test and the cockpit voice recording of a crashed aeroplane (a real-case corpus).
Variation of acoustic features under the influence of increasing stress was monitored.
Distortions due to background noise and variability due to environmentally induced stress degrade the efficiency of speech recognition algorithms operating various equipment and gadgets. Hansen (1996) addressed this issue by analysing and modelling the speech characteristics associated with workload/task stress, the emotional state of the speaker, or speech produced in a noisy environment (the Lombard effect), and proposed three approaches: signal enhancement, stress equalisation and a robust speech recognition algorithm based on source generator theory. Burkhardt and Sendlmeier (2000) developed a speech synthesiser for emotional speech and proposed that pitch irregularity is significant for distinguishing between boredom and sadness.
Sigmund (2007) analysed stress in phonetically rich sentences from an exam stress corpus, detecting stress from short segments of vowels by comparing Fourier and chirp transforms. It was reported that speech signals carry information about the physiological state of the speaker and that stress in speech could be detected by comparing the results of the Fourier and chirp transforms. Scherer et al. (2008) proposed an approach to detect the amount of stress in the speech signal in close to real time using a recurrent neural network. Subjects took a verbal quiz while playing an air-controller simulation designed to induce gradually more stress by becoming more difficult to control; the automatic recogniser was found to outdo the human baseline. Frampton et al. (2010) applied binary logistic
regression to analyse automatically extracted acoustic features from speech generated
under stress with 76.24% accuracy. The database consisted of task-based human–human
conversations in which a time limit is unexpectedly introduced halfway through. Ling et al. (2011) compared Empirical Mode Decomposition (EMD) of speech with the average spectral energy in speech spectrograms to classify stress and emotions in natural speech. Natural speech from two different databases, namely the SUSAS data (annotated with three different levels of stress) and the Oregon Research Institute (ORI) data (annotated with five different levels of emotion: neutral, anxious, angry, dysphoric and happy), was used for the experiment. It was reported that speech spectrograms were better suited for stress recognition, whereas emotion recognition needs further improvement. Bageshree et al. (2012) analysed 800 sentences with seven emotions from ten actors and ten actresses of the TU Berlin database, and reported that pitch and formant frequencies were capable of classifying three different emotional states of a person, namely happy, sad and neutral. Johnson et al. (1979) investigated stress responses to visual stimuli of potentially stressing situations and reported that such stimuli were capable of eliciting a stress response. Mohanty and Jena (2011) analysed stressed speech samples of final-year students appearing for an exam and observed that stress had significant effects on vowel duration rather than on consonant duration. Sigmund (2012) analysed
the short-time spectra of three cardinal Czech vowels /a/, /i/ and /u/ in both neutral and stressed speech. Formants, anti-formants and inflexion points of the spectral envelope were analysed, and it was concluded that a speaker-dependent stressed-speech estimator would be more appropriate, as speakers display individual trends rather than a uniform pattern. Sigmund (2013) investigated the effect of exam stress on the distribution of F0 and of multiple frames having constant F0 (i.e. twins and triples), and reported it to be the best feature for speaker-independent stress detection. Mongia and Sharma (2014)
performed speech signal analysis for stress detection using the Surrey Audio-Visual Expressed Emotion (SAVEE) database. A total of 51 acoustic parameters (pitch, intensity, jitter, shimmer, autocorrelation, energy entropy (EE), short-time energy (STE), zero crossing rate (ZCR), spectral roll-off (SR), spectral centroid (SC), spectral flux (SF), formant frequencies (F1, F2 and F3), formant amplitudes (A1, A2 and A3), and formant bandwidths (B1, B2 and B3)) were evaluated and compared for positive stress, negative stress and the neutral state. Acoustic features such as pitch, autocorrelation, HNR, formant frequencies (F1, F2, F3), formant amplitudes (A1, A2, A3) and formant bandwidths (B1, B2, B3) were concluded to be potential parameters for stress detection among all speakers.

1.1 Related work


Previous studies have examined stress in voice based on laboratory induced stress
(simulated situations), namely subjects trying to deceive the investigator (Streeter et al.,
1977; Pollina et al., 1998; Patil et al., 2013), stress due to work overload (Scherer et al.,
2002), deprived sleep, fatigue (Bagnall, 2007), stressful visual stimuli (Johnson et al.,
1979), unpleasant slide shows, pictures of humans suffering from skin diseases or severe accident injuries (Tolkmitt and Scherer, 1986), the voice of a pilot just prior to a fatal aircraft
crash or the classic Hindenburg radio announcement (Williams and Stevens, 1972;
Williams and Stevens, 1969), recordings of pilot connecting with the base station prior to
the loss of control of the helicopter and shortly thereafter (Protopapas and Lieberman,
1995), phonetically rich sentences from the exam stress corpus (Sigmund, 2007),
answering a verbal quiz while simultaneously playing a simulated air-controller task designed to gradually become more complex (Scherer et al., 2008), a database consisting of voiced segments
of five vowels (Sigmund et al., 2008), database consisting of task-oriented dialog
(in a limited time frame) between subjects (Frampton et al., 2010), investigations based
on the (SUSAS) database (Hansen and Bou-Ghazale, 1997; Hansen and Womack, 1996;
Casale et al., 2007) and SAVEE database (Mongia and Sharma, 2014), EMOVO corpus:
an Italian emotional speech database (Costantini et al., 2014; Mencattini et al., 2014) and
the TU Berlin database, containing 800 emotional sentences uttered by actors and
actresses (Bageshree et al., 2012). Unlike previous studies, emphasis here is to analyse
voice of subjects under real stressful situation. The database for the present study consists
of episodes from popular FM broadcast (real stressful situation) wherein the host of the
show intentionally vexes the subjects, who are otherwise completely unaware of the
hoax. Moreover, ‘long sound files’ (raw audio) are analysed to monitor the pitch
variation of the subject in the whole utterances. Spectra of short vowel segments are also
analysed to detect the presence of stress in voice. Following this introduction, Section 2
describes the materials and method used for the study. Results of the voice analysis are
given in the form of table and graphs in Section 3. Finally, concluding remarks are
presented in Section 4.

2 Materials and methods

Subjects: Eight different episodes of a popular FM radio broadcast were obtained from Radio Mirchi 98.3 FM and analysed to detect the psychological state of the subjects. In each of these episodes the host of the show gathers personal information about ordinary people, either from some of their friends/relatives or through data uploaded by these people on certain websites. The host then calls them on their telephone and talks to them on the basis of the gathered information. The conversation begins in a relaxed tone, but gradually the host starts vexing the called person, who in turn gets irritated and stressed, which is reflected in his/her voice. (Henceforth, the term 'called party' is used in this paper to refer to the person who receives the telephone call from the host.) Changes in the psychological state (as reflected in the voices of the host and the called party) during the conversation are observed. Thus, the database contains 17 audio files, i.e. separate audio files of the host and the called parties of each of the eight episodes. In one episode there were three subjects; the results of that episode are presented in this paper. In this episode the host of the show pretends to be a relative of the wife, calls up her husband and enquires about the cause of the daily domestic unrest. The husband starts explaining his point and, on learning that his wife is also connected on the line and has been listening to everything, changes his tone and tries to mellow down. Thus, all three parts of the husband's conversation are extracted and analysed independently: first when he is talking to his wife's relative (the host of the show), second when he talks to his wife (after learning that she has heard everything), and third when he again talks to the host. In the present study, the audio of each subject is thus extracted for both the neutral (relaxed) conversation and the stressed conversation.
Materials: The materials used to perform this study were the FM audio clips and an Acer laptop. Goldwave v5.58 (2010), a freeware tool, was used to extract the voice of each subject from the FM audio clips, and Audacity 2.0.3 (Mazzoni, 2013) was used for normalising the volume and converting the clips to .wav format. Acoustic analysis was performed using the speech analysis software PRAAT 5.3.56 (Boersma and Weenink, 2010), and MATLAB 7.10.0.499 (R2010a) was used for plotting the Fourier and chirp spectra.
The recordings at the FM station were made using a dynamic cardioid RE 27 N/D microphone, a Telos 2×12, an Audio Arts Console D 70 and Vegas 7.0 (software for recording the listener's voice) in a sound-proof recording studio.
Data analysis: The episode discussed in this paper concerns the conversation between subject 1 (the host of the show), subject 2 (the husband) and subject 3 (the wife) about some domestic issues between subjects 2 and 3. The first task was to extract the voice of each individual subject using the Goldwave software. The episodes of the FM broadcast were in .mp3 format; the sound clips were normalised using Audacity and then exported to .wav format. The original sound file in .mp3 format had the following parameters: MPEG Audio Layer-3, 32,000 Hz, 96 kbps, stereo. The new sound file in .wav format had the parameters: WAVE PCM signed 16 bit, 32,000 Hz, 1024 kbps, stereo. The length of the audio clip of subject 2 was 122 seconds, wherein he first talks with subject 1 for 62.5 seconds, then with subject 3 for 57.6 seconds, and then again with subject 1 for 9.6 seconds. Each of the audio clips was analysed using a long-term pitch analysis PRAAT script based on the autocorrelation method to obtain the mean fundamental frequency (F0). The first four formant frequencies (F1, F2, F3 and F4) for each subject were also calculated using the PRAAT software.
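The pitch analysis itself was carried out with a PRAAT autocorrelation script. Purely as an illustrative sketch of the underlying idea, the following Python/NumPy function estimates a mean F0 from a mono .wav clip using frame-wise autocorrelation; the frame length, hop size, pitch search range, voicing threshold and file name are assumptions for demonstration, not the PRAAT settings used in this study.

```python
import numpy as np
from scipy.io import wavfile

def mean_f0_autocorr(path, frame_ms=40, hop_ms=10, f0_min=75, f0_max=500):
    """Rough mean-F0 estimate via frame-wise autocorrelation (illustrative only)."""
    fs, x = wavfile.read(path)                 # e.g. the 32,000 Hz .wav clips
    x = x.astype(float)
    if x.ndim > 1:                             # mix stereo down to mono
        x = x.mean(axis=1)

    n = int(fs * frame_ms / 1000)              # samples per analysis frame
    hop = int(fs * hop_ms / 1000)              # hop between successive frames
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    f0_values = []

    for start in range(0, len(x) - n, hop):
        frame = x[start:start + n]
        frame = (frame - frame.mean()) * np.hanning(n)
        ac = np.correlate(frame, frame, mode='full')[n - 1:]   # lags 0 .. n-1
        if ac[0] <= 0:                         # silent frame, skip
            continue
        ac = ac / ac[0]                        # normalise so lag 0 equals 1
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        if ac[lag] > 0.3:                      # crude voicing threshold (assumption)
            f0_values.append(fs / lag)

    return float(np.mean(f0_values)) if f0_values else float('nan')

# Example call on one extracted clip (hypothetical file name):
# print(mean_f0_autocorr("subject2_neutral.wav"))
```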

3 Results

3.1 Pitch-based analysis


When subject 2 receives the call from subject 1, i.e. the host of the show, he talks normally, as he would answer his regular calls. But as subject 1 introduces himself as the uncle of subject 2's wife and begins to enquire about the cause of the daily domestic tension, subject 2 (the husband) gets emotionally worked up as he explains everything to subject 1. Here, subject 2 sounds aggressive because he believes he is talking to his 'in-laws'. This aggression is clearly visible in the pitch contour of Figure 1(a). Once subject 3 (the wife) comes into the picture, subject 2 is shocked and sounds defensive, as reflected in the pitch contour given in Figure 1(b). In the third part, he tries to save face; as he senses that subject 3 is not excessively angry and that he may be forgiven, some confidence returns to his voice, which can be clearly seen in the pitch contour shown in Figure 1(c). Thus, the three pitch contours of subject 2 obtained from different segments of the conversation are distinctly different from one another. Figures 1(d) and 1(e) display the pitch contours of subject 1 (the host) while he is talking to subject 2 (the husband) and subject 3 (the wife), respectively. Figures 1(f) and 1(g) are the pitch contours of the conversation of subject 3 with subject 2 and subject 1 (the host), respectively. As subjects 1 and 3 were part of the charade and were not under any stress, they remained in a relaxed state, which is well reflected in their respective contours. Thus, the emotions are closely related to F0, as is clearly visible in the pitch contours given in Figure 1. The averages of the mean fundamental frequency (F0) and first four formants (F1, F2, F3 and F4) of the called parties of all eight episodes during the relaxed, angry and embarrassed states are given in Table 1, and those of the host of the show are given in Table 2. A significant upshift in F0 for the stress related to anger or irritation and a subsequent downshift in F0 for embarrassment were observed.
Acoustic analysis of speech 423

Table 1  Mean fundamental frequency (F0) and formants F1 to F4 of the called party

             Neutral      Angry        Embarrassed
F0 (Hz)      208.97       271.13       188.65
F1 (Hz)      647.88       611.99       608.41
F2 (Hz)      1620.70      1599.52      1585.08
F3 (Hz)      2462.47      2438.56      2519.74
F4 (Hz)      3250.92      3230.35      3336.15

Table 2  Mean fundamental frequency (F0) and formants F1 to F4 of the host of the show

             Neutral conversation     During charade
F0 (Hz)      119.71                   131.64
F1 (Hz)      595.65                   587.20
F2 (Hz)      1790.94                  1762.91
F3 (Hz)      2792.94                  2751.40
F4 (Hz)      3852.17                  3868.16
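The relative shifts in mean F0 discussed in Section 4 follow directly from these tabulated means; they are recomputed here for reference. For the called party (Table 1), the shift from the neutral to the angry state is (271.13 − 208.97)/208.97 ≈ +29.7%, and from the neutral to the embarrassed state (188.65 − 208.97)/208.97 ≈ −9.7%. For the host (Table 2), the shift from the neutral conversation to the charade is (131.64 − 119.71)/119.71 ≈ +10.0%.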

Figure 1 (a) Pitch contour of subject 2 (male) while talking to subject 1 (male); (b) pitch contour
of subject 2 (male) while talking to subject 3 (female); (c) pitch contour of subject 2
(male) while talking to subject 1 (male) again; (d) pitch contour of subject 1 (male)
while talking to subject 2 (male); (e) pitch contour of subject 1 (male) while talking to
subject 3 (female); (f) pitch contour of subject 3 (female) while talking to subject 2
(male); (g) pitch contour of subject 3 (female) while talking to subject 1 (male)
(see online version for colours)


3.2 Comparing two emotions, 'relaxed' and 'stressed', using FFT, chirp transform and spectrogram

Because the pitch of the speech changes rapidly, the harmonic structure of the speech gets distorted, mainly at high frequencies. Therefore, to enhance the spectral structure and to obtain fine harmonics, the chirp spectrum was plotted.
The short-time chirp transform is defined as

$$S(m,k) = \sum_{n=0}^{N-1} x[n + mM]\, w[n]\, e^{-2\pi j\, k \left(1 + \hat{\alpha}_m \frac{n}{N}\right) \frac{n}{N}} \qquad (1)$$

where $x[n]$ is the discrete signal, $w$ is the analysis window of length $N$, $M$ is the time step and $k$ is the discrete frequency. The value of $k$ ranges over

$$k \in \left\{ -K(\hat{\alpha}_m), \ldots, 0, \ldots, K(\hat{\alpha}_m) \right\}, \qquad K(\hat{\alpha}_m) = \frac{N/2}{1 + \hat{\alpha}_m N}.$$

The frequency variation rate $\hat{\alpha}_m$ for each segment $m$ with mean frequency $f_0$ is

$$\hat{\alpha}_m = \frac{\Delta f_0(m)}{M f_0(m)} = \frac{2\,(f_n - f_{n-1})}{M\,(f_n + f_{n-1})} \qquad (2)$$

Correct estimation of the frequency variation rate $\hat{\alpha}_m$ provides the time-frequency resolution of the short-time chirp transform. Here, $f_n$ and $f_{n-1}$ are the values of the mean frequency $f_0$ obtained from the present and previous segments. The Fourier transform is thus a special case of the chirp transform with $\hat{\alpha}_m = 0$; the chirp transform covers the whole time-frequency space.
A comparison between the FFT and chirp spectra is shown in Figure 2(a–b). These spectra are computed from the phoneme /a/ extracted from the same word spoken by subject 2 (the husband) in the relaxed state and in the stressed state. Both the relaxed and stressed speech frames are 30 ms long.
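As a minimal sketch of equation (1), the following NumPy code evaluates the magnitude spectrum of a single windowed frame, once with the chirp rate set to zero (the ordinary Fourier case) and once with a chirp rate matched to a gliding pitch. The synthetic test frame, sampling rate, window and chirp-rate value are assumptions chosen for illustration, not the actual data or settings used in this study.

```python
import numpy as np

def chirp_spectrum(frame, alpha_hat):
    """Magnitude spectrum of one windowed frame following equation (1);
    alpha_hat = 0 reduces to the ordinary DFT of the windowed frame."""
    N = len(frame)
    n = np.arange(N)
    w = np.hanning(N)
    spectrum = np.empty(N // 2, dtype=complex)
    for k in range(N // 2):                       # non-negative bins only
        phase = -2j * np.pi * k * (1.0 + alpha_hat * n / N) * n / N
        spectrum[k] = np.sum(frame * w * np.exp(phase))
    return np.abs(spectrum)

# Synthetic 30 ms "vowel" frame at 32 kHz whose pitch glides from 180 Hz to
# 230 Hz, mimicking the rapid pitch change of stressed speech (assumed values).
fs, dur = 32000, 0.030
t = np.arange(int(fs * dur)) / fs
f0_start, f0_end = 180.0, 230.0
phi = 2 * np.pi * (f0_start * t + 0.5 * (f0_end - f0_start) / dur * t ** 2)
frame = np.sin(phi) + 0.4 * np.sin(2 * phi) + 0.2 * np.sin(3 * phi)

# Chirp rate chosen so the chirped basis tracks this glide (illustrative choice).
alpha_hat = (f0_end - f0_start) / (2.0 * f0_start)

fourier_like = chirp_spectrum(frame, 0.0)     # alpha = 0: plain Fourier case
chirped = chirp_spectrum(frame, alpha_hat)    # chirp-adapted case
# For this gliding-pitch frame the higher harmonics are noticeably sharper in
# `chirped` than in `fourier_like`; for a steady pitch the two nearly coincide.
```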

Figure 2 (a) FFT and chirp spectra of phoneme /a/ obtained from relaxed (neutral) speech;
(b) FFT and chirp spectra of phoneme /a/ obtained from stressed speech (see online
version for colours)


Figure 3 (a) Spectrogram of neutral speech; (b) spectrogram of stressed speech
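A spectrogram comparable to Figure 3 can be generated from an extracted clip with standard tooling. The sketch below uses SciPy and Matplotlib with an assumed 25 ms analysis window and half-overlap; these settings and the file names are illustrative placeholders, not the exact configuration used to produce Figure 3.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

def plot_spectrogram(path, title, nperseg_ms=25):
    """Log-magnitude spectrogram of a mono .wav clip (illustrative settings)."""
    fs, x = wavfile.read(path)
    x = x.astype(float)
    if x.ndim > 1:
        x = x.mean(axis=1)                       # mix stereo down to mono
    nperseg = int(fs * nperseg_ms / 1000)
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading='auto')
    plt.ylabel('Frequency (Hz)')
    plt.xlabel('Time (s)')
    plt.title(title)
    plt.colorbar(label='Power (dB)')

# Hypothetical file names for the neutral and stressed segments of subject 2:
# plot_spectrogram("subject2_neutral.wav", "Neutral speech"); plt.show()
# plot_spectrogram("subject2_stressed.wav", "Stressed speech"); plt.show()
```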


4 Discussion

Voice stress analysis was performed to determine whether a subject is relaxed (neutral), stressed or nervous during a conversation. Eight different FM episodes of different subjects engaged in real-life stressful discussions were analysed using the PRAAT software. The mean fundamental frequency of the voice (F0) and the formant frequencies F1, F2, F3 and F4 were evaluated in both the neutral and the stressed state. The averages of the mean fundamental frequency (F0) and of the formant frequencies F1, F2, F3 and F4 of the called party across all eight episodes are listed in Table 1. It was observed that the fundamental frequency (F0) of the called party increased by about 30% in the state of anger as compared to the neutral state (271.13 Hz vs. 208.97 Hz, Table 1). On learning about the prank, the called party felt embarrassed or guilty, which resulted in a downshift in F0. The formant frequencies, on the other hand, decreased in the state of anger, and the decrease in F1 and F2 was greater than that in F3 and F4 in most of the cases (Table 1). The host of
the show was leading the conversation and was aware of the charade, so there was no
significant change in his mean fundamental frequency (F0) or any of the formant
frequencies during the entire conversation, as shown in Table 2. However, in one episode a 57% increase in the F0 of the host was noted. This episode was about a conversation between the host and a common man in which the man objects to the public demand to hang men who commit heinous crimes against women. The conversation begins at a normal pace, with the man expressing his view that men who commit crimes against women should be forgiven, but the moment he learns that the host has made a derogatory statement about his own daughter he becomes violent and offensive. This was evident from the sudden increase in the man's F0 value. However, as the conversation became offensive, the percentage increase in the F0 of the host was significantly higher than that of the man, probably because the host rebuked the man to make him realise the inconsistency of his attitude towards other girls and towards his own daughter for the same offence. Thus, this was the only episode in which the percentage increase in the F0 of the host reached 57%, as he became agitated. On realising that he was being broadcast live, the man felt ashamed and embarrassed, which resulted in a downshift in the value of F0. Thus, it was clearly noted that anger resulted in an upshift of F0, whereas embarrassment or guilt resulted in a downshift of F0. The obtained results confirm that the fundamental frequency (F0) and the formants F1 and F2 are reliable indicators of vocal stress, and agree with the findings of Demenko and Jastrzebska (2012), Protopapas and Lieberman (1997), Sigmund (2012, 2013), Sigmund et al. (2008) and Mohanty and Jena (2011). Pitch contours of each
subject were also plotted. These contours were distinct and correlate strongly with the state of mind of each subject over the entire span of the conversation (Figure 1(a–g)). The FFT and chirp transforms confirmed that for relaxed speech the envelopes of the two spectra were similar (Figure 2(a)), whereas for stressed speech the two spectra differed in the range of higher frequencies (Figure 2(b)). The FFT and chirp spectra obtained under stress correlate strongly with the findings of Sigmund and Brabec (2009) and Mohanty and Jena (2011). The spectrograms of relaxed and stressed speech were also distinct (Figure 3(a–b)). Thus, statistical analysis of the audio clips and the changes in their spectra successfully distinguishes stressed speech from relaxed speech. This study proposes a non-invasive technique to detect speech under stress: no electrodes or wires are connected to the body of the subject, since such accessories add to the inconvenience. All that is needed is a laptop with the required software installed, which keeps the set-up simple.

5 Conclusion

1 Mean fundamental frequency increases under stress.


2 Formants F1 and F2 decrease under stress; formants F3 and F4 remain largely unchanged.
3 Spectrograms of relaxed and stressed speech are different.
4 Fourier and chirp spectra of short vowel segments show that for relaxed speech the
two spectra are similar; however, for stressed speech they differ in the high
frequency range due to increased pitch modulation.
In voice-controlled applications that are sensitive to stress, continuous monitoring of the psychological and physiological state of the operator is necessary to prevent failures resulting in human and material loss. Thus, our next goal is to design a framework to automatically detect stress in the voice of the operator. Detecting stress and fatigue in the voice and comparing them with a baseline can also be helpful in many applications requiring voice-activated control of equipment, in law and administration, in employment screening and in the detection of deception.

Acknowledgements

The authors are thankful to Sachin Tagra, Senior Vice President, Radio Mirchi 98.3, for providing the audio clips of the popular FM broadcast for offline voice stress analysis.
430 S. Sondhi et al.

References
Bageshree, V., Pathak, S. and Panat, A.R. (2012) ‘Extraction of pitch and formants and its analysis
to identify 3 different emotional states of a person’, International Journal of Computer
Science Issues, Vol. 9, No. 4, pp.296–299.
Bagnall, A.D. (2007) The Impact of Fatigue on the Voice, PhD Thesis, University of
South Australia, Adelaide, South Australia.
Boersma, P. and Weenink, D. (2010) Praat: doing phonetics by computer (version 5.3.56).
Available online at: http://www.praat.org/ [computer program].
Burkhardt, F. and Sendlmeier, W.F. (2000) ‘Verification of acoustical correlates of emotional
speech using formant synthesis’, Proceedings of the ISCA Workshop on Speech and Emotion,
ISCA, Belfast, Ireland, pp.151–156.
Casale, S., Russo, A. and Serrano, S. (2007) ‘Multistyle classification of speech under stress using
feature subset selection based on genetic algorithms’, Speech Communication, Vol. 49,
pp.801–810.
Costantini, G., Iaderola, I., Paoloni, A. and Todisco, M. (2014) ‘EMOVO corpus: an Italian
emotional speech database’, Proceedings of the Ninth International Conference on Language
Resources and Evaluation (LREC’14), European Language Resources Association (ELRA),
Paris, pp.3501–3504.
Demenko, G. and Jastrzebska, M. (2012) ‘Analysis of natural speech under stress’, Acta Physica
Polonica-Series: A General Physic, Vol. 121, No. 1, pp.A92–A95.
Frampton, M., Sripada, S., Augusto, R., Bion, H. and Peters, S. (2010) ‘Detection of time-pressure
induced stress in speech via acoustic indicators’, Proceedings of 11th Annual Meeting of the
Special Interest Group on Discourse and Dialogue, SIGDIAL 2010, 24–25 September, The
University of Tokyo, pp.253–256.
Hansen, J.H.L. (1996) ‘Analysis and compensation of speech under stress and noise for
environmental robustness in speech recognition’, Speech Communication, Vol. 20, Nos. 1–2,
pp.151–173.
Hansen, J.H.L. and Bou-Ghazale, S. (1997) ‘Getting started with SUSAS: a speech under simulated
and actual stress database’, Proceedings of Fifth European Conference on Speech
Communication and Technology, 22–25 September, EUROSPEECH, Rhodes, Greece.
Hansen, J.H.L. and Womack, B.D. (1996) ‘Feature analysis and neural network-based
classification of speech under stress’, IEEE Transactions on Speech and Audio Processing,
Vol. 4, No. 4, pp.307–313.
Johnson, J.B., Pinkham, J.R. and Kerber, P.E. (1979) ‘Stress reactions of various judging groups to
the child dental patient’, Journal of Dental Research, Vol. 58, No. 7, pp.1664–1671.
Ling, H., Margaret, L., Namunu, C., Maddage, N. and Allen, B. (2011) ‘Study of empirical mode
decomposition and spectral analysis for stress and emotion classification in natural speech’,
Biomedical Signal Processing and Control, Vol. 6, pp.139–146.
Mazzoni, D. (2013) Audacity(R): A Free, Cross-Platform Digital Audio Editor (version 2.0.3).
Available online at: http://audacity.sourceforge.net/
Mencattini, A., Martinelli, E., Costantini, G., Todisco, M., Basile, B., Bozzali, M. and Di Natale,
C. (2014) ‘Speech emotion recognition using amplitude modulation parameters and a
combined feature selection procedure’, Knowledge-Based Systems, Vol. 63, pp.68–81.
Mohanty, M.N. and Jena, B. (2011) ‘Analysis of stressed human speech’, International Journal of
Computational Vision and Robotics, Vol. 2, No. 2, pp.180–187.
Mongia, P.K. and Sharma, R.K. (2014) ‘Estimation and statistical analysis of human voice
parameters to investigate the influence of psychological stress and to determine the vocal tract
transfer function of an individual’, Journal of Computer Networks and Communications,
Vol. 2014, Article ID 290147.
Acoustic analysis of speech 431

Patil, V.P., Nayak, K.K. and Saxena, M. (2013) ‘Voice stress detection’, International Journal of
Electrical, Electronics and Computer Engineering, Vol. 2, No. 2, pp.148–154.
Pollina, D.A., Vakoch, D.A. and Wurm, L.H. (1998) ‘Formant structure of vowels spoken during
attempted deception’, Polygraph, Vol. 27, No. 2, pp.96–107.
Protopapas, A. and Lieberman, P. (1995) 'Effects of vocal F0 manipulations on perceived
emotional stress', Proceedings of the ESCA-NATO Tutorial and Research Workshop on Speech
under Stress, Lisbon, Portugal, pp.1–4.
Protopapas, A. and Lieberman, P. (1997) ‘Fundamental frequency of phonation and
perceived emotional stress’, Journal of the Acoustical Society of America, Vol. 101, No. 4,
pp.2267–2277.
Ruiz, R., Absil, E., Harmegnies, B. and Legros, C. (1996) 'Time- and spectrum-related variabilities
in stressed speech under laboratory and real conditions’, Speech Communication, Vol. 20,
Nos. 1–2, pp.111–129.
Ruiz, R., Legros, C. and Guell, A. (1990) ‘Voice analysis to predict the psychological or physical
state of a speaker’, Aviation Space and Environmental Medicine, Vol. 61, No. 3, pp.266–271.
Salhan, A., Khan, M., Sondhi, S. and Vijay, R. (2012) ‘Online offline voice stress analyzer’,
Aviation Space and Environmental Medicine, Vol. 83, No. 3, p.309.
Scherer, K., Grandjean, D., Johnstone, T., Klasmeyer, G., Bänziger, T., Hansen, J.H. and Pellom, B.
(2002) 'Acoustic correlates of task load and stress', Proceedings of ICSLP 2002 - Interspeech 2002,
7th International Conference on Spoken Language Processing, ISCA Archive, Denver, USA,
pp.2017–2020.
Scherer, S., Hofmann, H., Lampmann, M., Pfeil, M., Rhinow, S., Schwenker, F. and Palm, G.
(2008) ‘Emotion recognition from speech: stress experiment’, Proceedings of International
Conference on Language Resources and Evaluation LREC-08, Marrakech, Morocco,
pp.1325–1330.
Sigmund, M. (2007) 'Spectral analysis of speech under stress', IJCSNS International Journal of
Computer Science and Network Security, Vol. 7, No. 4, pp.170–172.
Sigmund, M. (2012) ‘Influence of psychological stress on formant structure of vowels’,
Elektronika ir Elektrotechnika, Vol. 18, No. 10, pp.45–48.
Sigmund, M. (2013) ‘Statistical analysis of fundamental frequency based features in speech under
stress’, Information Technology and Control, Vol. 42, No. 3, pp.286–291.
Sigmund, M. and Brabec, Z. (2009) ‘Analysis of speech under stress based on short-time
spectrum of vowel phonemes’, Proceedings of IASTED International Multi-Conference on
Applied Informatics, Part Artificial Intelligence and Applications, IASTED, Innsbruck,
pp.77–81.
Sigmund, M., Prokes, A. and Brabec, Z. (2008) ‘Statistical analysis of glottal pulses in speech
under psychological stress’, Proceedings of the 16th European Signal Processing Conference
(EUSIPCO 2008), 25–29 August, Lausanne, Switzerland.
Sondhi, S., Khan, M., Vijay, R., Salhan, A. and Vashisth, S. (2012) ‘Real time speech analysis for
detection of stress using autocorrelation function’, Proceeding of 11th International
Conference on Information Technology and Telecommunication, 29–30 March, Cork Institute
of Technology, Cork, Ireland, pp.38–44.
Streeter, L.A., Krauss, R.M., Geller, V., Olson, C. and Apple, W. (1977) ‘Pitch changes
during attempted deception’, Journal of Personality and Social Psychology, Vol. 35, No. 5,
pp.345–350.
Streeter, L.A., Macdonald, N.H., Apple, W., Krauss, R.M. and Galottin, K.M. (1983) ‘Acoustic
and perceptual indicators of emotional stress’, Journal of Acoustical Society of America,
Vol. 73, No. 4, pp.1354–1360.
432 S. Sondhi et al.

Titze, I.R. (1989) ‘Physiologic and acoustic differences between male and female voices’, Journal
of the Acoustical Society of America, Vol. 85, No. 4, pp.1699–1707.
Tolkmitt, F.J. and Scherer, K.R. (1986) ‘Effect of experimentally induced stress on vocal
parameters’, Journal of Experimental Psychology, Vol. 12, No. 3, pp.302–313.
Williams, C.E. and Stevens, K.N. (1969) 'On determining the emotional state of pilots during
flight: an exploratory study’, Aerospace Medicine, Vol. 40, pp.1369–1372.
Williams, C.E. and Stevens, K.N. (1972) ‘Emotions and speech: some acoustical correlates’,
Journal of the Acoustical Society of America, Vol. 52, pp.1238–1250.
