…impediment in mental health diagnosis and treatment. Machine learning approaches based on speech samples could help in this direction. In this work, we develop machine learning solutions to diagnose anxiety disorders from audio journals of patients. We work on a novel anxiety dataset (provided through collaboration with Kintsugi Mindful Wellness Inc.) and experiment with several models of varying complexity utilizing audio, text, and a combination of multiple modalities. We show that the multi-modal and audio-embedding based approaches achieve good performance on the task, achieving an AUC ROC score of 0.68-0.69.

1 Introduction
Mental and physical health are equally important components of overall health, but there are many barriers to accessing mental health assessments, including cost and stigma. Even when individuals receive professional care, assessments are intermittent and may be limited, partly due to the episodic nature of psychiatric symptoms. Therefore, machine-learning technology using speech samples obtained in the clinic, remotely over the telephone, or through recordings on personal mobile devices could help improve diagnosis and treatment.

In this work, we study the problem of detecting anxiety disorders through speech. Since this is a relatively under-studied problem, the dataset is quite small, and hence we explore different approaches to deal with the problem, including hand-crafted features motivated by psychological studies on anxiety and pretrained embedding models. The dataset provided by Kintsugi offers an interesting opportunity in this area, as the recordings are self-recorded in a comforting environment as opposed to a clinical setting. The dataset used for training our models, along with the task description, is presented in Section 3. In Section 4, we provide a detailed explanation of our five models implemented for the task. Section 5 outlines the evaluation routines and presents the study results. Finally, we conclude our findings in Section 6.
2 Related Works

Low et al. (2020) provide a systematic review of studies using speech for automated assessment across a broad range of psychiatric disorders and summarize the different datasets, features, and methods that have been employed in this area. The authors also summarize the key acoustic features used in these models across disorders. Many studies (Horwitz et al., 2013; Quatieri and Malyska, 2012; Cummins et al., 2015; Singh et al., 2019) have observed that depression is linked with a decrease in f0 and f0 range, and that jitter and shimmer increase with depression severity. While jitter and shimmer are also correlated with anxiety (Özseven et al., 2018; Silber-Varod et al., 2016), many studies (Gilboa-Schechtman et al., 2014; Özseven et al., 2018; Weeks et al., 2012; Galili et al., 2013) have found a significant increase in mean f0 in patients with an anxiety disorder. Most studies surveyed by the authors employed simple machine learning models like SVM and logistic regression, and very few used more complex networks with transfer from speech recognition tasks (Singh et al., 2018). We plan to utilise these features, which have been well studied in both the psychology and the machine learning literature on anxiety and depression assessment, and compare them against deep learning models in terms of both performance and generalizability.

State-of-the-art works on detecting mental disorders or emotional states (e.g., anxious vs. calm) from audio data use supervised learning approaches, which must be "trained" from examples of the sound to be detected (Kumari and Singh, 2017). In general, learning such classifiers requires "strongly labeled" annotated data, where the segments of audio containing the desired vocal event and the segments not containing that event are clearly indicated. However, obtaining strongly labelled data is difficult, and in our project too we only have weakly supervised data. Salekin et al. (Salekin et al., 2018; Sebastian et al., 2019) present a weakly-supervised framework for detecting symptomatic individuals from their voice samples. The idea is to collect long speech audio samples from individuals already diagnosed with, or high in symptoms of, specific mental disorders, in situations that may heighten expression of the symptoms of the respective disorders. This type of data is considered "weakly labeled", meaning that although it indicates the presence or absence of disorder symptoms, it does not provide additional details such as the precise times in the recording that indicate the disorder.

For handling weakly labelled data, they propose a new feature modelling technique called "NN2Vec" to generate low-dimensional, continuous, and meaningful representations of speech from long weakly labelled audio data, since conventional methods work well only with strongly labelled data. To complement that, they present a multiple instance learning (MIL) paradigm, a slight adaptation of classic supervised learning that leads to better results. Their NN2Vec and Bi-LSTM MIL approach achieved an F1 score of 90.1% and 90% accuracy in detecting speakers high versus low in social anxiety symptoms. This F1 score is 20.7% higher than that of the best baselines they tried.

Recently, multi-modal approaches (Siriwardhana et al., 2020; Al Hanai et al., 2018) to speech analysis have shown promising results. Al Hanai et al. (Al Hanai et al., 2018; Mohammed Abdulla et al., 2019) developed an automated depression-detection algorithm that models interviews between an individual and an agent and learns from the sequences of questions and answers without the need to perform explicit topic modeling of the content (Jindal et al., 2023). The authors used data from 142 individuals undergoing depression screening through a human-controlled virtual agent which prompts each individual with a subset of 170 possible queries, including direct questions (e.g., 'How are you?', 'Do you consider yourself to be an introvert?') and dialogic feedback (e.g., 'I see', 'that sounds great'). The data came from the publicly available Distress Analysis Interview Corpus (DAIC) and contains audio and text transcriptions of the spoken interactions (Rajan et al., 2023; Rastogi et al., 2023). The authors modeled these interactions with audio and text features in a Long Short-Term Memory (LSTM) neural network to detect depression, and through various experiments and ablation studies showed not only that a combination of modalities provides additional discriminative power, but also that the modalities contain complementary information, making it useful to model both.

Figure 1: Distribution of raw anxiety score (a) and inferred anxiety levels (b).

3 Dataset and Task Description

For our project, we work with the Generalized Anxiety Disorder (GAD) dataset from Kintsugi Mindful Wellness, Inc. The dataset consists of 2263 self-recorded audio journals of users, along with the corresponding score of each user on the GAD-7 questionnaire, taken after recording the journal.
Each of these files is annotated with an unnormalized anxiety GAD-7 score and a bucketed score. The unnormalized GAD-7 score ranges from 0 to 21, and Fig. 1a shows its frequency distribution. These raw anxiety scores are further split into 4 buckets of anxiety levels (None, mild, moderate, and severe), whose distribution is shown in Fig. 1b. The journals are of varying lengths, with most being around 1-2 minutes long. The distribution of the duration of the audio journals in the dataset is shown in the figure below.

[Figure: histogram of audio journal durations; x-axis duration in seconds (0-200), y-axis frequency.]

For the classification task, we group the datapoints into anxiety and non-anxiety classes. We classify an audio file/datapoint as 'anxiety' if it belongs to the mild, moderate, or severe buckets; consequently, 'non-anxiety' datapoints are those which fall in the None bucket (see Fig. 1b).

[Figure: (a) Sentiment distribution (negative vs. positive) in anxiety and non-anxiety classes.]
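The label construction above can be made concrete with a short sketch. Note that the cut-offs used to form the four buckets are not stated in this excerpt; the standard GAD-7 severity ranges (0-4, 5-9, 10-14, 15-21) are assumed here for illustration, and the journals table is a hypothetical stand-in for the dataset.

```python
import pandas as pd

# Standard GAD-7 severity cut-offs (assumed, not stated in this excerpt):
# 0-4 None, 5-9 mild, 10-14 moderate, 15-21 severe.
def gad7_bucket(score: int) -> str:
    if score <= 4:
        return "None"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    return "severe"

def binary_label(score: int) -> int:
    # 'anxiety' (1) for mild/moderate/severe, 'non-anxiety' (0) for the None bucket.
    return int(gad7_bucket(score) != "None")

# Hypothetical stand-in for the journals table: one raw GAD-7 score per audio journal.
journals = pd.DataFrame({"gad7_score": [2, 7, 12, 19]})
journals["bucket"] = journals["gad7_score"].map(gad7_bucket)
journals["label"] = journals["gad7_score"].map(binary_label)
print(journals)
```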
3. Model based on transcript embeddings with sample weights
4. Model based on low-dimensional audio embeddings using wav2vec
5. Multi-modal model based on both audio and transcript features

Details of each approach are described below, and our code for the experiments can be found online.¹

¹ https://github.com/prabhat1081/Anxiety-Detection-from-free-form-audio-journals
4.1 Hand-crafted audio features

Work on voice sciences over recent decades has led to the development of acoustic features for different speech analysis tasks. GeMAPS (Eyben et al., 2015) is a standard acoustic feature set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis, and has proven effective on various standard benchmarks. Hence we use the 88 functional features described in GeMAPS, extracted using the openSMILE tool (Eyben et al., 2010). Based on our analysis of the data, we further extend this feature set with 2 features (emotion and sentiment) extracted from the generated transcripts as described in Section 3. We train an L1-regularized logistic regression model on these 90 features to predict whether the individual has anxiety or not.
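A minimal sketch of this feature pipeline is given below, assuming the opensmile Python package and scikit-learn. The use of the eGeMAPSv02 functional set (which has 88 functionals) and the way the transcript-derived emotion and sentiment scores are appended are assumptions; the excerpt does not specify the exact configuration.

```python
import numpy as np
import opensmile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 88 eGeMAPS functionals per recording (the "88 functional features").
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def featurize(wav_path, emotion_score, sentiment_score):
    # One row of acoustic functionals for the whole journal ...
    acoustic = smile.process_file(wav_path).values.flatten()
    # ... extended with the two transcript-derived features, giving 90 dimensions.
    return np.concatenate([acoustic, [emotion_score, sentiment_score]])

# X: (n_samples, 90) matrix of featurize() outputs, y: binary anxiety labels
# (both assumed prepared upstream).
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
)
# clf.fit(X, y)
```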
4.2 Transcript embeddings

In this model, we use only the transcripts generated from the audio to train our anxiety classifier. We do not use any audio features in this baseline and rely solely on textual features, thereby converting the problem into a natural language processing one. Specifically, we use the transcripts generated from the audio with wav2vec2 (Table 2 shows some snippets of these generated transcripts). We embed the generated transcripts in a 768-dimensional space using the pretrained sentence embedding model Sentence-BERT (Reimers and Gurevych, 2019), which is a RoBERTa model fine-tuned for the sentence similarity task. A Gradient Boosting Classifier (GBC) model is trained on these embeddings to classify the presence of anxiety.

Table 2: Some snippets of generated transcripts from the audio data using wav2vec2.
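A minimal sketch of this model is shown below, assuming the sentence-transformers and scikit-learn packages. The specific checkpoint name (stsb-roberta-base, a RoBERTa model fine-tuned for sentence similarity with 768-dimensional output) is an assumption; the excerpt only names Sentence-BERT.

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingClassifier

# RoBERTa-based Sentence-BERT checkpoint fine-tuned for sentence similarity
# (checkpoint name assumed); it returns 768-dimensional embeddings.
encoder = SentenceTransformer("stsb-roberta-base")

# Hypothetical wav2vec2-generated transcripts with their binary anxiety labels.
transcripts = [
    "i have been feeling really on edge and worried all week",
    "today was a calm day and i went for a long walk",
    "i could not stop my thoughts from racing before the meeting",
    "work was fine and i slept well last night",
]
labels = [1, 0, 1, 0]

X = encoder.encode(transcripts)            # shape: (n_transcripts, 768)
gbc = GradientBoostingClassifier(random_state=0).fit(X, labels)
anxiety_prob = gbc.predict_proba(X)[:, 1]  # probability of the 'anxiety' class
```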
4.3 Transcript embeddings with sample weights

In our dataset, in addition to the binary anxiety labels, we also have the raw GAD-7 scores for the users, which range between 0 and 21. In this model we explore whether using these scores to assign different weights to different examples can help train a better anxiety prediction model. We utilize a simple linear weighting strategy and define the weights as

$w_i = \frac{s_i + 1}{22}$

where $s_i$ is the raw GAD-7 score for the $i$-th datapoint in the dataset and $w_i$ is its corresponding weight. Intuitively, by giving a lower sample weight to a 'non-anxiety' datapoint and a higher sample weight to a 'high-anxiety' datapoint, we bias the model toward reducing the number of false negatives at the cost of potentially producing more false positives.
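A sketch of this weighting, passed to the classifier through scikit-learn's sample_weight argument, is shown below; the choice of a Gradient Boosting Classifier mirrors Section 4.2 and is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def gad7_sample_weights(scores):
    """w_i = (s_i + 1) / 22 for raw GAD-7 scores s_i in [0, 21]."""
    return (np.asarray(scores, dtype=float) + 1.0) / 22.0

# X: transcript embeddings as in Section 4.2, y: binary labels, gad7: raw scores
# (all assumed prepared upstream).
# weights = gad7_sample_weights(gad7)
# model = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=weights)
```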
4.4 Wav2vec audio embeddings

Unsupervised representations have been shown to be quite effective in various machine learning tasks across different domains such as image, text, and audio. Hence we explore a model for our task using one of the best unsupervised representation models for speech data. We use the Wav2Vec model (Schneider et al., 2019) to generate 512-dimensional embeddings for the user recordings. The model takes raw audio at a sample rate of 16kHz as input and outputs a sequence of 512-dimensional embeddings at two levels, z and c, as shown in Figure 5. We take the mean of this sequence of embeddings at the z level to construct our final 512-dimensional embedding for the input audio file. We then train an SVM classifier on these embeddings for our anxiety prediction task.
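A sketch of the embedding step is given below, following the usage example distributed with fairseq's wav2vec release; the checkpoint path (wav2vec_large.pt) is an assumption and the file must be downloaded separately.

```python
import torch
import torchaudio
from fairseq.models.wav2vec import Wav2VecModel

# Pretrained wav2vec checkpoint from Schneider et al. (2019); path assumed.
cp = torch.load("wav2vec_large.pt", map_location="cpu")
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

def embed(wav_path):
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)               # force mono, shape (1, T)
    if sr != 16000:                                   # model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        z = model.feature_extractor(wav)              # z-level features, (1, 512, T')
    return z.mean(dim=2).squeeze(0).numpy()           # mean over time -> 512-dim vector

# X = np.stack([embed(p) for p in audio_paths]); then fit an SVM, e.g.
# sklearn.svm.SVC(C=10).fit(X, y), with C tuned on the validation set (Section 5).
```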
…improves upon the random baseline by around 20%. We analyse the effect of different features on the model predictions using SHAP (Lundberg and Lee, 2017); a visualization is shown in Fig. 7. As can be seen from the figure, the top features are loudness, F0, emotion, sentiment, etc., which have been shown in the clinical literature to be signals in patients suffering from anxiety.
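A sketch of such a SHAP analysis for a linear model is given below; the synthetic feature matrix is a stand-in for the 90-dimensional features of Section 4.1, and the explainer choice (LinearExplainer) is an assumption.

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 90-dimensional feature matrix of Section 4.1;
# in the actual pipeline X would hold the eGeMAPS functionals plus the
# emotion and sentiment features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 90))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

logreg = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

# SHAP values for the linear model; the summary plot gives the global
# feature-importance view analogous to Fig. 7.
explainer = shap.LinearExplainer(logreg, X)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```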
The performance of the model trained with sample weights is also shown in Table 3; it does not improve performance but rather degrades it. Hence we did not experiment with sample weights for our other models.

For the wav2vec model, we mean-pool the z-level embeddings into a 512-dimensional space and train an SVM classifier to predict anxiety. We tune the different hyperparameters using the validation set and find C = 10 to perform the best. The results of the best model on the test set are shown in Table 3. The ROC curve and the Precision-Recall curve for the model's performance on the test set are shown in Figure 8. We can see that the model trained on wav2vec features improves by about 30% over the random baseline. As the curves in Fig. 8 show, precision stays almost the same as recall increases from 0.2 to 0.6, signifying the model's robustness.
Figure 8: (a) ROC curve (true positive rate vs. false positive rate) and (b) Precision-Recall curve (AP = 0.66) for the wav2vec model on the test set.
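A sketch of the hyperparameter search and test-set evaluation is shown below. The paper tunes on a held-out validation set; cross-validated grid search is used here as a stand-in, and the C grid is an assumption.

```python
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train/y_train and X_test/y_test: mean-pooled wav2vec embeddings and binary
# labels from Section 4.4 (assumed prepared upstream).
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10, 100]},
                      scoring="roc_auc", cv=5)
# search.fit(X_train, y_train)
# scores = search.best_estimator_.decision_function(X_test)
# print("AUC ROC:", roc_auc_score(y_test, scores))
# print("Average precision:", average_precision_score(y_test, scores))
```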