…impediment in mental health diagnosis and treatment. Machine learning approaches based on speech samples could help in this direction. In this work, we develop machine learning solutions to diagnose anxiety disorders from audio journals of patients. We work on a novel anxiety dataset (provided through collaboration with Kintsugi Mindful Wellness Inc.) and experiment with several models of varying complexity utilizing audio, text, and a combination of multiple modalities. We show that the multi-modal and audio-embedding based approaches achieve good performance on the task, achieving an AUC ROC score of 0.68-0.69.

1 Introduction
Mental and physical health are equally important components of overall health, but there are many barriers to accessing mental health assessments, including cost and stigma. Even when individuals receive professional care, assessments are intermittent and may be limited, partly due to the episodic nature of psychiatric symptoms. Therefore, machine-learning technology using speech samples obtained in the clinic, remotely over the telephone, or through recordings on personal mobile devices could help improve diagnosis and treatment.

In this work, we study the problem of detecting anxiety disorders through speech. Since this is a relatively under-studied problem, the dataset is quite small, and hence we explore different approaches to deal with the problem, including hand-crafted features motivated by psychological studies on anxiety and pretrained embedding models. The dataset provided by Kintsugi offers an interesting opportunity in this area, as the recordings are self-recorded in a comforting environment as opposed to a clinical setting. The dataset used for training our models, along with the task description, is presented in Section 3. In Section 4, we provide a detailed explanation of our five models implemented for the task. Section 5 outlines the evaluation routines and presents the study results. Finally, we conclude our findings in Section 6.
2 Related Works

Low et al. (2020) provide a systematic review of studies using speech for automated assessment across a broad range of psychiatric disorders and summarize the different datasets, features, and methods that have been employed in this area. The authors also summarize the key acoustic features used in these models across disorders. Many studies (Horwitz et al., 2013; Quatieri and Malyska, 2012; Cummins et al., 2015; Singh et al., 2019) have observed that depression is linked with a decrease in f0 and f0 range, and that jitter and shimmer increase with depression severity. While jitter and shimmer are also correlated with anxiety (Özseven et al., 2018; Silber-Varod et al., 2016), many studies (Gilboa-Schechtman et al., 2014; Özseven et al., 2018; Weeks et al., 2012; Galili et al., 2013) have found a significant increase in mean f0 in patients with an anxiety disorder. Most studies surveyed by the authors employed simple machine learning models like SVM and logistic regression, and very few used more complex networks with transfer from speech recognition tasks (Singh et al., 2018). We plan to utilise these features, which have been well studied in both the psychology and the machine learning literature on anxiety and depression assessment, and compare them against deep learning models in terms of both performance and generalizability.

State-of-the-art works on detecting mental disorders or emotional states (e.g., anxious vs. calm) from audio data use supervised learning approaches, which must be "trained" from examples of the sound to be detected (Kumari and Singh, 2017). In general, learning such classifiers requires "strongly labeled" annotated data, where the segments of audio containing the desired vocal event and the segments not containing that event are clearly indicated. However, obtaining strongly labelled data is difficult, and in our project too we only have weakly supervised data. Salekin et al. (Salekin et al., 2018; Sebastian et al., 2019) present a weakly-supervised framework for detecting symptomatic individuals from their voice samples. The idea is to collect long speech audio samples from individuals already diagnosed with, or high in symptoms of, specific mental disorders, in situations that may heighten expression of the symptoms of the respective disorders. This type of data is considered "weakly labeled", meaning that although it indicates the presence or absence of disorder symptoms, it does not provide additional details such as the precise times in the recording that indicate the disorder.

For handling weakly labelled data, they propose a new feature modelling technique called "NN2Vec" to generate low-dimensional, continuous, and meaningful representations of speech from long weakly labelled audio data, since conventional methods work well only with strongly labelled data. To complement that, they present a multiple instance learning (MIL) paradigm, a slight adaptation of classic supervised learning that leads to better results. Their NN2Vec and Bi-LSTM MIL approach achieved an F1 score of 90.1% and 90% accuracy in detecting speakers high versus low in social anxiety symptoms. This F1 score is 20.7% higher than that of the best baselines they tried.

Recently, multi-modal approaches (Siriwardhana et al., 2020; Al Hanai et al., 2018) to speech analysis have shown promising results. Al Hanai et al. (Al Hanai et al., 2018; Mohammed Abdulla et al., 2019) developed an automated depression-detection algorithm that models interviews between an individual and an agent and learns from the sequences of questions and answers without the need to perform explicit topic modeling of the content (Jindal et al., 2023). The authors used data from 142 individuals undergoing depression screening through a human-controlled virtual agent which prompts each individual with a subset of 170 possible queries, including direct questions (e.g., 'How are you?', 'Do you consider yourself to be an introvert?') and dialogic feedback (e.g., 'I see', 'that sounds great'). The data came from the publicly available Distress Analysis Interview Corpus (DAIC) and contains audio and text transcriptions of the spoken interactions (Rajan et al., 2023; Rastogi et al., 2023). The authors modeled these interactions with audio and text features in a Long Short-Term Memory (LSTM) neural network to detect depression, and through various experiments and ablation studies showed not only that a combination of modalities provides additional discriminative power, but also that the modalities contain complementary information, making it useful to model both.

Figure 1: Distribution of raw anxiety score (a) and inferred anxiety levels (b).

3 Dataset and Task Description

For our project, we work with the Generalized Anxiety Disorder (GAD) dataset from Kintsugi Mindful Wellness, Inc. The dataset consists of 2263 self-recorded audio journals of users, along with the corresponding score of each user on the GAD-7 questionnaire, taken after recording the journal.
Each of these files is annotated with an unnormalized anxiety GAD-7 score and a bucketed score. The unnormalized GAD-7 score ranges from 0 to 21, and Fig. 1a shows its frequency distribution. These raw anxiety scores are further split into 4 buckets of anxiety levels (None, mild, moderate, and severe), whose distribution is shown in Fig. 1b. The journals are of varying lengths, with most being around 1-2 minutes long. The distribution of the duration of the audio journals in the dataset is shown in the figure below.

[Figure: histogram of audio journal durations; x-axis duration in seconds (0-200), y-axis frequency.]

For the classification task, we group the datapoints into anxiety and non-anxiety classes. We classify an audio file/datapoint as 'anxiety' if it belongs to the mild, moderate, or severe buckets; consequently, 'non-anxiety' datapoints are those which fall in the None bucket (see Fig. 1b).

[Figure: (a) Sentiment distribution (negative vs. positive) in anxiety and non-anxiety classes.]
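The label construction above can be made concrete with a short sketch. Note that the cut-offs used to form the four buckets are not stated in this excerpt; the standard GAD-7 severity ranges (0-4, 5-9, 10-14, 15-21) are assumed here for illustration, and the journals table is a hypothetical stand-in for the dataset.

```python
import pandas as pd

# Standard GAD-7 severity cut-offs (assumed, not stated in this excerpt):
# 0-4 None, 5-9 mild, 10-14 moderate, 15-21 severe.
def gad7_bucket(score: int) -> str:
    if score <= 4:
        return "None"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    return "severe"

def binary_label(score: int) -> int:
    # 'anxiety' (1) for mild/moderate/severe, 'non-anxiety' (0) for the None bucket.
    return int(gad7_bucket(score) != "None")

# Hypothetical stand-in for the journals table: one raw GAD-7 score per audio journal.
journals = pd.DataFrame({"gad7_score": [2, 7, 12, 19]})
journals["bucket"] = journals["gad7_score"].map(gad7_bucket)
journals["label"] = journals["gad7_score"].map(binary_label)
print(journals)
```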
3. Model based on transcript embeddings with sample weights
4. Model based on low-dimensional audio embeddings using wav2vec
5. Multi-modal model based on both audio and transcript features

Details of each approach are described below, and our code for the experiments can be found online.¹

¹ https://github.com/prabhat1081/Anxiety-Detection-from-free-form-audio-journals
4.1 Hand-crafted audio features

Work on voice sciences over recent decades has led to the development of acoustic features for different speech analysis tasks. GeMAPS (Eyben et al., 2015) is a standard acoustic feature set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis, and has proven effective on various standard benchmarks. Hence we use the 88 functional features described in GeMAPS, extracted using the openSMILE tool (Eyben et al., 2010). Based on our analysis of the data, we further extend this feature set with 2 features (emotion and sentiment) extracted from the generated transcripts as described in Section 3. We train an L1-regularized logistic regression model on these 90 features to predict whether the individual has anxiety or not.
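A minimal sketch of this feature pipeline is given below, assuming the opensmile Python package and scikit-learn. The use of the eGeMAPSv02 functional set (which has 88 functionals) and the way the transcript-derived emotion and sentiment scores are appended are assumptions; the excerpt does not specify the exact configuration.

```python
import numpy as np
import opensmile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 88 eGeMAPS functionals per recording (the "88 functional features").
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def featurize(wav_path, emotion_score, sentiment_score):
    # One row of acoustic functionals for the whole journal ...
    acoustic = smile.process_file(wav_path).values.flatten()
    # ... extended with the two transcript-derived features, giving 90 dimensions.
    return np.concatenate([acoustic, [emotion_score, sentiment_score]])

# X: (n_samples, 90) matrix of featurize() outputs, y: binary anxiety labels
# (both assumed prepared upstream).
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
)
# clf.fit(X, y)
```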
4.2 Transcript embeddings

In this model, we use only the transcripts generated from the audio to train our anxiety classifier. We do not use any audio features in this baseline and rely solely on textual features, thereby converting the problem into a natural language processing one. Specifically, we use the transcripts generated from the audio with wav2vec2 (Table 2 shows some snippets of these generated transcripts). We embed the generated transcripts in a 768-dimensional space using the pretrained sentence embedding model Sentence-BERT (Reimers and Gurevych, 2019), which is a RoBERTa model fine-tuned for the sentence similarity task. A Gradient Boosting Classifier (GBC) model is trained on these embeddings to classify the presence of anxiety.

Table 2: Some snippets of generated transcripts from the audio data using wav2vec2.
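A minimal sketch of this model is shown below, assuming the sentence-transformers and scikit-learn packages. The specific checkpoint name (stsb-roberta-base, a RoBERTa model fine-tuned for sentence similarity with 768-dimensional output) is an assumption; the excerpt only names Sentence-BERT.

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingClassifier

# RoBERTa-based Sentence-BERT checkpoint fine-tuned for sentence similarity
# (checkpoint name assumed); it returns 768-dimensional embeddings.
encoder = SentenceTransformer("stsb-roberta-base")

# Hypothetical wav2vec2-generated transcripts with their binary anxiety labels.
transcripts = [
    "i have been feeling really on edge and worried all week",
    "today was a calm day and i went for a long walk",
    "i could not stop my thoughts from racing before the meeting",
    "work was fine and i slept well last night",
]
labels = [1, 0, 1, 0]

X = encoder.encode(transcripts)            # shape: (n_transcripts, 768)
gbc = GradientBoostingClassifier(random_state=0).fit(X, labels)
anxiety_prob = gbc.predict_proba(X)[:, 1]  # probability of the 'anxiety' class
```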
4.3 Transcript embeddings with sample weights

In our dataset, in addition to the binary anxiety labels, we also have the raw GAD-7 scores for the users, which range between 0 and 21. In this model we explore whether using these scores to assign different weights to different examples can help train a better anxiety prediction model. We utilize a simple linear weighting strategy and define the weights as

$w_i = \frac{s_i + 1}{22}$

where $s_i$ is the raw GAD-7 score for the $i$-th datapoint in the dataset and $w_i$ is its corresponding weight. Intuitively, by giving a lower sample weight to a 'non-anxiety' datapoint and a higher sample weight to a 'high-anxiety' datapoint, we bias the model toward reducing the number of false negatives at the cost of potentially producing more false positives.
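A sketch of this weighting, passed to the classifier through scikit-learn's sample_weight argument, is shown below; the choice of a Gradient Boosting Classifier mirrors Section 4.2 and is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def gad7_sample_weights(scores):
    """w_i = (s_i + 1) / 22 for raw GAD-7 scores s_i in [0, 21]."""
    return (np.asarray(scores, dtype=float) + 1.0) / 22.0

# X: transcript embeddings as in Section 4.2, y: binary labels, gad7: raw scores
# (all assumed prepared upstream).
# weights = gad7_sample_weights(gad7)
# model = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=weights)
```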
4.4 Wav2vec audio embeddings

Unsupervised representations have been shown to be quite effective in various machine learning tasks across different domains such as image, text, and audio. Hence we explore a model for our task using one of the best unsupervised representation models for speech data. We use the Wav2Vec model (Schneider et al., 2019) to generate 512-dimensional embeddings for the user recordings. The model takes raw audio at a sample rate of 16kHz as input and outputs a sequence of 512-dimensional embeddings at two levels, z and c, as shown in Figure 5. We take the mean of this sequence of embeddings at the z level to construct our final 512-dimensional embedding for the input audio file. We then train an SVM classifier on these embeddings for our anxiety prediction task.
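A sketch of the embedding step is given below, following the usage example distributed with fairseq's wav2vec release; the checkpoint path (wav2vec_large.pt) is an assumption and the file must be downloaded separately.

```python
import torch
import torchaudio
from fairseq.models.wav2vec import Wav2VecModel

# Pretrained wav2vec checkpoint from Schneider et al. (2019); path assumed.
cp = torch.load("wav2vec_large.pt", map_location="cpu")
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

def embed(wav_path):
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)               # force mono, shape (1, T)
    if sr != 16000:                                   # model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        z = model.feature_extractor(wav)              # z-level features, (1, 512, T')
    return z.mean(dim=2).squeeze(0).numpy()           # mean over time -> 512-dim vector

# X = np.stack([embed(p) for p in audio_paths]); then fit an SVM, e.g.
# sklearn.svm.SVC(C=10).fit(X, y), with C tuned on the validation set (Section 5).
```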
…improves upon the random baseline by around 20%. We analyse the effect of different features on the model predictions using SHAP (Lundberg and Lee, 2017); a visualization is shown in Fig. 7. As can be seen from the figure, the top features are loudness, F0, emotion, sentiment, etc., which have been shown in the clinical literature to be signals in patients suffering from anxiety.
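A sketch of such a SHAP analysis for a linear model is given below; the synthetic feature matrix is a stand-in for the 90-dimensional features of Section 4.1, and the explainer choice (LinearExplainer) is an assumption.

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 90-dimensional feature matrix of Section 4.1;
# in the actual pipeline X would hold the eGeMAPS functionals plus the
# emotion and sentiment features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 90))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

logreg = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

# SHAP values for the linear model; the summary plot gives the global
# feature-importance view analogous to Fig. 7.
explainer = shap.LinearExplainer(logreg, X)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```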
The performance of the model trained with sample weights is also shown in Table 3; it does not improve performance but rather degrades it. Hence we did not experiment with sample weights for our other models.

For the wav2vec model, we mean-pool the z-level embeddings into a 512-dimensional space and train an SVM classifier to predict anxiety. We tune the different hyperparameters using the validation set and find C = 10 to perform the best. The results of the best model on the test set are shown in Table 3. The ROC curve and the Precision-Recall curve for the model's performance on the test set are shown in Figure 8. We can see that the model trained on wav2vec features improves by about 30% over the random baseline. As the curves in Fig. 8 show, precision stays almost the same as recall increases from 0.2 to 0.6, signifying the model's robustness.
Figure 8: (a) ROC curve (true positive rate vs. false positive rate) and (b) Precision-Recall curve (AP = 0.66) for the wav2vec model on the test set.
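A sketch of the hyperparameter search and test-set evaluation is shown below. The paper tunes on a held-out validation set; cross-validated grid search is used here as a stand-in, and the C grid is an assumption.

```python
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train/y_train and X_test/y_test: mean-pooled wav2vec embeddings and binary
# labels from Section 4.4 (assumed prepared upstream).
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10, 100]},
                      scoring="roc_auc", cv=5)
# search.fit(X_train, y_train)
# scores = search.best_estimator_.decision_function(X_test)
# print("AUC ROC:", roc_auc_score(y_test, scores))
# print("Average precision:", average_precision_score(y_test, scores))
```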