
DepressionDetect
A Machine Learning Approach for Audio-based Depression Classification

Ky Kiefer (kykiefer)
kkiefer@umich.edu
Galvanize Data Science Immersive

Introduction

This effort addresses an automated device for detecting depression from acoustic features in speech. DepressionDetect is a tool aimed at lowering the barrier to entry in seeking help for potential mental illness and at supporting medical professionals' diagnoses.

Early signs of depression are difficult to detect and quantify. These early signs have promising potential to be quantified by machine learning algorithms that could be implemented in a wearable artificial intelligence or home device.
Data

• 50 hours of audio interviews from USC's DAIC-WOZ database.
• 189 virtual interview sessions averaging 16 minutes.

Figure 1: Virtual interview with Ellie.

• Depression classification derived from the PHQ-8 psychiatric survey.
• Scores range from 0 to 24 (>9 labeled as "depressed").

Figure 2: Distribution of PHQ-8 scores.
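The >9 cutoff can be applied directly to the survey scores when building labels. A minimal sketch in Python, assuming the scores sit in a CSV with a hypothetical PHQ8_Score column (the actual DAIC-WOZ file and column names may differ):

```python
import pandas as pd

# Hypothetical file and column names; the real DAIC-WOZ metadata may differ.
labels = pd.read_csv("phq8_scores.csv")  # one row per participant

# PHQ-8 scores range from 0 to 24; scores above 9 are labeled "depressed".
labels["depressed"] = (labels["PHQ8_Score"] > 9).astype(int)
```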
Methods

Underlying differences in prosody (rhythm, intonation, stress, etc.) exist between depressed and non-depressed individuals.

The participants' speech was segmented from silence and from the virtual interviewer using a Support Vector Machine (SVM). The resulting segmented speech was represented as a spectrogram in the frequency-time domain via a short-time Fourier transform (STFT), as sketched below.

Figure 3: Spectrogram of a plosive (a consonant sound formed by completely stopping airflow), followed by a second of silence and the spoken words, "Welcome to DepressionDetect".
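The STFT step can be reproduced with standard signal-processing tools. A minimal sketch using scipy, assuming the SVM segmentation has already produced a participant-only WAV file; the 1024-sample window is an assumption chosen to yield the 513 frequency bins described in the next section:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Assumes segmented, participant-only speech saved as a WAV file.
rate, audio = wavfile.read("participant_speech.wav")

# nperseg=1024 gives nperseg // 2 + 1 = 513 frequency bins, matching
# the network input dimensions described in the next section.
freqs, times, Z = stft(audio.astype(np.float32), fs=rate, nperseg=1024)

# Log-magnitude spectrogram in the frequency-time domain.
spectrogram = np.log1p(np.abs(Z))
```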

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) take images as input. A convolutional filter (kernel) is slid over each spectrogram input, and prosodic features of speech are learned.

A spectrogram with 513 frequency bins x 125 time bins (spanning 4 seconds) is passed as input to the network.

Figure 4: General CNN architecture.
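Producing those fixed-size inputs means cutting each participant's full spectrogram into 513 x 125 crops. A minimal sketch, assuming spectrogram is the 2-D array from the STFT step above and that 125 time bins span roughly 4 seconds at the assumed hop length (non-overlapping crops are also an assumption):

```python
import numpy as np

TIME_BINS = 125  # ~4 seconds per crop at the assumed STFT hop length

def slice_spectrogram(spec, width=TIME_BINS):
    """Split a (513, T) spectrogram into non-overlapping (513, width) crops."""
    n_crops = spec.shape[1] // width
    return np.stack([spec[:, i * width:(i + 1) * width]
                     for i in range(n_crops)])

crops = slice_spectrogram(spectrogram)  # shape: (n_crops, 513, 125)
X = crops[..., np.newaxis]              # add a channel axis for the CNN
```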
Model Architecture

• 6-layer CNN
• 2 convolution layers with max-pooling
• 2 fully connected layers

Figure 5: DepressionDetect CNN architecture.
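The bullets pin down the layer counts but not the filter sizes or widths, so the sketch below fills those in with assumed values (32 and 64 filters, 3 x 3 kernels, 2 x 2 pooling, a 128-unit dense layer); only the overall shape, two convolution + max-pooling blocks followed by two fully connected layers on a 513 x 125 x 1 input, comes from the poster:

```python
from tensorflow.keras import layers, models

# 6-layer CNN: 2 x (convolution + max-pooling) + 2 fully connected layers.
# Filter counts, kernel sizes, and dense width are assumptions, not the
# poster's values.
model = models.Sequential([
    layers.Input(shape=(513, 125, 1)),            # 513 freq bins x 125 time bins
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),         # fully connected
    layers.Dense(1, activation="sigmoid"),        # P(depressed)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```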


Results

The network was trained on 3 hours of audio. An equal number of spectrograms from each class (depressed, not depressed) and from each speaker was fed to the network to prevent model bias.

The test (holdout) set was composed of 14 participants with 40 spectrograms per subject (representing 160 s of audio each).

Predictions were first made on each 4-second spectrogram (Table 1). Then a majority vote over the 40 spectrograms per participant was used to determine each participant's depression label from 160 s of subject audio (Table 2); the vote is sketched below.

                Actual: Yes   Actual: No
Predicted: Yes  174 (TP)      106 (FP)
Predicted: No   144 (FN)      136 (TN)

Table 1: Test set predictions on 4-second spectrograms.

                Actual: Yes   Actual: No
Predicted: Yes  4 (TP)        2 (FP)
Predicted: No   3 (FN)        5 (TN)

Table 2: Test set predictions on 14 participants using majority vote.

Predictive power seems to increase with the majority vote, but the sample size is small.

Figure 6: ROC curve on the test set.
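A minimal sketch of the per-participant vote, assuming model is the trained network above and X holds one participant's 40 crops:

```python
import numpy as np

def predict_participant(model, X, threshold=0.5):
    """Label a participant by majority vote over their spectrogram crops."""
    crop_probs = model.predict(X).ravel()        # one probability per 4 s crop
    crop_labels = (crop_probs > threshold).astype(int)
    # Majority vote across the ~40 crops (~160 s of audio).
    return int(crop_labels.sum() > len(crop_labels) / 2)
```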
Conclusion

The model has an AUC score of 0.58; state-of-the-art emotion detection models reach ~0.7 using lower level audio representations (i.e., MFCCs).

Results suggest a promising new direction in using spectrograms for depression detection. Spectrograms undoubtedly have powerful, learnable features but require larger sample sizes to diminish the effects of noise (which lower level representations seek to eliminate).

To combat the limited sample size, a web app was built that accepts online audio donations for use in periodic model re-training. Visit DataStopsDepression.com or scan the QR code below to become a data donor! It only takes a couple of minutes to help fight depression!

Tech Stack
