Depression Prediction
1 Introduction
2 Related Work
The development of objective AI tools for automatic depression diagnosis has been
actively researched in recent years using markers such as self-reported symptoms,
facial expressions, audio characteristics, and text data from conversations.
Conventional methods [8, 9] select one of these data types as a unimodal input
to analyze depression status. However, depression is a complex disorder that can
manifest in many different ways, and a single type of input may not provide
enough information for an accurate diagnosis. For example, relying solely on
audio characteristics may not capture the full extent of a patient's condition.
Multimodal data can therefore help capture the complexity of depression
by integrating information from multiple sources.
A. H. Jo et al. [10] employed audio and text information from interviews to
diagnose depression. The authors extracted audio features using bidirectional long
short-term memory (Bi-LSTM) and CNN-based transfer learning models, while
word encoding was adopted to tokenize the text. Finally, the outputs of the audio
and text models were integrated via the SoftMax function to produce the results.
L. Zhou et al. [11] extracted contextual information from sequential audio and
visual data using a CAM-BiLSTM module. Next, the authors proposed a local
information fusion module and a global information interaction module to capture
fine-grained relationships between multimodal features as well as global
relationships from local and global views. These features were then concatenated
and passed through a fully connected layer (FCL) to classify a person as depressed
or normal.
C. Lin et al. [12] adopted CNN and pre-trained BERT models to capture image
and text features posted by users on social media. The extracted features were then
concatenated and fed into a deep visual-textual multimodal learning module to
classify a user as normal or depressed.
3 Proposed Method
In this section, we describe the details of our proposed method, as shown in Figure 1.
It comprises three main components: (1) temporal feature learning, (2) multimodal
representation fusion, and (3) depression prediction. The extracted audio and visual
features are the two input modalities of the proposed method. First, the TCN-based
temporal feature learning module processes the sequential data using
one-dimensional causal and dilated convolutional layers. Thanks to the use of causal
convolution, the output at time t is convolved only with input elements from time t
and earlier. Meanwhile, the purpose of dilation is to increase the receptive field
without increasing the number of parameters. Causal and dilated convolutions are
illustrated in Figure 2.
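As an illustrative sketch (not the authors' implementation), a single-channel causal dilated 1-D convolution can be written in plain Python; the kernel values and dilation factors below are arbitrary:

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Single-channel causal dilated 1-D convolution.

    The output at time t uses only inputs at t, t-d, t-2d, ... (never
    the future); the sequence is left-padded with zeros so the output
    has the same length as the input.
    """
    k = len(kernel)
    pad = (k - 1) * dilation           # receptive field minus one
    padded = [0.0] * pad + list(x)
    return [
        sum(kernel[i] * padded[pad + t - i * dilation] for i in range(k))
        for t in range(len(x))
    ]

# Same 2-tap kernel, larger dilation: the receptive field grows from
# 2 to 3 timepoints with no extra parameters.
print(causal_dilated_conv1d([1, 2, 3, 4, 5], [0.5, 0.5], dilation=1))  # [0.5, 1.5, 2.5, 3.5, 4.5]
print(causal_dilated_conv1d([1, 2, 3, 4, 5], [0.5, 0.5], dilation=2))  # [0.5, 1.0, 2.0, 3.0, 4.0]
```

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) is what lets a TCN cover long sequences with few parameters.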
After extracting the temporal feature representation from each modality, we feed
them into the cross-attention module. Unlike previous cross-attention modules,
which commonly compute attention using separate query, key, and value vectors for
each modality, our proposed module does not require a query weight for each
modality: the key of the audio modality serves as the query of the visual modality
and vice versa (Equation (3)). Afterward, we apply the row-wise and column-wise
SoftMax functions, multiplied with the corresponding value, to generate the
complemented feature representations Z_av and Z_va, respectively. Lastly, the two
complemented feature representations Z_av and Z_va are fused via channel-wise
concatenation to produce the fused feature representation Z. Given the temporal
feature representations Z_audio and Z_visual, the cross-attention can be
expressed as follows:
A = K_audio × K_visual^T (3)
Z_av = SoftMax_row(A) × V_visual (4)
Z_va = SoftMax_col(A)^T × V_audio (5)
where K and V denote the key and value projections of the corresponding modality.
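The query-free cross-attention described above can be sketched in plain Python (a minimal illustration, not the authors' code; the dimensions and equal sequence lengths are assumptions):

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        mx = max(row)                      # subtract max for stability
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def cross_attention(K_a, V_a, K_v, V_v):
    """Query-free cross-attention: each modality's key doubles as the
    other modality's query.  Row-wise SoftMax over A = K_a K_v^T
    attends audio steps over visual steps (Z_av); the column-wise
    SoftMax (row-wise over A^T) attends visual steps over audio
    steps (Z_va)."""
    A = matmul(K_a, transpose(K_v))                  # (T_a, T_v) affinities
    Z_av = matmul(softmax_rows(A), V_v)              # (T_a, d)
    Z_va = matmul(softmax_rows(transpose(A)), V_a)   # (T_v, d)
    return Z_av, Z_va

def fuse(Z_av, Z_va):
    # channel-wise concatenation (assumes equal sequence lengths T)
    return [a + v for a, v in zip(Z_av, Z_va)]
```

Because no query projection is learned, the module needs only key and value weights per modality, roughly a third fewer attention parameters than standard cross-attention.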
Finally, the fused feature representation Z is passed through the FCLs, followed by
the SoftMax function, to return the probability of depression status:
Output = SoftMax (FCL(Z)) (6)
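Equation (6) amounts to a linear layer followed by a normalized exponential; a minimal sketch (the fused feature z, weights W, and bias b below are hypothetical values, not trained parameters):

```python
import math

def fcl(z, W, b):
    """Fully connected layer: logits[j] = sum_i W[j][i] * z[i] + b[j]."""
    return [sum(w * zi for w, zi in zip(row, z)) + bj for row, bj in zip(W, b)]

def softmax(logits):
    mx = max(logits)                      # subtract max for stability
    e = [math.exp(v - mx) for v in logits]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical 3-dim fused feature and a 2-class head
# (index 0 = depressed, index 1 = non-depressed):
z = [0.2, -0.1, 0.4]
W = [[0.5, -0.3, 0.8],
     [-0.5, 0.3, -0.8]]
b = [0.1, -0.1]
probs = softmax(fcl(z, W, b))   # two probabilities summing to 1
```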
4.1 Materials
In this study, we use the D-Vlog dataset [13] to conduct comprehensive experiments.
The D-Vlog dataset, which was collected from YouTube, consists of 961 vlog videos
(555 depressed and 406 non-depressed samples) from 816 different people. Due to
personal privacy, the D-Vlog dataset provides only the extracted audio and visual
features for each sample. The visual features are the 68 facial-landmark coordinates
for each frame, giving 136 elements (x and y coordinates) per timepoint. Meanwhile,
the audio features are extracted with the openSMILE library, which produces 25
low-level audio features such as Mel-frequency cepstral coefficients (MFCCs),
loudness, and spectral flux. The number of timepoints per sample ranges from 24
to 3,969.

Table 1. The number of samples in the training, validation, and testing sets.

Set          #samples (depression/non-depression)
Training     647 (375/272)
Validation   102 (57/45)
Testing      212 (123/89)
The evaluation metrics are the weighted average precision, recall, and F1 score:
Precision = TP / (TP + FP) (7)
Recall = TP / (TP + FN) (8)
F1 = 2 × Precision × Recall / (Precision + Recall) (9)
Weighted average = Σ_c (n_c / N) × metric_c (10)
where TP, FP, and FN are the per-class true positives, false positives, and false
negatives, n_c is the number of samples in class c, and N is the total number of
samples.
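The weighted averages follow the standard support-weighted convention (the same as scikit-learn's `average='weighted'`); a minimal sketch with hypothetical labels:

```python
def per_class_prf(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def weighted_averages(y_true, y_pred):
    """Support-weighted averages of precision, recall, and F1."""
    n = len(y_true)
    wp = wr = wf = 0.0
    for cls in set(y_true):
        support = sum(t == cls for t in y_true)
        p, r, f = per_class_prf(y_true, y_pred, cls)
        wp += support / n * p
        wr += support / n * r
        wf += support / n * f
    return wp, wr, wf
```

Weighting by class support matters here because D-Vlog is imbalanced (555 depressed vs. 406 non-depressed samples).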
Table 2. Performance comparison between the proposed method and the competing methods.
Table 3. Ablation study on different combinations of proposed modules. (CA: our proposed
cross-attention module; TCN: temporal convolutional network)
In this section, we present the results of our proposed method and the previous
methods on the D-Vlog dataset. As can be seen in Table 2, the proposed method
outperforms the competing methods on all three metrics, improving on them by
2.04-5.10%, 3.33-4.5%, and 1.42-2.83% in terms of weighted average F1, weighted
average precision, and weighted average recall, respectively. These improvements
demonstrate that the combination of the temporal feature learning and multimodal
fusion modules produces a fused feature representation that identifies the patterns
and relationships between the features and the classes better than the other methods.
In addition, we conducted an experiment to analyze the impact of the proposed
multimodal fusion learning module, which is based on the cross-attention
mechanism. As seen in Table 3, the combination of the CA and TCN modules
achieves the best results on all three evaluated metrics.
5 Conclusion
Depression has received great attention from many researchers worldwide due to its
impact on mental health. In particular, the use of vlog data from social media
networks, for example the audio and visual features extracted from vlogs, provides
a fast way to predict depression. In this paper, we proposed a multimodal fusion
technique that uses an attention-based mechanism to predict depression status. Our
model consists of three components, namely, a temporal feature learning component,
a multimodal representation fusion component, and a prediction component.
Experimental results show that our proposed model outperforms all the competing
models in terms of all the evaluation metrics.
Acknowledgements
References
1. World Health Organization. Depression and other common mental disorders: global
health estimates (No. WHO/MSD/MER/2017.2). World Health Organization (2017)
2. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
3. Hochreiter, S., & Schmidhuber, J. Long short-term memory. Neural computation, 9(8),
1735-1780 (1997)
4. Bai, S., Kolter, J. Z., & Koltun, V. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
5. Wei, X., Zhang, T., Li, Y., Zhang, Y., & Wu, F. Multi-modality cross attention network for
image and sentence matching. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (pp. 10941-10950) (2020)
6. Hori, C., Hori, T., Lee, T. Y., Zhang, Z., Harsham, B., Hershey, J. R., ... & Sumi, K.
Attention-based multimodal fusion for video description. In Proceedings of the IEEE
international conference on computer vision (pp. 4193-4202) (2017).
7. Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., & Sun, C. Attention bottlenecks
for multimodal fusion. Advances in Neural Information Processing Systems, 34, 14200-
14213 (2021)
8. Choi, D., Zhang, G., Shin, S., & Jung, J. Decision Tree Algorithm for Depression Diagnosis
from Facial Images. In 2023 IEEE 2nd International Conference on AI in Cybersecurity
(ICAIC) (pp. 1-4). IEEE.
9. Tadesse, M. M., Lin, H., Xu, B., & Yang, L. Detection of depression-related posts in reddit
social media forum. IEEE Access, 7, 44883-44893. (2019)
10. Jo, A. H., & Kwak, K. C. Diagnosis of Depression Based on Four-Stream Model of Bi-
LSTM and CNN from Audio and Text Information. IEEE Access. (2022)
11. Zhou, L., Liu, Z., Yuan, X., Shangguan, Z., Li, Y., & Hu, B. CAIINET: Neural network
based on contextual attention and information interaction mechanism for depression
detection. Digital Signal Processing, 137, 103986 (2023)
12. Lin, C., Hu, P., Su, H., Li, S., Mei, J., Zhou, J., & Leung, H. Sensemood: depression
detection on social media. In Proceedings of the 2020 international conference on
multimedia retrieval (pp. 407-411).
13. Yoon, J., Kang, C., Kim, S., & Han, J. D-vlog: Multimodal Vlog Dataset for Depression
Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No.
11, pp. 12226-12234) (2022)
14. Kingma, D. P., & Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
15. Zhou, L., Liu, Z., Shangguan, Z., Yuan, X., Li, Y., & Hu, B. TAMFN: Time-aware
Attention Multimodal Fusion Network for Depression Detection. IEEE Transactions
on Neural Systems and Rehabilitation Engineering (2022)