IISE TRANSACTIONS ON HEALTHCARE SYSTEMS ENGINEERING
2018, VOL. 8, NO. 3, 196–208
https://doi.org/10.1080/24725579.2018.1496494

Detect depression from communication: how computer vision, signal processing, and sentiment analysis join forces

Aven Samareh (a), Yan Jin (b), Zhangyang Wang (c), Xiangyu Chang (d), and Shuai Huang (a)

(a) Industrial & Systems Engineering Department, University of Washington, Seattle, Washington, USA; (b) Research Engineer, JD.com, Inc., San Francisco, California, USA; (c) Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA; (d) School of Management, Xi'an Jiaotong University, Shaanxi, P.R. China

CONTACT: Aven Samareh, asamareh@uw.edu, University of Washington, Box 352650, Seattle, WA 98195-2650, USA

ABSTRACT

Background: Depression is a common illness worldwide. Traditional procedures have generated controversy and criticism, such as concerns about the accuracy and consistency of depression diagnosis and assessment among clinicians. More objective biomarkers are needed for better treatment evaluation and monitoring.
Hypothesis: Depression will leave recognizable markers in a patient's acoustic, linguistic, and facial patterns, all of which have demonstrated increasing promise for more objectively evaluating and predicting a patient's mental state.
Methods: We applied a multi-modality fusion model to combine the audio, video, and text modalities to identify the biomarkers that are predictive of depression with consideration of gender differences.
Results: We identified promising biomarkers from a successive search on feature extraction analysis for each modality. We found that gender disparity in vocal and facial expressions plays an important role in detecting depression.
Conclusion: Audio, video and text biomarkers provided the possibility of detecting depression in addition to traditional clinical assessments. Biomarkers detected by gender-dependent analysis were not identical, indicating that gender can affect depression manifestations.

KEYWORDS
Depression recognition; audio biomarkers; video biomarkers; text biomarkers; multi-modality

1. Introduction

Depression has been recognized as a significant health concern worldwide. According to the World Health Organization (WHO), more than 300 million people suffer from depression in their daily lives.1 Psychiatric diagnoses have traditionally been made based on clinical criteria. For example, the Diagnostic and Statistical Manual of Mental Disorders (DSM) is a standard diagnostic tool published by the American Psychiatric Association (APA), and is used by healthcare professionals as an authoritative guide to the diagnosis of mental disorders (Mitchell et al., 2009). The DSM contains descriptions, symptoms, and many other criteria for diagnosing a mental disorder. Alternatively, clinicians can also monitor the severity of depression using the Patient Health Questionnaire depression scale (PHQ-8), a self-administered tool consisting of eight symptom questions (Kroenke et al., 2009). However, such traditional procedures have generated controversy and criticism about their accuracy and the consistency of agreement among clinicians due to their subjective nature (Kim, 2007). Therefore, determining an accurate and feasible screening tool for the general population remains important.
An accurate diagnosis that is made in a timely manner has the best opportunity for a positive health outcome, since the subsequent decision making will be tailored to a correct understanding of the patient's health problem (Holmboe and Durning, 2014). It is desirable to develop more biomarkers that can be automatically extracted from objective measurements. As audio, video, and text modalities are very useful data for characterizing human interaction and communications, they have been used in many applications, such as speech processing (Rivet et al., 2014), safety and security applications (Beal et al., 2003) and human-machine interaction (Shivappa et al., 2010). Also, since depression is a complex disease that manifests in all of these modalities, many research efforts have been directed toward using these datasets for depression prediction and evaluation. Studies show that depression will leave recognizable markers in a patient's vocal acoustic, linguistic and facial patterns, all of which have demonstrated increasing promise in the objective evaluation and prediction of a patient's mental state (Cummins et al., 2015; Darby et al., 1984; France et al., 2000). Central to the success of these models, extracting useful quantitative biomarkers from the audio, video, and text data has resulted in the discovery of many interesting biomarkers.

1 http://www.who.int/mediacentre/factsheets/fs369/en/

Audio biomarkers are powerful tools to assist in detecting depression severity and monitoring it over the course of a depressive disorder. For example, Alpert et al. suggested that intra-personal pause duration and speaking rate are closely related to change in depression severity over time (Alpert et al., 2001). Several studies have found distinguishable audio biomarkers such as pitch (Rabiner, 1977; Rajni and Das, 2013), decreased speech intensity (Yang et al., 2013), loudness (Kuty and Stassen, 1993), and energy (He et al., 2010), which are useful for detecting depression severity. There have been a few studies documenting the relationship between audio biomarkers and depression severity using clinical subjective ratings (France et al., 2000; Bettes, 1988; Cannizzaro et al., 2004). Overall, they found that audio biomarkers discriminate between individuals with and without depression. On the other hand, previous studies have shown that word- and phrase-related biomarkers can influence the classification performance (Alghowinem et al., 2013a; Moore et al., 2014). For example, Newman and Mather (1938) showed that selecting a few phrases can improve depression prediction. Similarly, facial-based video biomarker sets extracted from dynamic analysis of facial expressions were also identified to be predictive for depression (Alghowinem et al., 2016; Schwartz et al., 1976; Cheng et al., 2017). Video biomarkers can be extracted via head pose movement analysis (Ooi et al., 2011), automatic facial action coding systems (Hamm et al., 2011), etc. To date, there has been no systematic study that investigates biomarkers with respect to all of these modalities together, nor has any study attempted to further investigate their clinical interpretations and interactions with gender.
Depression is a complex mental disorder that cannot be solely captured from one single modality. Modalities that integrate acoustic, textual and visual biomarkers to analyze psychological distress have shown promising performance (Ghosh et al., 2014; Ringeval et al., 2017). In these mechanisms, each acquisition framework is denoted as a modality and is associated with one dataset. In a recent study (Samareh et al., 2017), researchers have investigated the performance of each of the modalities (audio, linguistic, and facial) for the task of depression severity evaluation by a multi-modal mechanism. Further, these authors improved the results by using a confidence-based fusion mechanism to combine all three modalities. Experiments on the recently released AVEC 2017 (Ringeval et al., 2017) depression dataset have verified the promising performance of the proposed model in Samareh et al. (2017). In this work, we extended Samareh et al. (2017) and identified unique biomarkers from audio, linguistic, and facial modalities that were predictive of depression. Compared with Samareh et al. (2017), in this article we identified a much more comprehensive array of biomarkers in all three modalities and further studied their clinical interpretations. For example, we captured prosodic characteristics of the speaker and quality of voice in both time and frequency domains. We learned discriminative audio biomarkers, such as low-pitched voice and multiple pauses during speech, that were predictive of depression. For video biomarkers, because of the unavailability of raw video recordings, we extracted facial biomarkers as a complementary cue that was measured simultaneously with other cues (e.g., speech and text). Hence, eye, eyebrow, mouth and head movement biomarkers were used for recognizing depression. In the text analysis, we captured valence ratings of the text using the tool of AFINN sentiment analysis. In addition, we identified depression-related words and the total number of words and sentences, which were nonverbal manifestations of depression.
The combination of these modalities and their interactions with gender information were also investigated. A recent study showed that gender plays an important role in depression severity assessment (Stratou et al., 2015). Similarly, in our study, we found that the biomarkers detected in gender-dependent analysis were not identical, indicating that gender can affect manifestations of depression. This result is consistent with previous findings on gender differences which show that depression in women may be more likely to be detected than depression in men (Nolen-Hoeksema, 1987). We found that this might be related to the theory that women are more likely to amplify their mood. For example, women likely tend to employ positive emotions, including expressing joy or laughter (Shields, 2000). Research has found that facial movement dynamics are discriminative for females and males. For example, men's movements may be more asymmetrical (Alford, 1983) and women's more animated (Hall, 1990). Hence, there is a need to investigate the subsets of biomarkers that contribute significantly to the target class (depressed) and are unique to women and men. Furthermore, existing works demonstrated that including gender information is effective in improving prediction performance (Huang et al., 2016a; Pampouchidou et al., 2016; Stratou et al., 2013; Alghowinem et al., 2013b). This suggests that it is reasonable to assess the performance of the multi-modality fusion model based on gender. In the present work, a gender-based classification system was implemented by building two separate classifiers, training two gender-specific models for males and females separately. In addition, we investigated significant biomarkers that had dominant importance for prediction in the audio, video and text modalities using biomarker selection methods. Finally, we investigated the clinical implications of these observations and their relationships with respect to depression detection.
This article is structured in the following manner. Section 2 describes the process of extracting biomarkers from three modalities. The results of biomarkers that are predictive of depression, along with their performances, are shown in Section 3. Discussions and a conclusion are drawn in Section 4.

2. Methods

2.1. Data collection

The database is part of a larger corpus, the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ), which contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety and depression. The dataset includes interviews, conducted by an animated virtual interviewer called Ellie, controlled by a

human interviewer in another room, and the participants include both distressed and non-distressed individuals. Participants were drawn from two distinct populations living in the greater Los Angeles metropolitan area and from the general public. Participants were coded for depression based on the PHQ-8.
Of the 140 participants, 76 were male, 63 were female, and 1 did not report his/her gender. The mean age of this group of participants was 39.34 (SD = 12.52). The collected data include audio and video recordings and extensive questionnaire responses. During the interview, the participant was recorded by a camera and a close-talking microphone. Data were transcribed and annotated for a variety of verbal and non-verbal features, combining both participant and interviewer. However, the focus of this article is on the participant speech, and the transcription part of Ellie was not considered. Information on how to obtain the shared data is available.2 Data are freely available for research purposes.

2.2. Study population

The Distress Analysis Interview Corpus database consists of the word-level human transcriptions, including auxiliary speech behaviors, raw audio recordings and facial landmarks, along with the gender of participants, provided in two partitions: a training set (n = 107) and a development set (n = 36). Socioeconomic background was not provided by the dataset. The multi-modality fusion model was employed on the training set, and we validated the model with the given extracted biomarkers on the development set. For generalization purposes, we used 10-fold cross-validation in all of the experiments. The average depression severity on the training and development set is M = 6.67 (SD = 5.75) out of a maximum score of 24. The level of depression is labeled with a single value per recording using a standardized, self-assessed, subjective depression questionnaire, the PHQ-8 score, which was recorded prior to every human-agent interaction. The distribution of the depression severity binary labels (PHQ-8 score >= 10) on the training and development set for females and males can be seen in Fig. 1.

Figure 1. Distribution of the depression severity on the training and development set.

2.3. Biomarkers

To detect biomarkers that are predictive of depression, we adopted a separate processing pipeline for each modality, followed by a multi-modality methodology to predict the result using the proposed method in Samareh et al. (2017). The settings of the individual prediction models and the multi-modal fusion approach are discussed in the next section.

2.3.1. Overview of the multi-modality fusion model

In this work, we used a multi-modal fusion approach to combine information across different modalities. Here, we use modality loosely to refer to the different biomarkers extracted from the audio, video and text datasets. The audio recordings include soundtracks with significant communicative cues that are symptomatic of depression. The 2D facial landmarks are indicative of changes in facial expression, which could be further analyzed for depression. In addition, people's emotional status may change from sentence to sentence, and from word to word. The fusion of these multiple modalities can provide complementary information and increase the accuracy of the overall decision-making process. However, the benefit of multi-modal fusion comes with a certain cost and complexity in the analysis process. This is due to the fact that different modalities are captured in different formats and at different rates. Also, different modalities usually have varying confidence levels. Therefore, building a robust multi-modality framework that is capable of dealing with complex datasets is essential.
Existing multi-modality fusion techniques can be categorized as biomarker-level fusion and decision-level fusion (Zahavy et al., 2016). Biomarker-level fusion learns shared or similarity-regularized biomarkers across different modalities and then performs classification based on a multi-modal representation vector. On the other hand, decision-level fusion learns an input-specific classifier for each modality and finds a decision rule to aggregate the decisions made based on each single modality. In the present work, we found that each modality contained highly varying data volume and sample dimensionality; therefore, we chose separate biomarker processing pipelines for each modality, followed by a decision-level fusion module to predict the final result.
For each modality, a random forest model was built to convert the predictive information of these biomarkers into scores, which were further combined through a confidence-based fusion method to make a final prediction on the PHQ-8 scores. Random forest is an ensemble of unpruned classification trees created by using bootstrap samples of the training data and random biomarker selection in tree induction. Usually, the prediction of a random forest is made by aggregating (majority vote or averaging) the predictions of the ensemble (Svetnik et al., 2003). However, the dataset comes from three different modalities (e.g., audio, video, and text); hence, compromising them by aggregating the predictions from all of the trees is not recommended. Therefore, for each modality we calculated the standard deviation of the predictions from all the trees, defined as the modality-wise confidence score.

2 http://dcapswoz.ict.usc.edu

Figure 2. Fusing audio, facial and textural biomarkers with the help of a multi-modality fusion model.

Let us assume a set of M randomized decision trees within a random forest, \{\varphi_{L,\theta_m} \mid m = 1, \ldots, M\}, all learned on the same data L (e.g., audio, video, or text). In order to improve the classification accuracy over decision trees, the individual trees in a random forest need to be different, which can be achieved by introducing randomness in the generation of the trees. The randomness is influenced by the seed \theta_m for each model. Ensemble methods work by combining the predictions of these models into a new ensemble model, denoted by \psi_{L,\theta_1,\ldots,\theta_M}. The randomized models are combined into an ensemble by computing the standard deviation S of the predictions to form the final prediction as

S = \sqrt{\frac{\sum_{m=1}^{M}\left(\varphi_{L,\theta_m}(x) - \bar{\psi}_{L,\theta_1,\ldots,\theta_M}(x)\right)^2}{M-1}},   (1)

where \bar{\psi}_{L,\theta_1,\ldots,\theta_M}(x) is the average prediction, defined as

\bar{\psi}_{L,\theta_1,\ldots,\theta_M}(x) = \frac{1}{M}\sum_{m=1}^{M}\varphi_{L,\theta_m}(x).   (2)

After trying several different strategies, the winner-take-all strategy (i.e., picking the single-modality prediction with the highest confidence score as the final result) seemed to be the most effective and reliable one in our setting. The overview of the multi-modal fusion framework is shown in Fig. 2.
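To make the decision-level fusion concrete, the sketch below implements the confidence rule of Eqs. (1)-(2) and the winner-take-all step in Python with scikit-learn's RandomForestRegressor. The paper does not release code, so the library choice, the function names, and the 50-tree default here are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: per-modality random forests, the spread of the tree-level
# predictions as a modality-wise confidence score (Eq. 1), and winner-take-all fusion.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_modality_model(X_train, y_train, n_trees=50, seed=0):
    """Fit one random forest on the biomarkers of a single modality (audio, video, or text)."""
    model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    return model

def predict_with_confidence(model, x):
    """Return the averaged tree prediction (Eq. 2) and the tree spread S (Eq. 1) for one sample x."""
    tree_preds = np.array([tree.predict(x.reshape(1, -1))[0] for tree in model.estimators_])
    return tree_preds.mean(), tree_preds.std(ddof=1)   # ddof=1 gives the (M - 1) denominator

def winner_take_all(models, x_by_modality):
    """Pick the single-modality prediction whose trees agree the most (smallest spread)."""
    results = [predict_with_confidence(m, x) for m, x in zip(models, x_by_modality)]
    best_prediction, _ = min(results, key=lambda r: r[1])
    return best_prediction   # fused PHQ-8 prediction
```

A smaller spread is read here as higher confidence; other mappings from spread to confidence would also be consistent with the description above.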
We identified many biomarkers that were predictive of depression with the help of signal processing on the audio files, computer vision on the 2D facial coordinates, and sentiment analysis on the individuals' transcript files. We will next introduce how we extracted biomarkers from the given data for each modality.

2.3.2. Audio biomarkers extracted by signal processing

Audio biomarkers discriminate between individuals with and without depression. The first step in the speech recognition system is to extract biomarkers; i.e., to identify the components of the audio signal that are relevant in a clinical context, and to distinguish other signals which carry information like background noise, pauses, etc. In most audio processing and analysis techniques, it is common to divide the input audio signal into short-term frames before biomarker extraction. In particular, the audio signal is broken into non-overlapping, short frames and a set of biomarkers is extracted for each frame. An audio signal changes very rapidly; hence, we assume that a signal does not change statistically (i.e., is statistically stationary) within short-term frames. We therefore broke each audio signal into 20-40 ms frames. The result of this procedure was a sequence of biomarker vectors per audio signal.
A discriminative biomarker can be learned from either the time domain or the frequency domain. For example, analyzing audio signals with respect to time provides information about the value of the signal at any given instance. Hence, the time domain is useful when the amplitude or energy of a signal needs to be examined. On the other hand, the time domain does not convey information about the rate at which the signal is varying. Therefore, representing the signal in a domain where the frequency content is described is needed. Frequency-based biomarkers (spectral biomarkers) are obtained by converting a time-based signal into the frequency domain using the Fourier transform. For example, one of the most dominant methods used to extract frequency-based biomarkers is calculating Mel-Frequency Cepstral Coefficients (MFCC). MFCC biomarkers are derived from a cepstral representation of an audio clip (a Fourier transform of the power spectrum of a voice signal). The MFCC of a signal extracts a biomarker from the speech which approximates the human auditory system's response, while at the same time deemphasizing all other information.

Table 1. Description of audio biomarkers used in a time domain.


Audio biomarkers Description No. of biomarkers
Modulation of amplitude It is used to find the amplitude of two signals that are multiplied by the superimposed signals. 1
Envelope It represents the varying level of an audio signal over time. 1
Autocorrelation It shows the repeating patterns between observations as a function of the time lag between them. 1
Onset detector It is used to detect a sudden change in the energy or any changes in the statistical properties of a signal. 1
Entropy of energy It is a measure of abrupt changes in the energy level of an audio signal. 1
Tonal power ratio It is obtained by taking the ratio of the tonal power of the spectrum components to the overall power. 1
RMS power Root mean square (RMS) approximates the volume of an audio frame. 1
ZCR Zero Crossing Rate (ZCR) is the number of times the signal changes sign in a given period of time. 1
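The time-domain biomarkers in Table 1 are computed per short frame. As a minimal illustration, the sketch below frames a mono signal and computes two of the Table 1 quantities, RMS power and zero crossing rate, with NumPy; the 25 ms frame length is an assumed value inside the 20-40 ms range stated above, and the function names are ours.

```python
# Minimal frame-based time-domain feature sketch (not the authors' extraction code).
import numpy as np

def frame_signal(x, sr, frame_ms=25):
    """Split a mono signal x (1D array, sample rate sr) into non-overlapping short frames."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def rms_power(frames):
    """Root mean square per frame, approximating the volume of each frame (Table 1)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of sample-to-sample sign changes per frame (ZCR in Table 1)."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)
```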

Another example of a frequency-based technique is Perceptual Linear Prediction (PLP). PLP biomarkers are obtained based on the concept of the psychophysics of hearing, and they are also used to derive an estimate of the human auditory spectrum. Differences between PLP and MFCC lie in the intensity-to-loudness conversion of a sound.
In our setting, we applied a set of validated and tested biomarker extraction techniques that aim to capture prosodic characteristics of the speaker as well as the quality of voice in both the time and frequency domains. Descriptions of the audio biomarkers in both time and frequency domains are shown in Tables 1 and 2. A total of 35 audio biomarkers were used.

2.3.3. Visual biomarkers extracted by computer vision

Several biomarkers were extracted from the set of 2D coordinates of the 68 facial points (Baltrusaitis et al., 2016) provided by AVEC'17, despite the unavailability of the video recordings. Hence, a preprocessing of the 2D facial landmarks was done to obtain head and distance biomarkers. Overall, we obtained 133 video biomarkers using both head motion and distance with complementary statistical measures.

1. Head biomarkers: Initially, we drew the 2D polygons of facial landmarks using the patch function in MATLAB. The patch function creates several polygons using elements of the 2D coordinates for each vertex, connecting the vertices to specify them. Such polygons, which encode facial landmark regions as well as temporal movements, have resulted in different video frames. The landmark motion history obtained using the patch function is shown in Fig. 3a. The points which represent head motion and have minimal involvement in facial movement were chosen; points 2, 4, 14, and 16 are among these landmarks, as shown in Fig. 3b (Pampouchidou et al., 2016b). In addition, given that the regions between the eyes and mouth are the most stable regions due to their minimal involvement in facial expression, we followed the approach in Yang et al. (2016) and calculated the mean shape of 46 stable points without regard to gender. For each of these facial points, statistical measures such as the mean and median were calculated, resulting in 41 biomarkers.
2. Distance biomarkers: Both head motion and facial expression can be indicators of a person's behavior, and may convey sets of essential information regarding the emotional state of a person. We split the facial landmarks into three groups of different regions: the left eye and left eyebrow, the right eye and right eyebrow, and the mouth. The pairwise Euclidean distance between coordinates of the landmarks was calculated, as well as the angles (in radians) between the points, resulting in 92 biomarkers. In detail, for the left and right eye, the distance between points {37, 40} and {43, 46} was measured as the horizontal distance, and the distance between {38, 42} and {44, 48} was measured as the vertical distance, respectively. For the mouth, the distance between points {52, 58} was measured as the vertical distance, and the average of the distances between the two pairs of points {55, 49} and {65, 61} was measured as the horizontal distance. Usually, eyebrow movements occur simultaneously. Hence, we took the average distance between two pairs, using points {22, 23} and {27, 18} as horizontal measures, and {31, 25}, {31, 20} as vertical distances. We calculated the difference between the coordinates of the landmarks and those from the mean shape, and finally calculated the Euclidean distance (L2-norm) between the points for each group (see the sketch below).
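As a rough illustration of the distance biomarkers just described, the sketch below computes a few of the eye and mouth distances and point-to-point angles from a (68, 2) landmark array; the 1-based point numbering follows Fig. 3b, and the helper names are illustrative, not the authors' code.

```python
# Sketch of landmark distance/angle biomarkers from a (68, 2) array of 2D facial landmarks.
import numpy as np

def euclidean(landmarks, i, j):
    """L2 distance between two 1-based landmark indices."""
    return np.linalg.norm(landmarks[i - 1] - landmarks[j - 1])

def angle(landmarks, i, j):
    """Angle (radians) of the segment joining two 1-based landmark indices."""
    dx, dy = landmarks[j - 1] - landmarks[i - 1]
    return np.arctan2(dy, dx)

def eye_mouth_distances(landmarks):
    """A few of the eye and mouth distance biomarkers described in Section 2.3.3."""
    return {
        "left_eye_horizontal": euclidean(landmarks, 37, 40),
        "left_eye_vertical": euclidean(landmarks, 38, 42),
        "right_eye_horizontal": euclidean(landmarks, 43, 46),
        "right_eye_vertical": euclidean(landmarks, 44, 48),
        "mouth_vertical": euclidean(landmarks, 52, 58),
        "mouth_horizontal": 0.5 * (euclidean(landmarks, 55, 49) + euclidean(landmarks, 65, 61)),
    }
```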
2.3.4. Text biomarkers extracted by sentiment analysis

A set of text biomarkers was calculated based on each individual's transcript file. The original transcript file included the transcribed communication content between each participant and the animated virtual interviewer (named 'Ellie', who is controlled by a human interviewer), as well as the duration (start time and stop time) of each round of communication. The focus of this article deals solely with participant speech; the transcription part of Ellie was not considered. The first part of the text biomarkers concerned basic statistics of words or sentences from the transcription file. This included the number of sentences over the duration, the number of words, and the ratio of the number of instances of laughter over the number of words. The use of these biomarkers is supported by the literature, indicating that slowed and reduced speech, elongated speech pauses, and short answers are nonverbal manifestations of depression (Ellgring, 2007). The second category of text biomarkers concerned depression-related words; namely, the ratio of depression-related words over the total number of words over the duration. Words relating to depression were identified from a dictionary of more than 200 words, which can be downloaded from the Depression Vocabulary Word List.3

3 https://myvocabulary.com/word-list/depression-vocabulary/

Table 2. Description of audio biomarkers used in a frequency domain.


Audio biomarkers Description No. of biomarkers
PLP It is a technique to minimize the differences between speakers. 9
MFCC It is a representation of the short-term power spectrum of an audio signal. 12
Spectral decrease It computes the steepness of the decrease of the spectral envelope. 1
Spectral rolloff It can be treated as a spectral shape descriptor of an audio signal. 1
Spectral flux It is a measure of spectral change between two successive frames. 1
Spectral centroid It is a measure to characterize the center mass of the spectrum. 1
Spectral slope It is the gradient of the linear regression of a spectrum. 1
Spectral autocorrelation It is a function that measures the regular harmonic spacing in the spectrum of the speech signal. 1
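A brief sketch of extracting some of the Table 2 frequency-domain biomarkers is given below, using librosa as an assumed toolbox (the paper does not state which software was used); the 12 MFCC coefficients match Table 2, and the per-recording mean summary is an illustrative choice.

```python
# Sketch of frame-level spectral biomarkers summarized to one vector per recording.
import numpy as np
import librosa

def spectral_biomarkers(wav_path, frame_ms=25):
    y, sr = librosa.load(wav_path, sr=None, mono=True)        # keep the native sample rate
    hop = int(sr * frame_ms / 1000)                            # assumed frame step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)        # (12, n_frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)  # (1, n_frames)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)    # (1, n_frames)
    # Summarize each frame-level sequence (here with its mean) to obtain one vector per recording.
    return np.concatenate([mfcc.mean(axis=1), centroid.mean(axis=1), rolloff.mean(axis=1)])
```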

Figure 3. Facial landmark motion and facial landmark numbering. (a) Landmark motion history using patch function (b) Facial landmark numbering.

In addition, we introduced a new set of text sentiment biomarkers, obtained using the tool of AFINN sentiment analysis
(Nielsen, 2011), which would represent the valence of the
current text by comparing it to an existing word list with
known sentiment labels. The outcome of AFINN sentiment
analysis is an integer between minus five (negative) and plus
five (positive), where a negative or positive number corre-
sponds, in turn, to a negative or positive sentiment. These
ratings are calculated according to the psychological reaction
of a person to a specific word, where each word is catego-
rized for valence with numerical values. In addition, the
mean, median, min, max, and standard deviation of the sen-
timent analysis outcomes (as a time series) were measured.
A total of eight biomarkers were extracted.
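As an illustration, the sentiment biomarkers described above could be computed with the afinn Python package (Nielsen, 2011) as sketched below; treating the per-utterance scores as a time series and the exact summary set shown are assumptions consistent with the text, not a reproduction of the authors' code.

```python
# Sketch of AFINN-based sentiment summary biomarkers over a participant's utterances.
import numpy as np
from afinn import Afinn

def sentiment_biomarkers(utterances):
    scorer = Afinn()
    scores = np.array([scorer.score(u) for u in utterances], dtype=float)
    return {
        "mean": scores.mean(),
        "median": np.median(scores),
        "min": scores.min(),
        "max": scores.max(),
        "std": scores.std(),
    }
```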

3. Experimental results
3.1. Identifying biomarkers that are predictive
to depression
Figure 4. Scoring history over the number of trees in prediction with audio biomarkers.

To discover promising biomarkers, we studied the details of the multi-modality fusion model proposed in Samareh et al. (2017) and compared the results to single modalities. Baseline scripts provided by the AVEC have been made available in the data repositories, where depression severity was computed using a random forest regressor. We used the training and the validation sets to pick the optimal number of trees. The model deviance was calculated based on the mean squared error between the training and the validation set. Figure 4 shows the model deviance on both the training set (blue) and validation set (yellow), as a function of the number of trees, for the audio dataset. Each random forest model aggregates the outcomes of 50 individual trees. According to our experiments, if only audio biomarkers are used, the performance of the random forest model gets saturated with around 25 trees. After blending the video and text modalities, the random forest model demands a larger learning capacity and grows to 50 trees, until it meets another performance plateau.
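A minimal sketch of this model-deviance check, assuming scikit-learn random forests and mean squared error as in Fig. 4, is shown below; it is not the AVEC baseline script itself, and the step size over the number of trees is arbitrary.

```python
# Track train/development MSE while growing the number of trees, as in Fig. 4.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def deviance_curve(X_train, y_train, X_dev, y_dev, max_trees=50, step=5):
    curve = []
    for n in range(step, max_trees + 1, step):
        rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_train, y_train)
        curve.append((n,
                      mean_squared_error(y_train, rf.predict(X_train)),
                      mean_squared_error(y_dev, rf.predict(X_dev))))
    return np.array(curve)  # columns: n_trees, train MSE, dev MSE
```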

Table 3. Performance comparison among single modalities and the multi-modality fusion model.

                                              Development        Train
Biomarkers used                               RMSE    MAE        RMSE    MAE
The baseline provided by the AVEC organizer
  Visual only                                 7.13    5.88       5.42    5.29
  Audio only                                  6.74    5.36       5.89    4.78
  Audio & Video                               6.62    5.52       6.01    5.09
The model that doesn't include the gender variable
  Visual only                                 6.67    5.64       6.13    5.08
  Audio only                                  6.00    5.25       5.62    4.89
  Text only                                   5.95    5.21       5.68    5.17
  Multi-modality fusion model                 5.12    4.12       4.25    4.54
The model that includes the gender variable
  Visual only                                 5.65    4.87       4.99    4.46
  Audio only                                  5.89    5.18       5.66    5.06
  Text only                                   5.86    4.88       5.67    4.96
  Multi-modality fusion model                 4.78    4.05       4.35    3.69

We employed the single modalities as well as the multi-modality fusion model proposed in Samareh et al. (2017) on the training set and validated the models using the development set. The depression severity baseline was computed using a random forest regressor (Ringeval et al., 2017). The baseline performance was provided by the AVEC to show the state-of-the-art prediction performance AVEC can achieve on this dataset. In the baseline, the fusion of the audio and video modalities was performed by averaging the regression outputs of the unimodal random forest regressors. In addition, the root mean square error (RMSE) and mean absolute error (MAE), averaged over all observations, were used to evaluate the models, while the best model was selected based on RMSE. Table 3 reports the performance of the single modalities, the proposed multi-modality fusion model, and the baseline on the development and training sets. The results show that fusing the audio, video and text biomarkers is powerful in detecting depression, with an RMSE of 5.12 and MAE of 4.12 attained on the development set, compared with the single modalities. In addition, the multi-modality fusion model had relatively better performance compared with the baseline results of RMSE 6.62 and MAE 5.52.
The performance of the multi-modality fusion model indicates the promising value of capturing depression by integration of different biomarkers. The marginal differences between the multi-modality mechanism and the single modalities might be due to many reasons. For example, the RMSE and MAE of the audio and text biomarkers are 6.00, 5.25 and 5.95, 5.21, respectively. However, the RMSE and MAE of the video modality on the development set are 6.67 and 5.64, respectively. This indicates that fusing the video with the other modalities might not have been very helpful. Moreover, the current dataset includes biomarkers and gender information only, and does not include other measures.

3.2. Investigating the gender information on biomarkers

Existing works demonstrated that including gender information is effective in improving prediction performance (Huang et al., 2016a; Pampouchidou et al., 2016). Thus, we investigated the effect of gender by comparing the performance of the model both with and without inputting subject gender. The multi-modality mechanism further improved the performance of depression severity prediction significantly, with an RMSE of 4.78 and MAE of 4.05 attained on the development set with gender included as a variable, as shown in Table 3. Besides the biomarkers extracted from the three modalities, the gender of subjects was found to be highly correlated with the depression severity prediction. We therefore proposed to include gender information by training two gender-specific models, for males and females separately.

Table 4. Performance comparison among single modalities and the multi-modality fusion model by training gender-specific models.

                                              Development        Train
Biomarkers used                               RMSE    MAE        RMSE    MAE
Female
  Visual only                                 6.35    5.01       5.69    4.87
  Audio only                                  6.45    5.38       5.03    4.66
  Text only                                   6.41    5.68       5.06    4.65
  Multi-modality fusion model                 5.09    4.35       4.89    4.23
Male
  Visual only                                 6.44    5.18       5.92    4.89
  Audio only                                  5.77    5.26       4.78    4.02
  Text only                                   5.76    4.79       4.91    4.36
  Multi-modality fusion model                 4.95    4.32       4.91    4.02

Detailed results of the gender-specific models can be found in Table 4. The multi-modality fusion model could attain an RMSE of 5.09 and MAE of 4.35 for females, and an RMSE of 4.95 and MAE of 4.32 for males. Although there are marginal differences in the multi-modality methods used between females and males, the performance of the gender-specific models demonstrates variation in prediction. The performance of the gender-specific models shows that the individual audio and text modalities (except video) have a marginally better performance for males than for females. This might be due to gender-based differences in the audio and text biomarkers. For example, the vocal characteristics of females and males are different, and detecting speech patterns might be affected by variation in pitch, loudness, and the rhythm of speech.

3.3. Investigating dominant biomarkers and their significance

We extracted significant biomarkers from each modality to investigate gender differences. Selecting a subset of biomarkers that significantly contribute to the target class (depressed) can reduce the length of the process. The significant vocal, linguistic and facial biomarkers were attained from a successive search on biomarkers during the extraction analysis. Significant biomarkers were extracted by computing estimates of predictor importance based on permutation. In the following, we provide brief details of how it works. The predictor importance measures how influential a candidate predictor variable is at predicting the response variable. The influence of a predictor increases with the value of this measure; hence larger values indicate predictors that have a greater influence on the prediction. For example, if a predictor is influential in prediction, then permuting its values should affect the error of the model. If a predictor is not influential, then permuting its values should have little to no effect on the model error.
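Permutation-based predictor importance of this kind can be obtained, for example, with scikit-learn's permutation_importance, as in the sketch below; the estimator, the number of repeats, and the ranking step are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: rank biomarkers by permutation importance for one modality (or one gender subset).
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def top_biomarkers(X, y, feature_names, k=5, seed=0):
    rf = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)
    result = permutation_importance(rf, X, y, n_repeats=20, random_state=seed)
    ranked = sorted(zip(feature_names, result.importances_mean),
                    key=lambda t: t[1], reverse=True)
    return ranked[:k]   # top-k biomarkers with the largest mean importance
```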

Figure 5. Top five most important biomarkers that have dominant importance for the prediction in the audio, video and text modalities. (a) Audio biomarkers for females. (b) Audio biomarkers for males. (c) Video biomarkers for females. (d) Video biomarkers for males. (e) Text biomarkers for females. (f) Text biomarkers for males.

Figure 5 shows the top five significant biomarkers from each individual modality for females and males. Furthermore, we computed the p-value of each top biomarker in relation to the target class using multiple linear regression, as shown in Table 5.

Table 5. p-values of the selected top five significant biomarkers for females and males.

Modality          | Female biomarker             | p-value  | Male biomarker               | p-value
Audio biomarkers  | MFCC                         | 0.0012   | Spectral slope               | <0.0001
                  | Modulation of amplitude      | 0.0008   | PLP                          | <0.0001
                  | Autocorrelation              | <0.0001  | Modulation of amplitude      | 0.022
                  | PLP                          | <0.0001  | Spectral decrease            | 0.012
                  | Spectral decrease            | 0.002    | Entropy of energy            | 0.011
                  | Spectral slope               | 0.0080   | MFCC                         | 0.1140
                  | Entropy of energy            | 0.2100   | Autocorrelation              | 0.0210
Video biomarkers  | Left eye distance (38,42)    | <0.0001  | Left eye distance (39,41)    | <0.0001
                  | Right eye distance (44,48)   | <0.0001  | Left eye distance (38,42)    | 0.0028
                  | Mouth distance (52,58)       | 0.0014   | Right eye distance (45,47)   | 0.0012
                  | Mouth angle (50,51)          | 0.001    | Mouth angle (61,90)          | <0.0001
                  | Eyebrow distance (22,23)     | <0.0001  | Mouth angle (60,59)          | 0.0311
                  | Left eye distance (39,41)    | <0.0001  | Left eye distance (38,42)    | <0.0001
                  | Left eye distance (38,42)    | <0.0001  | Right eye distance (44,48)   | <0.0001
                  | Right eye distance (45,47)   | 0.0063   | Mouth distance (52,58)       | 0.6510
                  | Mouth angle (61,90)          | <0.0001  | Mouth angle (50,51)          | 0.0052
                  | Mouth angle (60,59)          | <0.0001  | Eyebrow distance (22,23)     | 0.1000
Text biomarkers   | No. sentences/duration       | <0.0001  | Sentiment_max                | 0.003
                  | Sentiment_mean               | <0.0001  | Sentiment_standard deviation | <0.0001
                  | No. laughters/no. words      | 0.032    | No. laughters/no. words      | 0.126
                  | Sentiment_min                | 0.887    | Sentiment_mean               | 0.366
                  | No. words/duration           | 0.311    | Sentiment_min                | 0.369
                  | Sentiment_standard deviation | 0.0036   | No. sentences/duration       | 0.322

The resulting significant biomarkers behave differently for females and males in each modality. In addition, a crossover interaction with gender was observed. For example, MFCC ranked the highest in the list for females, while PLP and spectral decrease were ranked the lowest. However, spectral slope and PLP were both ranked high for males. As a result, in-depth clinical interpretation is needed to decipher the clinical implications of these observations and their relationship with respect to depression detection.

4. Discussion

In this article, we studied the details of the multi-modality fusion model proposed in Samareh et al. (2017) by integrating the vocal, facial and textual modalities to discover promising biomarkers that were predictive of depression. The impact of the biomarker selection method was exploited to highlight each biomarker's communicative power for depression detection, and crossover interactions were observed in the gender-specific predictions. Developing biomarkers that can be automatically extracted from objective measurements demonstrates the increasing promise of this methodology in the accurate diagnosis of a patient's mental condition. Therefore, to interpret the observations and to identify disparities between genders, we investigated their clinical interpretations with respect to depression detection.

4.1. Audio biomarkers and their relationship with depression detection

We found that non-linguistic speech patterns provided significant communicative power through variations in pitch, loudness, interruption, anger, and laughter. Hence, they are helpful in providing cues that aid in the interpretation of audio signals. For example, one of the most important perceptual elements of sound is pitch, as shown in Figs. 5a and 5b. Studies show that when someone speaks in a slow, soft voice, they may be signaling sadness or depression (Siegman and Boyle, 1993). Pitch is determined from frequencies that are clear, and can be detected via modulation of amplitude (Morley and Rowe, 1990; Darwin et al., 1994), autocorrelation signals (Rabiner, 1977), MFCC (Noll, 1967; Sondhi, 1968), and energy (Seither-Preisler et al., 2004). As we can see from the results for audio biomarkers in Fig. 5a, MFCC, autocorrelation, and modulation of amplitude were among the most expressive biomarkers for females. The significance of these biomarkers is shown in Table 5. This might be due to the fact that the pitch of the male voice falls under low frequencies, whereas the female voice is higher pitched.
Another important indication of depression is anger, which is characterized by an increase in pitch. Biaggio et al. (1987) studied the effect of anger on depression among 112 university students. They found that, among depressed subjects, there was a more intense experience of hostility and less control over anger. Angry voices typically have a large proportion of high-frequency energy in the spectrum, and hence can be detected via frequency-based biomarkers like MFCC (Grossmann, 2010; Saito, 1998) and PLP (Burkhardt et al., 2009). Interestingly, the results showed that these biomarkers appear for both females and males, as shown in Figs. 5a and 5b.
Another voice characteristic the scientific community agrees aids in detecting depression is laughter. The influence of depression on the reduction of laughter frequency in speech has been shown in Fonzi et al. (2010). It has been found that laughter is able to relieve the negative consequences of stressful events and depressive symptoms. This characteristic of sound can be found through changes in spectral/cepstral biomarkers (Truong and Van Leeuwen, 2007). As shown in Table 5, the spectral/cepstral biomarkers were significant for both females and males. However, we found cross-relations across speech patterns: any changes in the intonation or the rhythm of speech could be reflected in the spectral/cepstral biomarkers and could be considered as either laughter or anger. This shows that, despite the great value of sound for determining the emotional state of the speaker, further investigation is required before drawing any conclusion. The complete list of vocal characteristics and their relationship with audio biomarkers is shown in Table 6.

Table 6. Audio biomarkers and their relationship with non-linguistic speech patterns (lower loudness, lower pitch, more silence, more interruptions, more pauses, more anger, less laughter), where a higher or lower level of each pattern, as indicated, corresponds to a higher risk of depression. Supporting citations per biomarker:

Modulation of amplitude: (Bauer et al., 2006; Gramming et al., 1988); (Morley and Rowe, 1990; Darwin et al., 1994)
Envelope: (Robel and Rodet, 2005); (Bohnet and Frey, 1999)
Autocorrelation: (Sato et al., 2002); (Rabiner, 1977)
Onset detector: (Holzapfel et al., 2010); (Brossier et al., 2004)
Entropy of energy: (Fletcher and Steinberg, 1924); (Seither-Preisler et al., 2004)
Zero crossing: (Geckinli and Yavuz, 1977); (Bohnet and Frey, 1999); (Erdol et al., 1993)
PLP: (Hermansky, 1990); (Pollak et al., 1995); (Burkhardt et al., 2009); (Truong and Van Leeuwen, 2007)
MFCC: (Noll, 1967; Sondhi, 1968); (Atal and Rabiner, 1976); (Samborski et al., 2011; Reynolds et al., 2000); (Grossmann, 2010; Saito, 1998); (Truong and Van Leeuwen, 2007)
Spectral decrease: (Combrinck and Botha, 1996); (Robel and Rodet, 2005); (Varadharajan et al., 2006); (Truong and Van Leeuwen, 2007)
Spectral rolloff: (Combrinck and Botha, 1996); (Truong and Van Leeuwen, 2007)
Spectral flux: (Combrinck and Botha, 1996); (Truong and Van Leeuwen, 2007)
Spectral centroid: (Combrinck and Botha, 1996); (Truong and Van Leeuwen, 2007)
Spectral slope: (Combrinck and Botha, 1996); (Truong and Van Leeuwen, 2007)
Spectral autocorrelation: (Combrinck and Botha, 1996); (Truong and Van Leeuwen, 2007)

4.2. Video biomarkers and their relationship with depression detection

We found that biomarkers related to eye, eyebrow, and mouth movements were predictive for depression detection, as shown in Figs. 5c and 5d. We also investigated the influence of gender-dependent classification on the selected video biomarkers. For example, we found that eye movement holds discriminative power for detecting depression. The results show that the vertical L2 distance of the left eye [(38,42) and (39,41)] and the vertical L2 distance of the right eye [(44,48) and (45,47)] were significant biomarkers among depressed subjects for both female and male participants, as shown in Table 5. This might be an indication of fatigue, which is a common symptom of depression.
Another biomarker is mouth movement, which is accompanied by a deep breath. The mouth movement sometimes happens when the lips are closed, or pushed forward and twisted to one side. Usually, rapid mouth movement is an indication of self-consciousness in social situations. However, from Table 5, we could see this was not a significant biomarker for male participants. Further, we observed that the angles between the points in the mouth were significant (for both females and males). This is because non-speech mouth movements have shown indications of depression symptoms (Heller and Haynal, 1997).
Another significant biomarker was rapid eyebrow raising, which was observed in females only. Studies show that when the frequency of an audio signal is pronounced, it is usually also associated with eyebrow movement. This is due to the fact that audio and facial signals are integrated. The fact that speech and facial biomarkers interact suggests that both are controlled by the same control system, and this needs further investigation.

4.3. Text biomarkers and their relationship with depression detection

We found that the ratio of depression-related words over the duration was a significant text biomarker for detecting depression in females, as shown in Figure 5. In contrast, this biomarker did not show any significance for men, as shown in Table 5. Interestingly, the ratio of the number of instances of laughter over the number of words was a significant biomarker for females, and not for males. Research has confirmed that reported emotional responses in women are stronger. This is because women likely tend to exhibit positive emotions, including the expression of joy or laughter (Shields, 2000).
However, the remaining results on gender differences show that there is no gender dependence in detecting depression using text biomarkers. One reason is that the depression-related words were identified from a common dictionary for both groups. A key factor

known to differ between males and females is the use of affiliative language versus assertive language. Affiliative language positively engages the other person by expressing agreement, whereas assertive language includes directive statements. Women tend to use affiliative language more, and men tend to use assertive language more (Leaper and Smith, 2004). Perhaps further research is needed in a more discriminative setting in order to properly construct the dictionary and shed light on this issue.

4.4. Limitations and next steps

One of the challenges that became apparent was the availability and limitations of the baseline dataset. For example, the video dataset had some restrictions due to its unavailability and restricted format. The results showed that fusing the video with the other modalities was not very helpful. The availability of raw video recordings could help in extracting more relevant biomarkers; hence, it could improve the predictability of the multi-modality mechanism. Another limitation of the dataset was the unavailability of some measures that could improve the detectability of depression. For example, studies showed that physiological signals such as heart rate are strongly correlated with depression detection (Chen et al., 2011; Ringeval et al., 2015), and are complementary to audiovisual data. Thus, access to these measures could help the detection task. On the other hand, the Wizard of Oz interviews were transcribed from a composite video combining both participant and interviewer. Later, automated interviews were transcribed from the audio stream of the participant only, without including the interviewer questions. However, full comprehension of the text is not possible. For example, several transcribed files included unrecognizable words marked by "x," suggesting that prediction could also be challenged with the existing dataset.
Another limitation of this study is the virtual interviewer, Ellie.4 She functions through a computer program that uses different algorithms to determine her questions and motions and analyzes emotional cues. The computer program reads a patient's current facial expression and the variation in expressions throughout the session, and takes any flat expression as symptomatic of depression. Despite her ability to collect signs of mental health problems, she cannot deliver solutions and doesn't have the humanity to aid in therapy work. For example, unlike a human therapist, Ellie cannot respond to questions. Moreover, some people find it easier to open up to a human therapist when speaking face-to-face. Furthermore, Ellie's response is limited to changes in facial expression and recordings of the conversation; however, human behavior can be difficult to understand, and may rely on changes in physiological parameters (e.g., blood pressure, weight, cardiac conditions, etc.).
We also faced the challenge presented by the limitations of the current model. The most essential endeavor to be taken in using the multi-modality fusion model is to learn about relationships between modalities. Thus, prior knowledge of modality-specific information, interactions among the modalities, and other mutual properties could be very useful. The current multi-modality fusion model ignored the interactions among modalities, such as audio-visual relations. Some studies have shown that combining audio and visual biomarkers can improve emotion recognition, which is a closely related topic to depression evaluation (Kim, 2007). For example, Ekman investigated the systematic relations of speech and eyebrow movement (Cave et al., 2002). He found that when the frequency of a sound is pronounced, it is usually associated with eyebrow movement. This is due to the fact that audio and facial signals are integrated. Further, we found that there are also cross-relations in speech patterns. For example, we found that both anger and laughter appear in audio signals when frequencies are amplified. Further investigation is needed to distinguish these two signals. In addition, we recognized the importance of biomarkers in multi-modality noisy data. With the addition of more relevant biomarkers, overall performance could be largely improved. For example, silence detection could be used to check how often the participants keep quiet, which may act as a depression indicator.
One possible way to improve the multi-modality fusion model is to build a personalized model by constructing individual models and then aggregating them to generate prediction outcomes (Samareh and Huang, 2018). Towards this goal, the model could be extended to a more general social signal processing mechanism. This could help us in characterizing social interactions by assessing audio-visual recordings in real-world conditions. Another solution is to transcribe the audio dataset; hence, learning complex emotions that interact with one another, such as anger and laughter, may help optimize recognition accuracy.

4 http://viterbi.usc.edu/news/news/2013/a-virtual-therapist.htm

ORCID

Aven Samareh http://orcid.org/0000-0002-2630-8041

References

Alford, R.D. (1983) Sex differences in lateral facial facility: The effects of habitual emotional concealment. Neuropsychologia 21(5), 567–570.
Alghowinem, S., Goecke, R., Wagner, M., Epps, J., Breakspear, M., Parker, G. (2013a) Detecting depression: A comparison between spontaneous and read speech. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 7547–7551.
Alghowinem, S., Goecke, R., Wagner, M., Parker, G., Breakspear, M. (2013b) Head pose and movement analysis as an indicator of depression. In: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference, 283–288.
Alghowinem, S., Goecke, R., Epps, J., Wagner, M., Cohn, J.F. (2016) Cross-cultural depression recognition from vocal biomarkers. In: INTERSPEECH, 1943–1947.
Alpert, M., Pouget, E.R., Silva, R.R. (2001) Reflections of depression in acoustic measures of the patient's speech. Journal of Affective Disorders 66(1), 59–69.

Atal, B., Rabiner, L. (1976) A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 24(3), 201–212.
Baltrusaitis, T., Robinson, P., Morency, L.-P. (2016) OpenFace: An open source facial behavior analysis toolkit. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference, 1–10.
Bauer, J.J., Mittal, J., Larson, C.R., Hain, T.C. (2006) Vocal responses to unanticipated perturbations in voice loudness feedback: An automatic mechanism for stabilizing voice amplitude. The Journal of the Acoustical Society of America 119(4), 2363–2371.
Beal, M.J., Jojic, N., Attias, H. (2003) A graphical model for audiovisual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7), 828–836.
Bettes, B.A. (1988) Maternal depression and motherese: Temporal and intonational features. Child Development, 1089–1096.
Biaggio, M.K., Godwin, W.H. (1987) Relation of depression to anger and hostility constructs. Psychological Reports 61(1), 87–90.
Bohnet, I., Frey, B.S. (1999) Social distance and other-regarding behavior in dictator games: Comment. The American Economic Review 89(1), 335–339.
Brossier, P., Bello, J.P., Plumbley, M.D. (2004) Real-time temporal segmentation of note objects in music signals. In: Proceedings of ICMC 2004, the 30th Annual International Computer Music Conference.
Burkhardt, F., Polzehl, T., Stegmann, J., Metze, F., Huber, R. (2009) Detecting real life anger. In: Acoustics, Speech and Signal Processing, 4761–4764.
Cannizzaro, M., Harel, B., Reilly, N., Chappell, P., Snyder, P.J. (2004) Voice acoustical measurement of the severity of major depression. Brain and Cognition 56(1), 30–35.
Cave, C., Guaïtella, I., Santi, S. (2002) Eyebrow movements and voice variations in dialogue situations: An experimental investigation. In: Seventh International Conference on Spoken Language Processing, Denver.
Chen, Y.-T., Hung, I.-C., Huang, M.-W., Hou, C.-J., Cheng, K.-S. (2011) Physiological signal analysis for patients with depression. In: Biomedical Engineering and Informatics (BMEI), 2011 4th International Conference, 2, 805–808.
Cheng, B., Wang, Z., Zhang, Z., Li, Z., Liu, D., Yang, J., Huang, S., Huang, T.S. (2017) Robust emotion recognition from low quality and low bit rate video: A deep learning approach. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), IEEE, San Antonio, Texas, 65–70.
Combrinck, H., Botha, E. (1996) On the mel-scaled cepstrum. Department of Electrical and Electronic Engineering, University of Pretoria.
Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., Quatieri, T.F. (2015) A review of depression and suicide risk assessment using speech analysis. Speech Communication 71, 10–49.
Darby, J.K., Simmons, N., Berger, P.A. (1984) Speech and voice parameters of depression: A pilot study. Journal of Communication Disorders 17(2), 75–85.
Darwin, C., Ciocca, V., Sandell, G. (1994) Effects of frequency and amplitude modulation on the pitch of a complex tone with a mistuned harmonic. The Journal of the Acoustical Society of America 95(5), 2631–2636.
Ellgring, H. (2007) Non-verbal Communication in Depression. Cambridge University Press.
Erdol, N., Castelluccia, C., Zilouchian, A. (1993) Recovery of missing speech packets using the short-time energy and zero-crossing measurements. IEEE Transactions on Speech and Audio Processing 1(3), 295–303.
Fletcher, H., Steinberg, J. (1924) The dependence of the loudness of a complex sound upon the energy in the various frequency regions of the sound. Physical Review 24(3), 306.
Fonzi, L., Matteucci, G., Bersani, G. (2010) Laughter and depression: Hypothesis of pathogenic and therapeutic correlation. Rivista di Psichiatria 45(1), 1–6.
France, D.J., Shiavi, R.G., Silverman, S., Silverman, M., Wilkes, M. (2000) Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Transactions on Biomedical Engineering 47(7), 829–837.
Geckinli, N., Yavuz, D. (1977) Algorithm for pitch extraction using zero-crossing interval sequence. IEEE Transactions on Acoustics, Speech, and Signal Processing 25(6), 559–564.
Ghosh, S., Chatterjee, M., Morency, L.-P. (2014) A multimodal context-based approach for distress assessment. In: Proceedings of the 16th International Conference on Multimodal Interaction, 240–246.
Gramming, P., Sundberg, J., Ternström, S., Leanderson, R., Perkins, W.H. (1988) Relationship between changes in voice pitch and loudness. Journal of Voice 2(2), 118–126.
Grossmann, T. (2010) The development of emotion perception in face and voice during infancy. Restorative Neurology and Neuroscience 28(2), 219–236.
Hall, J.A. (1990) Nonverbal Sex Differences: Accuracy of Communication and Expressive Style. Johns Hopkins University Press, Baltimore, MD.
Hamm, J., Kohler, C.G., Gur, R.C., Verma, R. (2011) Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders. Journal of Neuroscience Methods 200(2), 237–256.
He, L., Lech, M., Allen, N. (2010) On the importance of glottal flow spectral energy for the recognition of emotions in speech. In: 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2346–2349.
Heller, M., Haynal, V. (1997) Depression and suicide faces. In: What the Face Reveals (Oxford, 1997), 398–407.
Hermansky, H. (1990) Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America 87(4), 1738–1752.
Holmboe, E.S., Durning, S.J. (2014) Assessing clinical reasoning: Moving from in vitro to in vivo. Diagnosis 1(1), 111–117.
Holzapfel, A., Stylianou, Y., Gedik, A.C., Bozkurt, B. (2010) Three dimensions of pitched instrument onset detection. IEEE Transactions on Audio, Speech, and Language Processing 18(6), 1517–1527.
Huang, Z., Stasak, B., Dang, T., Wataraka Gamage, K., Le, P., Sethu, V., Epps, J. (2016a) Staircase regression in OA RVM, data selection and gender dependency in AVEC 2016. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 19–26.
Kim, J. (2007) Bimodal emotion recognition using speech and physiological changes. In: Robust Speech Recognition and Understanding, InTech, Vienna, Austria.
Kroenke, K., Strine, T.W., Spitzer, R.L., Williams, J.B., Berry, J.T., Mokdad, A.H. (2009) The PHQ-8 as a measure of current depression in the general population. Journal of Affective Disorders 114(1), 163–173.
Kuny, S., Stassen, H., et al. (1993) Speaking behavior and voice sound characteristics in depressive patients during recovery. Journal of Psychiatric Research 27(3), 289–307.
Leaper, C., Smith, T.E. (2004) A meta-analytic review of gender variations in children's language use: Talkativeness, affiliative speech, and assertive speech. Developmental Psychology 40(6), 993.
Mitchell, A.J., Vaze, A., Rao, S. (2009) Clinical diagnosis of depression in primary care: A meta-analysis. The Lancet 374(9690), 609–619.
Moore, J.D., Tian, L., Lai, C. (2014) Word-level emotion recognition using high-level features. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, Konya, Turkey, 17–31.
Morley, J.W., Rowe, M.J. (1990) Perceived pitch of vibrotactile stimuli: Effects of vibration amplitude, and implications for vibration frequency coding. The Journal of Physiology 431(1), 403–416.
Newman, S., Mather, V.G. (1938) Analysis of spoken language of patients with affective disorders. American Journal of Psychiatry 94(4), 913–942.
Nielsen, F.Å. (2011) A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv:1103.2903.
Nolen-Hoeksema, S. (1987) Sex differences in unipolar depression: Evidence and theory. Psychological Bulletin 101(2), 259.
Noll, A.M. (1967) Cepstrum pitch determination. The Journal of the Acoustical Society of America 41(2), 293–309.
Ooi, K.E.B., Low, L.-S.A., Lech, M., Allen, N. (2011) Prediction of clinical depression in adolescents using facial image analysis. In: WIAMIS 2011: 12th International Workshop on Image Analysis for Multimedia Interactive Services, Delft, The Netherlands, April 13–15, 2011.
Pampouchidou, A., Simantiraki, O., Fazlollahi, A., Pediaditis, M., Manousos, D., Roniotis, A., Giannakakis, G., Meriaudeau, F., Simos, P., Marias, K., et al. (2016a) Depression assessment by fusing high and low level features from audio, video, and text. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 27–34.
Pampouchidou, A., Simantiraki, O., Fazlollahi, A., Pediaditis, M., Manousos, D., Roniotis, A., Giannakakis, G., Meriaudeau, F., Simos, P., Marias, K., et al. (2016b) Depression assessment by fusing high and low level features from audio, video, and text. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 27–34.
Pollak, P., Sovka, P., Uhlir, J. (1995) Cepstral speech/pause detectors. In: Proc. IEEE Workshop on Nonlinear Signal and Image Processing, 388–391.
Rabiner, L. (1977) On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing 25(1), 24–33.
Rajni, D., Das, N.N. (2016) Emotion recognition from audio signal. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) 5(6).
Reynolds, D.A., Quatieri, T.F., Dunn, R.B. (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1–3), 19–41.
Ringeval, F., Schuller, B., Valstar, M., Jaiswal, S., Marchi, E., Lalanne, D., Cowie, R., Pantic, M. (2015) AVEC 2015: The first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 3–8.
Ringeval, F., Schuller, B., Valstar, M., Gratch, J., Cowie, R., Scherer, S., Mozgai, S., Cummins, N., Schmitt, M., Pantic, M. (2017) AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge.
Rivet, B., Wang, W., Naqvi, S.M., Chambers, J.A. (2014) Audiovisual speech source separation: An overview of key methodologies. IEEE Signal Processing Magazine 31(3), 125–134.
Röbel, A., Rodet, X. (2005) Efficient spectral envelope estimation and its application to pitch shifting and envelope preservation. In: International Conference on Digital Audio Effects, 30–35.
Saito, T. (1998) On the use of F0 features in automatic segmentation for speech synthesis. In: ICSLP.
Samareh, A., Jin, Y., Wang, Z., Chang, X., Huang, S. (2017) Predicting depression severity by multi-modal feature engineering and fusion. arXiv:1711.11155.
Samareh, A., Huang, S. (2018) DL-CHI: A dictionary learning-based contemporaneous health index for degenerative disease monitoring. EURASIP Journal on Advances in Signal Processing 2018(1), 17.
Samborski, R., Sierra, D., et al. (2011) Hybrid Wavelet-Fourier-HMM speaker recognition. International Journal of Hybrid Information Technology 4(4), 25–42.
Sato, S., Kitamura, T., Ando, Y. (2002) Loudness of sharply (2068 dB/octave) filtered noises in relation to the factors extracted from the autocorrelation function. Journal of Sound and Vibration 250(1), 47–52.
Schwartz, G.E., Fair, P.L., Salt, P., Mandel, M.R., Klerman, G.L. (1976) Facial muscle patterning to affective imagery in depressed and non-depressed subjects. Science 192(4238), 489–491.
Seither-Preisler, A., Krumbholz, K., Patterson, R., Seither, S., Lütkenhöner, B. (2004) Interaction between the neuromagnetic responses to sound energy onset and pitch onset suggests common generators. European Journal of Neuroscience 19(11), 3073–3080.
Shields, S.A. (2000) Thinking about gender, thinking about theory: Gender and emotional experience. Gender and Emotion: Social Psychological Perspectives, 3–23.
Shivappa, S.T., Trivedi, M.M., Rao, B.D. (2010) Audiovisual information fusion in human–computer interfaces and intelligent environments: A survey. Proceedings of the IEEE 98(10), 1692–1715.
Siegman, A.W., Boyle, S. (1993) Voices of fear and anxiety and sadness and depression: The effects of speech rate and loudness on fear and anxiety and sadness and depression. Journal of Abnormal Psychology 102(3), 430.
Sondhi, M. (1968) New methods of pitch extraction. IEEE Transactions on Audio and Electroacoustics 16(2), 262–266.
Stratou, G., Scherer, S., Gratch, J., Morency, L.-P. (2013) Automatic nonverbal behavior indicators of depression and PTSD: Exploring gender differences. In: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference, 147–152.
Stratou, G., Scherer, S., Gratch, J., Morency, L.-P. (2015) Automatic nonverbal behavior indicators of depression and PTSD: The effect of gender. Journal on Multimodal User Interfaces 9(1), 17–29.
Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P. (2003) Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences 43(6), 1947–1958.
Truong, K.P., Van Leeuwen, D.A. (2007) Automatic discrimination between laughter and speech. Speech Communication 49(2), 144–158.
Varadarajan, V.S., Hansen, J.H. (2006) Analysis of Lombard effect under different types and levels of noise with application to in-set speaker ID systems. In: INTERSPEECH.
Yang, L., Jiang, D., He, L., Pei, E., Oveneke, M.C., Sahli, H. (2016) Decision tree based depression classification from audio video and language information. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 89–96.
Yang, Y., Fairbairn, C., Cohn, J.F. (2013) Detecting depression severity from vocal prosody. IEEE Transactions on Affective Computing 4(2), 142–150.
Zahavy, T., Magnani, A., Krishnan, A., Mannor, S. (2016) Is a picture worth a thousand words? A deep multi-modal fusion architecture for product classification in e-commerce. arXiv:1611.09534.