
Suicide Detection with Natural Language Processing:

Applied at National Taiwan University

Text Mining Project

R07722013 王俊凱
B06303134 蔡旻頤
B06109022 呂紀廷
Contents

1 Introduction
2 Data
3 Exploratory Data Analysis
   3.1 Word-Level Analysis
   3.2 Length of Posts
   3.3 Text Frequency
   3.4 Word Cloud
   3.5 Word Network Graph
4 Experiment and Result
5 Discussion
   5.1 Keywords to Differentiate Two Kinds of Articles
   5.2 Causes of Suicidal Intention
6 Application
7 Conclusion and Prospect
References

1 Introduction
Over the past few years, mental health has become a serious concern. Everyone, including celebrities, experiences depression from time to time, and in today's high-tech society people tend to express their feelings on social media instead of seeing psychiatrists. In addition, students at NTU are burdened by endless homework and exams. Worse still, tragedies have happened repeatedly on the NTU campus recently, with a severe impact on teachers and students alike, and many people have begun to post negative articles on the NTU student forum on Facebook. We therefore want to use the text mining methods learned in this course to predict whether someone has thoughts of suicide or severe depression. After building the model, we construct a dynamic mechanism, i.e., an alarm system, which returns a warning signal when it detects a suicidal post, in order to identify risk and prevent tragedies beforehand.
First, we use exploratory data analysis (EDA) to explore whether there is a pattern behind suicidal posts in terms of post length, most-used words, and so on. Second, we test whether we can predict which articles carry suicidal intent, using machine learning with different embeddings and models.
This paper proceeds in six sections after the introduction. Section 2 provides the background of the data and the methods used to obtain it. Section 3 analyses the datasets with exploratory data analysis (EDA). Section 4 presents the experiments, in which we compare different embeddings and models, and reports their performance. Section 5 discusses the most important keywords for distinguishing suicidal posts from non-suicidal ones and explores the possible causes of suicidal thoughts. In Section 6, we use the model from the previous section to develop an alarm system that identifies whether a post expresses suicidal intent. Section 7 concludes and outlines prospects for future work.

2 Data
The data for this study comes from various sources and consists of two distinct datasets. The first dataset contains posts from Reddit. Specifically, we scrape r/SuicideWatch via Reddit's API, which only allows us to obtain approximately 1000 posts. We combine these scraped posts with an existing dataset of r/SuicideWatch articles available on GitHub to construct the suicidal-post dataset.
The second dataset, the non-suicidal data, comes from numerous different subreddits, including r/AskReddit, r/BusinessFinance, r/ScienceTechnology, r/Photography, r/AMA, r/Food, r/Sports, r/PolicyEconomy, and others, all of which are non-suicide-related. We include such diverse data because the subjects of non-suicidal articles are countless. After collecting the data, we mix all of it and randomly select 2104 posts, the same sample size as the first dataset.
After cleaning (e.g., image removal), we have two variables: a binary variable indicating suicidal intent (1 for posts with suicidal thoughts and 0 otherwise) and a character-type variable containing the post content. We label the binary variable 1 if the post comes from r/SuicideWatch and 0 otherwise. With 2104 observations each for suicidal and non-suicidal articles, our final dataset is balanced, with 4208 observations in total.
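The labeling and balancing step above can be sketched as follows. This is an illustration rather than the project's actual code: the helper name `build_dataset` and the toy posts are our own inventions, and the real pipeline scrapes posts via Reddit's API before this step.

```python
import random

def build_dataset(suicidal_posts, non_suicidal_posts, seed=42):
    """Label posts (1 = suicidal, 0 = non-suicidal) and balance the classes
    by sampling as many non-suicidal posts as there are suicidal ones."""
    rng = random.Random(seed)
    sampled = rng.sample(non_suicidal_posts, len(suicidal_posts))
    data = [(text, 1) for text in suicidal_posts] + [(text, 0) for text in sampled]
    rng.shuffle(data)  # mix the two classes before any train/test split
    return data

# Toy example: two suicidal posts, three candidate non-suicidal posts
posts_pos = ["I can't go on anymore", "Nothing matters now"]
posts_neg = ["Best camera for travel?", "Stock market news", "Favorite recipes?"]
dataset = build_dataset(posts_pos, posts_neg)
```

With the real data, `posts_pos` holds the 2104 r/SuicideWatch articles and `posts_neg` the pooled posts from the other subreddits, yielding the balanced 4208-observation dataset.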

3 Exploratory Data Analysis

3.1 Word-Level Analysis


We analysed the word frequencies in posts. We can see from Figure 1 that:
Many Similar Words: Terms like just, like, people, want, will appear in both “Most Used Words” plots. These similarities may make it harder for our model to identify suicidal posts, but they reflect the reality that the same words are used in posts with totally different meanings, a challenge that text mining techniques can address.
Words Referring to Suicide: Words like die, fucking, end and suicide are used frequently in suicidal posts. Other sentiment-laden words like feel, want and think appear more than twice as often in suicidal posts as in non-suicidal ones, as do terms like anymore, everything, nothing and someone.
Aspects: It is interesting that the top words in suicidal posts, e.g., life, friends and family, all relate to interpersonal relationships, whereas the prominent aspects in non-suicidal posts, e.g., government, money and things, concern quite different topics.

Figure 1: Most Used Words (left: Suicidal Posts; right: Non-Suicidal Posts)
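The word counts behind Figure 1 can be reproduced with a simple tokenizer and counter. This is a minimal sketch, not the project's actual code; the function name and sample posts are illustrative.

```python
import re
from collections import Counter

def most_used_words(posts, top_k=5):
    """Tokenize posts into lowercase words and return the top_k most frequent."""
    counts = Counter()
    for post in posts:
        counts.update(re.findall(r"[a-z']+", post.lower()))
    return counts.most_common(top_k)

# Tiny illustrative corpus of suicidal-style posts
suicidal = ["I just want it to end",
            "I feel like nothing matters anymore",
            "people just do not understand how I feel"]
top = most_used_words(suicidal, top_k=3)
```

Running the same function separately on the suicidal and non-suicidal corpora yields the two bar plots compared in Figure 1.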

3.2 Length of Posts
In Figure 2-1, we construct a scatterplot of document_id (X) against post length (Y). We can see that non-suicidal posts are longer than suicidal ones. When the lengths are sorted (Figure 2-2), the longest non-suicidal post has more than 6000 words, while the longest suicidal article has only about 3000 words.
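The sorted length curve in Figure 2-2 comes from a straightforward word count per post; a minimal sketch (the function name and inputs are illustrative, not the project's code):

```python
def post_lengths(posts):
    """Word count of each post, sorted in descending order (as in Figure 2-2)."""
    return sorted((len(post.split()) for post in posts), reverse=True)

lengths = post_lengths(["a b c", "a b", "a b c d e"])  # -> [5, 3, 2]
```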

Figure 2-1: Length of Posts

Figure 2-2: Length of Posts (sorted)

3.3 Text Frequency
Figure 3 plots the frequencies of words that appear in both non-suicidal and suicidal posts, with a 45-degree line to highlight relative frequency. Below the 45-degree line are words that appear more often in suicidal articles, such as kill, shit, fuck, hate, suicide, fucking and die, which is quite intuitive. Terms like also, good, many and work, in the top-left part of the plot, occur more often in non-suicidal posts.

Figure 3: Scatterplot Displaying Text Frequencies
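The comparison behind Figure 3 can be sketched as follows: compute each word's relative frequency in both corpora, then flag the words that fall below the 45-degree line. The helper names and toy corpora are our own illustrations, not the project's actual code.

```python
import re
from collections import Counter

def relative_freqs(posts):
    """Relative frequency of each word within a corpus."""
    counts = Counter()
    for post in posts:
        counts.update(re.findall(r"[a-z']+", post.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def below_diagonal(suicidal_posts, other_posts):
    """Shared words that are relatively more frequent in suicidal posts,
    i.e. the points below the 45-degree line in Figure 3."""
    f_s = relative_freqs(suicidal_posts)
    f_o = relative_freqs(other_posts)
    shared = set(f_s) & set(f_o)
    return sorted(w for w in shared if f_s[w] > f_o[w])

# Toy corpora: each has 8 tokens, so the proportions are directly comparable
suicidal = ["i hate my life", "i want to die"]
others = ["i like my job", "the economy is good"]
below = below_diagonal(suicidal, others)
```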

3.4 Word Cloud

The word clouds for suicidal and non-suicidal articles are shown in Figure 4. Two things stand out:
Word Range: The most frequently used words in suicidal posts concentrate on life, e.g., life, live, die, day, friends and family; those in non-suicidal posts are more diverse, e.g., india, world, government, company and economy.
Obvious Difference: Aside from suicide-related words such as suicide, kill and end, words like anymore, never, nothing, help and still also occur regularly in suicidal posts but do not appear in the counterpart plot.

Figure 4: Word Clouds (left: Suicidal Posts; right: Non-Suicidal Posts)

3.5 Word Network Graph

Next, we construct word networks for the two types of posts (Figure 5):
Different Cores: In suicidal posts, the words are centered around the word “feel”; people with depressive thoughts tend to express their feelings. In non-suicidal posts, by contrast, there is no evident word for the network to revolve around: the most frequently used words are linked to one another fairly evenly.
Hot Terms: In the suicidal-post network, “mental health” appears in the left-center section and “could wish” in the bottom-right section, both popular terms in discussions of hypochondria.
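The edges of such a network are typically adjacent word pairs (bigrams) counted across the corpus; frequent pairs become the links drawn in Figure 5. A minimal sketch, with illustrative names and posts rather than the project's actual code:

```python
import re
from collections import Counter

def bigram_counts(posts):
    """Count adjacent word pairs; frequent pairs become edges in the word network."""
    pairs = Counter()
    for post in posts:
        words = re.findall(r"[a-z']+", post.lower())
        pairs.update(zip(words, words[1:]))  # consecutive word pairs
    return pairs

edges = bigram_counts(["i feel so alone", "i feel empty", "mental health matters"])
```

A hub word like "feel" then shows up as the node participating in the most frequent edges.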

Figure 5: Network Graph (left: Suicidal Posts; right: Non-Suicidal Posts)

4 Experiment and Result

We design our experiment to compare two vectorization methods (TF-IDF and Word2vec) and several models (Naive-Bayes, Support Vector Machine, Random Forest, Xgboost and BERT). The metrics we chose include Area Under the Curve (AUC), accuracy, precision, recall, F1-score and runtime. Since misclassifying a suicidal post as non-suicidal is more serious than the reverse, we focus more on recall than on precision.
The results are shown in Table 1 and Figure 6. After hyperparameter tuning, SVM has the highest AUC and the best scores on all other metrics when using TF-IDF vectorization; Naive-Bayes performs the worst, with an AUC of 0.76. When using Word2vec vectorization, Xgboost has the highest AUC and the best scores on all other metrics, while SVM is the most expensive model to train. Again, Naive-Bayes performs the worst, with an AUC of 0.56, close to random guessing. BERT reaches an AUC of 0.99, the same as SVM with TF-IDF, but its training time and runtime are far longer.
TF-IDF likely outperforms Word2vec because, in this task, the presence of certain critical words (e.g., suicide, die and kill) matters more than the relationships between words. Finally, we chose SVM as our final model, since it has the highest AUC with an acceptable runtime.
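To make the critical-word intuition concrete, here is a simplified TF-IDF computation in plain Python. It uses the textbook weighting tf × log(N/df); this is a sketch of the idea, not the exact (smoothed, normalized) variant used by common libraries, and the toy corpus is our own.

```python
import math
from collections import Counter

def tfidf(corpus):
    """TF-IDF weights: term frequency scaled by log(N / document frequency).
    Words concentrated in few documents (e.g. 'die') score highest;
    words present in every document get weight log(1) = 0."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return weights

corpus = ["i want to die", "life is hard", "what is your favorite food"]
w = tfidf(corpus)
```

Because "die" appears in only one document, its weight in that document is high, which is exactly the signal the SVM exploits.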

Table 1: Model Performance

Embedding   Model          AUC    Precision  Recall  F1-score  Accuracy  Training Time (s)
TF-IDF      Naïve-Bayes    0.76   0.7717     0.7577  0.7545    0.7577    2.8558
TF-IDF      SVM            0.99   0.955      0.9549  0.9549    0.9549    1282.032
TF-IDF      Random Forest  0.97   0.9027     0.9026  0.9026    0.9026    6.7506
TF-IDF      Xgboost        0.98   0.9289     0.9287  0.9287    0.9287    83.6787
Word2Vec    Naïve-Bayes    0.56   0.6646     0.5606  0.4792    0.5606    0.0153
Word2Vec    SVM            0.96   0.9129     0.9121  0.9121    0.9121    225.8384
Word2Vec    Random Forest  0.96   0.9074     0.905   0.9049    0.905     3.8493
Word2Vec    Xgboost        0.98   0.9409     0.9406  0.9406    0.9406    3.5317
—           BERT           0.99   0.9504     0.9501  0.9501    0.9501    > 3600

Figure 6: ROC Curve

5 Discussion

5.1 Keywords to Differentiate Two Kinds of Articles


Next, we analyze a few questions of interest. First, we would like to know which keywords differentiate suicidal-intention articles from non-suicidal ones. To answer this, we use the feature importance computed by Random Forest. As shown in Figure 7, where larger words indicate greater importance, words like “anymore”, “suicide”, “life”, “die”, “feel” and “comment” are strong indicators of suicidal intention. Interestingly, words like “http”, “https”, “www” and “com” also differentiate the two kinds of articles: in the training data, 606 non-suicidal articles contain links but only 18 suicidal articles do, meaning that posts with links attached are mostly non-suicidal.
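The link-count check can be reproduced with a simple URL pattern. This is an illustrative sketch, not the project's actual code; the regex and sample posts are our own.

```python
import re

# Matches common URL fragments: http://, https://, or www.
LINK_PATTERN = re.compile(r"https?://|www\.")

def count_posts_with_links(posts):
    """Number of posts containing at least one URL fragment."""
    return sum(1 for post in posts if LINK_PATTERN.search(post))

non_suicidal = ["check https://example.com for details",
                "see www.reddit.com for the thread",
                "no link here"]
suicidal = ["i cant take it anymore"]
```

Applied to the training data, this kind of count yields the 606 vs. 18 split reported above.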

Figure 7: Keywords to Differentiate Articles

5.2 Causes of Suicidal Intention
The second question we are interested in is the possible causes of suicidal intention. Here, we use Latent Dirichlet Allocation (LDA) to extract latent topics (shown in Table 2). We find that the articles in the training data can be roughly classified into five topics: peer, love, work/school/parents, family and others.

Table 2: Possible Causes of Suicidal Thoughts

Topic No.  Topic                Top 20 Words Per Topic
1          Peer                 friends, going, anymore, go, much, im, never, one, think, get, really, even, time, would, people, know, life, feel, want, like
2          Love                 que, eu, without, love, ass, sorry, excuse, sweet, dear, berry, even, angelic, single, bitch, become, real, beloved, little, life, blanc
3          Work/School/Parents  make, shit, could, job, time, since, day, started, never, every, school, amp, one, years, mom, even, life, got, dad, get
4          Others               never, hes, need, suicidal, live, go, please, going, end, someone, even, help, get, die, life, like, see, suicide, people, want
5          Family               help, live, really, think, die, time, anymore, much, would, fucking, go, going, one, get, know, life, even, want, feel, like, family

6 Application

The applications of our model are broad; one of them is an alarm system to prevent further tragedies. The system automatically web-crawls posts from Facebook groups daily and detects whether they carry suicidal intent. A flow chart of the application is shown in Figure 8. First, we obtain posts from Facebook via web-crawling, then translate them into English using Google's translation API, perform data preprocessing including stopword and punctuation removal, and finally decide whether the posts are suicide-related using the SVM model.
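The preprocessing step (stopword and punctuation removal) can be sketched as below. This is an illustrative snippet, not the project's actual code: the tiny stopword set is our own placeholder for whatever stopword list the pipeline uses, and crawling, translation and classification are omitted.

```python
import string

# Tiny illustrative stopword list; a real pipeline would use a fuller one
STOPWORDS = {"the", "a", "an", "is", "are", "i", "to", "and", "of"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords before classification."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)

cleaned = preprocess("I don't want to be here anymore!")
```

The cleaned string is then vectorized with TF-IDF and passed to the trained SVM for the final suicidal/non-suicidal decision.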
The outcome of the application is shown in Figure 9. We web-crawl posts from the NTU student forum, PTT and DCard, manually label them (1 for suicidal thoughts, 0 otherwise) as Label, and take the alarm system's output as Model Prediction. Only two posts are misclassified, because those non-suicidal posts contain several terms previously identified as suicide-related.

Figure 8: Flow Chart

Figure 9: Application Result

7 Conclusion and Prospect
We find distinctly different patterns between suicidal and non-suicidal posts. For suicidal posts, the most used words include terms that rarely appear in non-suicidal articles, e.g., suicide and die. Posts with suicidal intention are much shorter than non-suicidal ones, and the words used in suicidal articles center on the word “feel”. Among the different embeddings and models, we choose the SVM model, which achieves 95% accuracy, to develop an alarm system. We apply this system to posts from various social media platforms and achieve roughly 86% accuracy.
Looking ahead, we hope to collaborate with medical facilities. Such collaboration would give us access to medical records written in Chinese, so we could retrain the model and avoid the errors introduced by translating posts into English. Another benefit is that first-hand medical records, e.g., patients' psychiatric and medical histories, would let us add a new weighted variable to the model to improve its accuracy and predictive power. Aside from model optimization, we anticipate applying the alarm system to more social platforms. Nowadays, the main outlet for expressing negative emotions is social media; if we can cooperate with Facebook, Instagram and Reddit, our model can reach more users, stopping more imminent tragedies and making the world a less sad place.

References
Cook, Benjamin L., et al. "Novel use of natural language processing (NLP) to predict suicidal
ideation and psychiatric symptoms in a text-based mental health intervention in Madrid."
Computational and mathematical methods in medicine 2016 (2016).

Fernandes, Andrea C., et al. "Identifying suicide ideation and suicidal attempts in a
psychiatric clinical research database using natural language processing." Scientific reports
8.1 (2018): 1-10.

Carson, Nicholas J., et al. "Identification of suicidal behavior among psychiatrically hospitalized adolescents using natural language processing and machine learning of electronic health records." PloS one 14.2 (2019): e0211116.

Metzger, Marie-Hélène, et al. "Use of emergency department electronic medical records for
automated epidemiological surveillance of suicide attempts: a French pilot study."
International journal of methods in psychiatric research 26.2 (2017): e1522.

Obeid, Jihad S., et al. "Identifying and Predicting intentional self-harm in electronic health
record clinical notes: Deep learning approach." JMIR medical informatics 8.7 (2020):
e17784.
