
SCHOOL OF INFORMATION TECHNOLOGY AND ENGINEERING

B-TECH SOFTWARE ENGINEERING

FALL SEM 2019-2020

FINAL REPORT

GROUP MEMBERS:

1. BADAM MANISRINIVASA NARAYANA - 17BCE0606

2. CHRIS LAZER - 17BCE2160

COURSE : DATA VISUALIZATION

SLOT : G1

COURSE CODE : CSE3020

FACULTY : RAJ KUMAR R

SCHOOL : SCOPE

TITLE : Analysis of Suicide Statistics


CONTENTS:

- ABSTRACT
- KEYWORDS
- INTRODUCTION
- CODE AND RESULTS
- CONCLUSION

Abstract:
Suicide is one of the leading causes of death across the world. With data being
generated in huge quantities every second through media such as social
networking sites and surveys, a great deal of information relevant to suicide
analysis is now available. Data from social networking sites, especially Twitter,
has been studied extensively to automate suicide prediction using machine
learning and text mining techniques. Beyond social media analysis,
socio-economic and cultural factors have also been examined to identify what
drives people towards suicide. Much of the existing work focuses on social media
posts and surveys, while research on real-time data is still at an early stage.
This report elucidates the factors associated with suicidal ideation, reviews the
techniques and algorithms used to automate suicide prediction, and notes the
issues and challenges faced by existing research in order to outline the
requirements of future work.

Keywords: Suicide, Machine Learning, Social Networking Sites, Clinical Data, World Health Organization (WHO).

Introduction:
This project analyses suicide statistics for countries across the world,
applying machine learning algorithms together with data analysis and data
visualization to uncover interesting patterns, solutions and clues about suicides.
CODE AND RESULTS:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns

data = pd.read_csv('Desktop/Data_Visualization_project/who_suicide_statistics.csv')

data = data.sort_values(['year'], ascending = True)

print(data.shape)

data.head()

# correlation plot

f, ax = plt.subplots(figsize = (4, 3))

# corr() is only defined for numeric columns, so select them explicitly;
# np.bool is removed in recent NumPy, so a plain bool is used for the mask
corr = data.select_dtypes('number').corr()
sns.heatmap(corr, mask = np.zeros_like(corr, dtype = bool),
            cmap = sns.diverging_palette(3, 3, as_cmap = True), square = True, ax = ax)
# renaming the columns

data.rename({'sex' : 'gender', 'suicides_no' : 'suicides'}, inplace = True, axis = 1)

data.columns
# visualising the different countries distribution in the dataset

data['country'].value_counts(normalize = True)
data['country'].value_counts(dropna = False).plot.bar(color = 'cyan', figsize = (24, 8))

plt.title('Distribution of 141 countries in the suicide data')


plt.xlabel('country name')
plt.ylabel('count')
plt.show()
# visualising the different year distribution in the dataset

data['year'].value_counts(normalize = True)
data['year'].value_counts(dropna = False).plot.bar(color = 'magenta', figsize = (8, 6))

plt.title('Distribution of suicides from the year 1985 to 2016')


plt.xlabel('year')
plt.ylabel('count')
plt.show()
# suicides in different age groups

# the raw 'age' column holds string labels such as '5-14 years'; encode them
# as ordered integers 0-5 so the comparisons below work and the column can
# later be used as a numeric feature
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']
data['age'] = data['age'].map({label: i for i, label in enumerate(age_order)})

x1 = data[data['age'] == 0]['suicides'].sum()
x2 = data[data['age'] == 1]['suicides'].sum()
x3 = data[data['age'] == 2]['suicides'].sum()
x4 = data[data['age'] == 3]['suicides'].sum()
x5 = data[data['age'] == 4]['suicides'].sum()
x6 = data[data['age'] == 5]['suicides'].sum()

x = pd.DataFrame([x1, x2, x3, x4, x5, x6])
x.index = ['5-14', '15-24', '25-34', '35-54', '55-74', '75+']
x.plot(kind = 'bar', color = 'grey')

plt.title('suicides in different age groups')


plt.xlabel('Age Group')
plt.ylabel('count')
plt.show()

# visualising the gender distribution in the dataset

data['gender'].value_counts(normalize = True)
data['gender'].value_counts(dropna = False).plot.bar(color = 'black', figsize = (4, 3))

plt.title('Gender distribution in the suicide data')


plt.xlabel('gender')
plt.ylabel('count')
plt.show()

# total population of the 141 countries covered by the survey

data['population'].sum()

63761315943.0

# Average population

Avg_pop = data['population'].mean()
print(Avg_pop)
1664091.1353742562

# total number of suicides committed in the 141 countries from 1985 to 2016

data['suicides'].sum()

8026455.0

# Average suicide in the world

Avg_sui = data['suicides'].mean()
print(Avg_sui)
193.3153901734104

# Imputing the NaN values from the population column

data['population'] = data['population'].fillna(data['population'].median())
data['population'].isnull().any()
False

# Imputing the NaN values in the suicides column

data['suicides'] = data['suicides'].fillna(0)
data['suicides'].isnull().any()

False
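
As a side note, the same median imputation could be written with scikit-learn's
SimpleImputer, which stores the learned median for reuse on new data. This is an
alternative sketch, not part of the pipeline above.

from sklearn.impute import SimpleImputer

# alternative sketch: learn the median fill value once and reuse it later
imputer = SimpleImputer(strategy = 'median')
data[['population']] = imputer.fit_transform(data[['population']])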

# rearranging the columns

data = data[['country', 'year', 'gender', 'age', 'population', 'suicides']]

data.head()

# encoding gender numerically so it can be scaled and used as a model feature
data['gender'] = data['gender'].map({'female': 0, 'male': 1})

data = data.drop(['country'], axis = 1)

data.head()

#splitting the data into dependent and independent variables

x = data.iloc[:,:-1]
y = data.iloc[:,-1]

print(x.shape)
print(y.shape)
(43776, 4)
(43776,)

# splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 45)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(32832, 4)
(32832,)
(10944, 4)
(10944,)
# min max scaling

# importing the min max scaler


from sklearn.preprocessing import MinMaxScaler

# creating a scaler
mm = MinMaxScaler()

# scaling the independent variables


x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)


# using principal component analysis

from sklearn.decomposition import PCA

# creating a principal component analysis model


#pca = PCA(n_components = None)

# feeding the independent variables to the PCA model


#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)

# visualising the principal components that will explain the highest share of variance
#explained_variance = pca.explained_variance_ratio_
#print(explained_variance)

# creating a principal component analysis model


#pca = PCA(n_components = 1)

# feeding the independent variables to the PCA model


#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)
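
For completeness, a minimal runnable version of the commented-out PCA steps above
could look as follows. The names x_train_pca and x_test_pca are introduced here for
illustration, so that the clustering and regression steps below keep operating on
the unreduced features.

# inspecting the variance share of every principal component
pca = PCA(n_components = None)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)
print(pca.explained_variance_ratio_)

# keeping only the leading component once its variance share looks sufficient
pca = PCA(n_components = 1)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)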

# applying k means clustering

# selecting the best choice for no. of clusters


from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300,
                n_init = 10, random_state = 0)
    km.fit(x_train)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('no. of clusters')
plt.ylabel('WCSS')
plt.show()
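
As an optional aside, the visual elbow check can be approximated in code by finding
where the relative drop in WCSS levels off. The 10% threshold below is an
illustrative assumption, not part of the original analysis.

# relative improvement in WCSS when moving from k to k + 1 clusters
drops = [(wcss[i - 1] - wcss[i]) / wcss[i - 1] for i in range(1, len(wcss))]
# first k whose next split improves WCSS by less than the assumed 10%
elbow_k = next((i + 1 for i, d in enumerate(drops) if d < 0.10), len(wcss))
print('suggested number of clusters:', elbow_k)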

# applying kmeans with 4 clusters

km = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300,
            n_init = 10, random_state = 0)
y_means = km.fit_predict(x_train)

# visualising the clusters

plt.scatter(x_train[y_means == 0, 0], x_train[y_means == 0, 1], s = 100, c = 'pink', label = 'cluster 1')
plt.scatter(x_train[y_means == 1, 0], x_train[y_means == 1, 1], s = 100, c = 'cyan', label = 'cluster 2')
plt.scatter(x_train[y_means == 2, 0], x_train[y_means == 2, 1], s = 100, c = 'magenta', label = 'cluster 3')
plt.scatter(x_train[y_means == 3, 0], x_train[y_means == 3, 1], s = 100, c = 'violet', label = 'cluster 4')

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 100, c = 'red', label = 'centroids')

# the first two scaled features are year and gender
plt.title('Clusters of suicide records')
plt.xlabel('year (scaled)')
plt.ylabel('gender (scaled)')
plt.legend()
plt.show()

from sklearn.linear_model import LinearRegression


from sklearn.metrics import r2_score

# creating the model


model = LinearRegression()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.svm import SVR


# creating the model
model = SVR()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.ensemble import RandomForestRegressor

# creating the model


model = RandomForestRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

MSE : 117021.75686150225
RMSE : 342.08442943446323
r2_score : 0.8003564233332043

from sklearn.tree import DecisionTreeRegressor

# creating the model


model = DecisionTreeRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

MSE : 151779.80178841596
RMSE : 389.5892731947531
r2_score : 0.7410578741295126

from sklearn.ensemble import AdaBoostRegressor

# creating the model


model = AdaBoostRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)
#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.neural_network import MLPRegressor

# creating the model (the target is a continuous suicide count and the metrics
# below are regression metrics, so a regressor is the consistent choice here)


model = MLPRegressor(hidden_layer_sizes = (100,), max_iter = 50)

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

CONCLUSION:

This report surveyed various approaches and media used in suicide analysis, such as
WHO-based suicide detection systems, analysis of census data, and various blogs and
surveys. Although these models achieve reasonably good accuracy, a more robust
system is still needed to automate suicide prediction with high accuracy and
precision while performing well in a real-time environment.

