
SCHOOL OF INFORMATION TECHNOLOGY AND ENGINEERING

B-TECH SOFTWARE ENGINEERING

FALL SEM 2019-2020

FINAL REPORT

GROUP MEMBERS:

1. BADAM MANISRINIVASA NARAYANA - 17BCE0606

2. CHRIS LAZER - 17BCE2160

COURSE : DATA VISUALIZATION

SLOT : G1

COURSE CODE : CSE3020

FACULTY : RAJ KUMAR R

SCHOOL : SCOPE

TITLE : Analysis of Suicide Statistics


CONTENTS:

- ABSTRACT
- KEYWORDS
- INTRODUCTION
- CODE AND RESULTS
- CONCLUSION

Abstract:
Suicide is one of the leading causes of death across the world. With data being
generated in huge quantities every second through media such as social
networking sites and surveys, a great deal of information relevant to suicide
analysis is now available. Data from social networking sites, especially Twitter,
has been studied extensively to automate suicide prediction using machine
learning and text mining techniques. Beyond social media analysis,
socio-economic and cultural factors have also been examined to identify what
drives people towards suicide. Much of the existing work focuses on social media
posts and surveys, while research on real-time data is still at an early stage.
This report elucidates the factors associated with suicidal ideation, reviews the
techniques and algorithms used to automate suicide prediction, and notes the
issues and challenges faced by existing research in order to outline the
requirements of future work.

Keywords: Suicide, Machine Learning, Social Networking Sites, Clinical Data, World Health Organization (WHO).

Introduction:
This project analyses suicide statistics for countries across the world,
applying machine learning algorithms together with data analysis and data
visualization to uncover interesting patterns, solutions and clues about suicides.
CODE AND RESULTS:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns

data = pd.read_csv('Desktop/Data_Visualization_project/who_suicide_statistics.csv')

data = data.sort_values(['year'], ascending = True)

print(data.shape)

data.head()

# correlation plot

f, ax = plt.subplots(figsize = (4, 3))

# corr() is only defined for numeric columns, so select them explicitly;
# np.bool is removed in recent NumPy, so a plain bool is used for the mask
corr = data.select_dtypes('number').corr()
sns.heatmap(corr, mask = np.zeros_like(corr, dtype = bool),
            cmap = sns.diverging_palette(3, 3, as_cmap = True), square = True, ax = ax)
# renaming the columns

data.rename({'sex' : 'gender', 'suicides_no' : 'suicides'}, inplace = True, axis = 1)

data.columns
# visualising the different countries distribution in the dataset

data['country'].value_counts(normalize = True)
data['country'].value_counts(dropna = False).plot.bar(color = 'cyan', figsize = (24, 8))

plt.title('Distribution of 141 countries in the suicide data')


plt.xlabel('country name')
plt.ylabel('count')
plt.show()
# visualising the different year distribution in the dataset

data['year'].value_counts(normalize = True)
data['year'].value_counts(dropna = False).plot.bar(color = 'magenta', figsize = (8, 6))

plt.title('Distribution of suicides from the year 1985 to 2016')


plt.xlabel('year')
plt.ylabel('count')
plt.show()
# suicides in different age groups

# the raw 'age' column holds string labels such as '5-14 years'; encode them
# as ordered integers 0-5 so the comparisons below work and the column can
# later be used as a numeric feature
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']
data['age'] = data['age'].map({label: i for i, label in enumerate(age_order)})

x1 = data[data['age'] == 0]['suicides'].sum()
x2 = data[data['age'] == 1]['suicides'].sum()
x3 = data[data['age'] == 2]['suicides'].sum()
x4 = data[data['age'] == 3]['suicides'].sum()
x5 = data[data['age'] == 4]['suicides'].sum()
x6 = data[data['age'] == 5]['suicides'].sum()

x = pd.DataFrame([x1, x2, x3, x4, x5, x6])
x.index = ['5-14', '15-24', '25-34', '35-54', '55-74', '75+']
x.plot(kind = 'bar', color = 'grey')

plt.title('suicides in different age groups')


plt.xlabel('Age Group')
plt.ylabel('count')
plt.show()

# visualising the gender distribution in the dataset

data['gender'].value_counts(normalize = True)
data['gender'].value_counts(dropna = False).plot.bar(color = 'black', figsize = (4, 3))

plt.title('Gender distribution in the suicide data')


plt.xlabel('gender')
plt.ylabel('count')
plt.show()

# total population of the 141 countries covered by the survey

data['population'].sum()

63761315943.0

# Average population

Avg_pop = data['population'].mean()
print(Avg_pop)
1664091.1353742562

# total number of suicides committed in the 141 countries from 1985 to 2016

data['suicides'].sum()

8026455.0

# Average suicide in the world

Avg_sui = data['suicides'].mean()
print(Avg_sui)
193.3153901734104

# Imputing the NaN values from the population column

data['population'] = data['population'].fillna(data['population'].median())
data['population'].isnull().any()
False

# Imputing the NaN values in the suicides column

data['suicides'] = data['suicides'].fillna(0)
data['suicides'].isnull().any()

False
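
As a side note, the same median imputation could be written with scikit-learn's
SimpleImputer, which stores the learned median for reuse on new data. This is an
alternative sketch, not part of the pipeline above.

from sklearn.impute import SimpleImputer

# alternative sketch: learn the median fill value once and reuse it later
imputer = SimpleImputer(strategy = 'median')
data[['population']] = imputer.fit_transform(data[['population']])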

# rearranging the columns

data = data[['country', 'year', 'gender', 'age', 'population', 'suicides']]

data.head()

# encoding gender numerically so it can be scaled and used as a model feature
data['gender'] = data['gender'].map({'female': 0, 'male': 1})

data = data.drop(['country'], axis = 1)

data.head()

#splitting the data into dependent and independent variables

x = data.iloc[:,:-1]
y = data.iloc[:,-1]

print(x.shape)
print(y.shape)
(43776, 4)
(43776,)

# splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 45)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(32832, 4)
(32832,)
(10944, 4)
(10944,)
# min max scaling

# importing the min max scaler


from sklearn.preprocessing import MinMaxScaler

# creating a scaler
mm = MinMaxScaler()

# scaling the independent variables


x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)


# using principal component analysis

from sklearn.decomposition import PCA

# creating a principal component analysis model


#pca = PCA(n_components = None)

# feeding the independent variables to the PCA model


#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)

# visualising the principal components that will explain the highest share of variance
#explained_variance = pca.explained_variance_ratio_
#print(explained_variance)

# creating a principal component analysis model


#pca = PCA(n_components = 1)

# feeding the independent variables to the PCA model


#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)
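
For completeness, a minimal runnable version of the commented-out PCA steps above
could look as follows. The names x_train_pca and x_test_pca are introduced here for
illustration, so that the clustering and regression steps below keep operating on
the unreduced features.

# inspecting the variance share of every principal component
pca = PCA(n_components = None)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)
print(pca.explained_variance_ratio_)

# keeping only the leading component once its variance share looks sufficient
pca = PCA(n_components = 1)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)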

# applying k means clustering

# selecting the best choice for no. of clusters


from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300,
                n_init = 10, random_state = 0)
    km.fit(x_train)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('no. of clusters')
plt.ylabel('WCSS')
plt.show()
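
As an optional aside, the visual elbow check can be approximated in code by finding
where the relative drop in WCSS levels off. The 10% threshold below is an
illustrative assumption, not part of the original analysis.

# relative improvement in WCSS when moving from k to k + 1 clusters
drops = [(wcss[i - 1] - wcss[i]) / wcss[i - 1] for i in range(1, len(wcss))]
# first k whose next split improves WCSS by less than the assumed 10%
elbow_k = next((i + 1 for i, d in enumerate(drops) if d < 0.10), len(wcss))
print('suggested number of clusters:', elbow_k)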

# applying kmeans with 4 clusters

km = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300,
            n_init = 10, random_state = 0)
y_means = km.fit_predict(x_train)

# visualising the clusters

plt.scatter(x_train[y_means == 0, 0], x_train[y_means == 0, 1], s = 100, c = 'pink', label = 'cluster 1')
plt.scatter(x_train[y_means == 1, 0], x_train[y_means == 1, 1], s = 100, c = 'cyan', label = 'cluster 2')
plt.scatter(x_train[y_means == 2, 0], x_train[y_means == 2, 1], s = 100, c = 'magenta', label = 'cluster 3')
plt.scatter(x_train[y_means == 3, 0], x_train[y_means == 3, 1], s = 100, c = 'violet', label = 'cluster 4')

plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 100, c = 'red', label = 'centroids')

# the first two scaled features are year and gender
plt.title('Clusters of suicide records')
plt.xlabel('year (scaled)')
plt.ylabel('gender (scaled)')
plt.legend()
plt.show()

from sklearn.linear_model import LinearRegression


from sklearn.metrics import r2_score

# creating the model


model = LinearRegression()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.svm import SVR


# creating the model
model = SVR()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.ensemble import RandomForestRegressor

# creating the model


model = RandomForestRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

MSE : 117021.75686150225
RMSE : 342.08442943446323
r2_score : 0.8003564233332043

from sklearn.tree import DecisionTreeRegressor

# creating the model


model = DecisionTreeRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

MSE : 151779.80178841596
RMSE : 389.5892731947531
r2_score : 0.7410578741295126

from sklearn.ensemble import AdaBoostRegressor

# creating the model


model = AdaBoostRegressor()

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)
#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

from sklearn.neural_network import MLPRegressor

# creating the model (the target is a continuous suicide count and the metrics
# below are regression metrics, so a regressor is the consistent choice here)


model = MLPRegressor(hidden_layer_sizes = (100,), max_iter = 50)

# feeding the training data into the model


model.fit(x_train, y_train)

# predicting the test set results


y_pred = model.predict(x_test)

# calculating the mean squared error


mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error


rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score


r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)

CONCLUSION:

This report surveyed various approaches and media used in suicide analysis, such as
WHO-based suicide detection systems, analysis of census data, and various blogs and
surveys. Although these models achieve reasonably good accuracy, a more robust
system is still needed to automate suicide prediction with high accuracy and
precision while performing well in a real-time environment.

