

With big data growth in biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care, and community services. However, analysis accuracy is reduced when medical data are incomplete. Moreover, different regions exhibit unique characteristics of certain regional diseases, which may weaken the prediction of disease outbreaks. In this paper, we streamline machine learning algorithms for effective prediction of chronic disease outbreaks in disease-frequent communities. We experiment with the modified prediction models on real-life hospital data collected from central China in 2013-2015. To overcome the difficulty of incomplete data, we use a latent factor model to reconstruct the missing data. We experiment on a regional chronic disease, cerebral infarction. To the best of our knowledge, none of the existing work has focused on both data types in the area of medical big data analytics. Compared with several typical prediction algorithms, the prediction accuracy of our proposed algorithm reaches 94.8%, with a convergence speed faster than that of the CNN-based unimodal disease risk prediction algorithm.
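The latent factor reconstruction of missing data mentioned above can be sketched as a low-rank matrix factorization fitted on the observed entries only. This is a minimal NumPy sketch under assumptions of ours: the rank `k`, learning rate, regularization, and the toy rank-1 matrix are all illustrative, not the paper's actual configuration.

```python
import numpy as np

def latent_factor_impute(R, mask, k=2, steps=2000, lr=0.05, reg=0.001, seed=0):
    """Fill missing entries of R (where mask is False) assuming R ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    for _ in range(steps):
        E = mask * (R - U @ V.T)           # residual on observed cells only
        U = U + lr * (E @ V - reg * U)     # gradient step on the user factors
        V = V + lr * (E.T @ U - reg * V)   # gradient step on the item factors
    return np.where(mask, R, U @ V.T)      # keep observed values, fill the gaps

# Toy example: a rank-1 table with one cell treated as missing at (2, 1)
R = np.outer([0.5, 1.0, 1.5], [0.4, 0.8, 1.2])
mask = np.ones_like(R, dtype=bool)
mask[2, 1] = False                         # this entry is never seen in training
imputed = latent_factor_impute(R, mask)
```

Because the residual is masked, the hidden cell never influences training; its reconstructed value comes entirely from the learned low-rank structure.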
Existing system

 However, existing work has mostly considered structured data.

 For unstructured data, for example, using convolutional neural networks to extract text characteristics automatically has already attracted wide attention and achieved very good results.

 Furthermore, there are large differences between diseases in different regions, primarily because of the diverse climate and living habits in each region.

 Thus, for risk classification based on big data analysis, the following challenges remain: How should the missing data be addressed? How should the main chronic diseases in a certain region, and the main characteristics of the disease in that region, be determined? How can big data analysis technology be used to analyze the disease and create a better model?

 According to a report by McKinsey, 50% of Americans have one or more chronic diseases, and 80% of American medical care fees are spent on chronic disease treatment.

 With the improvement of living standards, the incidence of chronic disease is increasing.

 The United States has spent an average of 2.7 trillion USD annually on chronic disease treatment. This amount comprises 18% of the entire annual GDP of the United States. The healthcare problem of chronic diseases is also very important in many other countries.

 In China, chronic diseases are the main cause of death: according to a 2015 Chinese report on nutrition and chronic diseases, 86.6% of deaths are caused by chronic diseases.

 Therefore, it is essential to perform risk assessments for chronic diseases. With the growth in medical data, collecting electronic health records (EHR) is increasingly convenient.
Proposed System

 A healthcare system using smart clothing is proposed for sustainable health monitoring. Heterogeneous systems were thoroughly studied, achieving the best results for cost minimization on tree and simple-path cases for heterogeneous systems.

 Support Vector Machine (SVM) is a supervised machine learning algorithm for solving classification problems. Given a set of training examples, each marked as belonging to one of two classes, the SVM algorithm builds a model that assigns each new example to one of those classes.
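As an illustration of the SVM behaviour described above, here is a minimal scikit-learn sketch on a toy two-class dataset; the points and labels are invented for illustration and are not drawn from the heart-disease data.

```python
from sklearn.svm import SVC

# Toy, linearly separable two-class data (illustrative values only)
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]
y_train = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # linear kernel: separates classes with a hyperplane
clf.fit(X_train, y_train)

# The fitted model assigns each new example to one of the two classes
print(clf.predict([[2.5, 2.5], [0.2, 0.1]]))   # → [1 0]
```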
 There are two kinds of problems: linearly separable and non-linearly separable. For linearly separable problems, SVM uses a linear kernel, which classifies the dataset among the different classes using a linear hyperplane. Our approach was to try different classifiers and compare which of them works better on the given heart-disease dataset.

 Next, we go a step further by predicting the absence (zero) or presence (non-zero) of heart disease. This is possible because we group all severity classes (1 to 4) together, so that a non-zero label indicates the presence of heart disease and a zero indicates its absence. The first step is dimensionality reduction, for which we use PCA with 5 components, picking the best 5 components out of the 14 attributes. This yields a vector representation as in problem 1: 303 samples x 5 features. For problem 2, we use an 80/20 split, where 80% of the data is used to train the classifier and 20% to test it. We then follow the same procedure as for problem 1, applying a linear SVM and a non-linear SVM with RBF kernel, evaluated with stratified 5-fold cross-validation, all with C = 0.001.
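The pipeline above can be sketched as follows. Since the 303-sample, 14-attribute heart-disease table is not bundled here, a synthetic stand-in from `make_classification` is used in its place; the PCA component count, split ratio, and C value follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 303 x 14 heart-disease table (real data not bundled)
X, y = make_classification(n_samples=303, n_features=14, n_informative=6,
                           random_state=0)

# Dimensionality reduction: keep the best 5 of 14 attributes -> 303 x 5
X5 = PCA(n_components=5).fit_transform(X)

# 80/20 train/test split, stratified on the binary label
X_tr, X_te, y_tr, y_te = train_test_split(X5, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Linear SVM vs non-linear SVM with RBF kernel, both at C = 0.001
for name, clf in [("linear SVM", SVC(kernel="linear", C=0.001)),
                  ("RBF SVM", SVC(kernel="rbf", C=0.001))]:
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(clf.score(X_te, y_te), 3))

# Stratified 5-fold cross-validation on the reduced data
scores = cross_val_score(SVC(kernel="linear", C=0.001), X5, y,
                         cv=StratifiedKFold(n_splits=5))
print("5-fold CV mean accuracy:", round(scores.mean(), 3))
```

On the real data the reported accuracies would of course differ; the sketch only shows the shape of the procedure.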

 It is evident that input data plays an important role in prediction, along with the machine learning techniques. As seen in the provided dataset, labels range from 0 to 4, and there are hardly 13 samples with label 4; when the data is split into train and test sets, this number becomes very small, which is effectively noise and can be removed entirely from the dataset using filtering techniques, so that the linear model can predict the outcome much better in the absence of noise. Moreover, PCA removes similar features while still obtaining predictions with great efficiency. We also conducted tests using the non-linear RBF kernel, a normal first choice, and validated against the linear SVC kernel, which outperformed RBF in the split case. Most importantly, this not only helps us predict the outcome but also gives us valuable insights about the nature of the data, which can be used in the future to train our classifiers in a much better way.
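The two preprocessing ideas discussed above, filtering out a near-empty severity class and collapsing severities 1 to 4 into a single "presence" label, can be sketched as follows; the toy label array is illustrative, not the real class distribution.

```python
import numpy as np

# Toy severity labels in 0..4 (illustrative; class 4 is deliberately rare)
labels = np.array([0, 1, 3, 0, 4, 2, 0, 1, 2, 3, 1, 0])

# Filter out classes too rare to learn from (here: fewer than 2 samples)
counts = np.bincount(labels, minlength=5)
keep = counts[labels] >= 2
filtered = labels[keep]                 # class 4 is removed as noise

# Collapse severities 1-4 into "presence" (1); 0 stays "absence" (0)
binary = (labels > 0).astype(int)
print(binary)   # → [0 1 1 0 1 1 0 1 1 1 1 0]
```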
Software Requirements

 OS: Windows
 Python IDE: Python 2.7.x and above
 PyCharm IDE
 Setuptools and pip installed for Python 3.6.x and above
Hardware Requirements

 RAM: 4 GB and higher
 Processor: Intel i3 and above
 Hard disk: 500 GB minimum