Professional Documents
Culture Documents
This calls
DISEASE PREDICTION USING for good prediction algorithms to predict
MACHINE LEARNING CKD at an earlier stage. A wide range of
machine learning algorithms are employed
The specific gravity is nothing but The Sugar has a range of 0-5 in out
the density of the urine. It normally ranges dataset, where most of the patients have a
from 1.002-1.030 sg. Here we can see that high sugar as the fall on 0-1. Only a few
many people have density ranging from patients have mild sugar content in their
1.01-1.02. Only a few patients fall above blood. So there are chances for diabetes
the range. High specific gravity leads to among the patients.
kidney stones, so they have to be detected
early and given treatment.
Red blood corpuscles VS Class:
Albumin VS Class:
Hypertension VS Class:
Appetite VS Class:
Diabetes is also a risking
factor to cause renal problems, as the sugar
content in the blood is high. This directly
affects the filtration of blood in the kidney.
The patients will have continuous flow of
urine which causes improper filtration.
Here we can see that many patients are
affected by diabetes. About 260 of 400 Appetite may worsen with
patients have diabetes, so they have to be progression of kidney disease leading to
treated soon. malnutrition. As nutrition status is an
important factor in dialysis and natural
Coronary artery disease VS Class:
filtration process it is important to have a
proper appetite. Here the patients have
good, poor and no appetite too. We can see
that maximum of the patients have a good
appetite with respect to the class variable.
Pedal Edema VS Class:
The Red Blood Corpuscles
count is one of the main factor in CKD, as
it is essential for filtering the blood in the
kidney. From the plot we can see that half
of the patients are normal and half of them
are abnormal. So we need to concentrate
here to classify the patients effectively, as
Pedal Edema is quite it is one of the highly contributing variable
related to appetite as appetite causes to the class.
edema around our body. Diuretics are a
Anemia VS Class:
type of medication that causes the kidneys
to excrete more water and sodium,
which can reduce edema. Diuretics must
be used with care because removing too
much fluid too quickly can lower the blood
pressure, cause light-headedness or
fainting, and impair kidney function. The
results of appetite and pedal edema are
mostly the same.
It is a condition in which the blood
doesn't have enough healthy red blood
Red blood corpuscles VS Class:
cells. When there is improper filtration of
blood in our kidney, it causes anemia as
most of our blood lacks RBCs and is not
purified. This results in the reduced count
of RBC which causes anemia. Here we can
clearly see that most of the patients are
affected by anemia as they had reduced Inference:
RBC count as we saw before. The patients
Choosing significant variables is an
have to be diagnosed early to avoid CKD.
important process before we build a
Overall inference of patients with model. Because there will be certain
and without CKD from dataset: variables which may contain many null
and Nan values, they might not contribute
much to the model’s accuracy etc. So we
will be dropping some variables based on
the Chi-squared test. Chi-squared test is
used to determine the significant variables.
Selection of significant variables:
The variables which doesn’t provide
significant information to the renal disease
classification are removed. P-value can be
used to infer how significant the variable
contributions are. Variables with p-value
greater than 0.10 are removed as they
variables doesn’t provide significant
contribution to classify the variables at
90% confidence level. The insignificant
variables which has been removed for
further proceedings are ‘Sodium’,
‘Potassium’, ‘Red blood cell count’.
Dropping these variables we will train the
new dataset to implement it in the model.
Confusion matrix is nothing but an the patients with CKD and not CKD.
(negative) is 1.00 and for class 1 are supervised learning models with
preciseness of the model i.e. so data used for classification and regression
Recall value for class 0 is 1.00 and each marked as belonging to one or the
class 1 is 1.00, which describes the other of two categories, an SVM training
amount up – to which the model algorithm builds a model that assigns new
(negative) is 1.00 and for class 1 fold cross validation is done for 10
(positive) is 0.97, indicating the folds. Thus, the linear kernel SVC has
Precision Score for class 0 62.5%. Errors in the medical fields is not
(negative) is 0.00 and for class 1 acceptable, even though the model tried its
(positive) is 0.62, indicating the best after optimization too. The model is
preciseness of the model which is also not under -fit as model optimisation
Recall value for class 0 is 0 and done for 10 folds. Thus, the Radial Bases
class 1 is 1.00, which describes the kernel SVC has made the worst
and 100% respectively. Finally we have From the models constructed above
come to the accuracy of the model. we can infer that the models with the
Surprisingly, we have the model accuracy highest accuracy of 100% is Logistic
of 100%. The model has worked well and Regression, Naive Bayes Classifier,
classified the patients with and without Decision Tree and Random Forest. The
CKD with an accuracy of 100%. The model with an accuracy lesser than 100%
model is also not over-fitted as model and also a good model is Support Vector
optimisation techniques like K fold cross Classifier with an accuracy of 98.33%. The
validation is done for 10 folds. So the
model which had a fair performance is which take the healthcare industry to an
KNN Classifier with an accuracy of advanced level.
80.83%. The model with the worst
accuracy is the SVM model with Radial
Basis kernel with an accuracy of 62.5%. Reference:
Deciding the best model out the 4 models [1]. Gunarathne W.H.S.D,Perera K.D.M,
is quite a competitive task. But by looking Kahandawaarachchi K.A.D.C.P,
into the overall accuracy, precision, recall, “Performance Evaluation on Machine
mean of scores and variance the Learning Classification Techniques for
DECISION TREE MODEL is the Disease Classification and Forecasting
winner here. The reason behind the through Data Analytics for Chronic
statement is the variance and mean of Kidney Disease (CKD)”,2017 IEEE 17th
scores is also 0 and 1 respectively, which International Conference on
is perfect! Bioinformatics and Bioengineering.
mainly focus on implementing more big Applied Engineering Research ISSN 0973-
data oriented tools and techniques which 4562 Volume 12, Number 23 (2017) pp.