CSE-875d Merged

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/371469848
Utilizing Data Mining Techniques to Improve Healthcare
Article in Computer · June 2023
3 authors, including:
Rahul Kumar
All content following this page was uploaded by Rahul Kumar on 10 June 2023.

Utilizing Data Mining Techniques to Improve Healthcare
Sobhit Kumar, Ram K
ABSTRACT
The main objective of utilizing data mining techniques in healthcare is to
develop an automated tool that can effectively notify doctors about patients'
treatment history, disease presence, and medical data. By leveraging data
mining algorithms, healthcare providers can derive valuable patterns and
insights from medical data, enabling more informed and practical decision-
making. This project specifically focuses on diabetes medical data. It
employs classification and clustering algorithms such as OPTICS,
Gaussian Naïve Bayes, and BIRCH to analyze the data. The effectiveness and
efficiency of these algorithms are examined, with the goal of improving the
understanding and management of diabetes. By applying data mining
techniques to diabetes medical data, this project aims to enhance
healthcare practices and outcomes. Early disease detection, accurate
classification, and clustering of patient data can greatly support healthcare
professionals in providing personalized and effective treatments. The results
of this project have the potential to significantly impact the healthcare sector
by facilitating the identification of meaningful patterns and relationships
within medical data. By harnessing the power of data mining, healthcare
providers can optimize their decision-making processes and improve patient
care.
Keywords: Big data healthcare, Data mining techniques, Gaussian Naïve Bayes,
OPTICS, BIRCH
I. INTRODUCTION

In India, healthcare systems have gained importance in recent years with the
emergence of Big Data analytics. Diabetes mellitus poses a major health
problem in the country today, and India ranks at the top in the world.
Diabetes is a chronic medical condition that can be managed and controlled
through lifestyle changes at an initial stage. At an advanced stage, diabetes
can still be controlled with early detection and proper medication. Current
statistics indicate that approximately 145 million people worldwide are
affected by diabetes mellitus, and 5% of the Indian population contributes
towards this rate. Diabetes is a condition in which the body is unable to
produce the required amount of insulin, which is needed to balance and
regulate the amount of sugar in the body. Severe diabetes can also lead to
heart disease, blindness, kidney failure, etc. Diabetes arises for two
reasons:

• The required amount of insulin is not produced by the pancreas. This
characterizes Type-1 diabetes, which occurs in 5–10% of people.
• In Type-2, the insulin-producing cells become inactive.

Gestational diabetes occurs in women when a high blood sugar level develops
during pregnancy.
Table 1: Comparison between Type 1 and Type 2 diabetes

Feature          Type 1 diabetes    Type 2 diabetes
Onset            Sudden             Gradual
Age              Children           Adults
Body size        Thin or normal     Often obese
Ketoacidosis     Common             Rare
Autoantibodies   Usually present    Absent
Prevalence       ~10%               ~90%

Interpreting and analyzing the presence of diabetes is a significant
classification problem. The classifier is intended to be more convenient and
cost-efficient. Big Data and data mining techniques offer a great deal to
human-related applications. These methods are well suited to medical
diagnosis, which is itself a classification problem. A physician must analyze
many factors before actually diagnosing diabetes, which makes the task
difficult. Automated diabetes detection can therefore be designed using
machine learning and data mining techniques.

II. MACHINE LEARNING

Machine learning (ML) is the study of computer algorithms that automatically
improve their performance on complex tasks. It is seen as a subset of
artificial intelligence. Machine learning algorithms build a mathematical
model based on sample data, known as "training data", in order to make
predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in applications such as email filtering
and computer vision, where it is difficult to develop conventional algorithms
to perform the required tasks.

A. Relation to Data Mining:
Machine learning and data mining often employ the same methods, but machine
learning focuses on prediction learned from the training data, whereas data
mining focuses on discovering previously unrecognized properties in the data.
Data mining uses many machine learning methods, but with different goals.
Machine learning also uses data mining methods such as 'unsupervised
learning' to improve learning methodologies.

B. Relation to optimization:
Machine learning also has close ties to optimization. Many learning problems
are formulated as the minimization of a loss function on a training set of
examples. The loss function expresses the discrepancy between the predictions
of the model being trained and the actual problem instances. The difference
between the two fields arises from the goal of generalization: optimization
algorithms can minimize the loss on a training set, whereas machine learning
is concerned with minimizing the loss on unseen samples.

C. Relation to statistics:
Machine learning and statistics are closely related fields in terms of
methods, but distinct in their principal goal: statistics draws population
inferences from a sample, whereas machine learning identifies generalizable
predictive patterns. According to Michael I. Jordan, the ideas of machine
learning, from methodological principles to theoretical tools, have had a
long pre-history in statistics. He also suggested the term data science as a
placeholder to call the overall field.

III. METHODOLOGY

Data mining is a new paradigm for analyzing medical data and extracting
useful and practical patterns. Data mining helps us to predict the type of
disease and tries to find previously unidentified patterns. The objective of
the proposed methodology is to analyze the medical dataset and predict
whether the patient
is suffering from diabetes or not. The prediction of diabetes is done using
data mining algorithms such as Gaussian Naïve Bayes, BIRCH and OPTICS. The
Naïve Bayes technique is applied to the dataset to predict whether the
patient is diabetic or non-diabetic. The BIRCH and OPTICS clustering
algorithms are used to group people with similar disease characteristics into
one cluster, and the more efficient algorithm is identified by calculating
performance measures.

A. Input Dataset
The dataset used for the application is the “Pima Indian diabetes dataset”.
The dataset consists of several medical predictor (independent) variables and
one target (dependent) variable, Outcome. The dataset is a CSV (Comma
Separated Values) file. It contains up to 760 records. This dataset is taken
from the National Institute of Diabetes and Digestive and Kidney Diseases.
The main objective of the dataset is to predict whether the patient has
diabetes or not, based on the diagnostic measurements included in the
dataset. Several constraints were placed on the selection of these instances
from a larger database. In this dataset all patients are females at least 21
years old of Pima Indian heritage.

Features in the dataset are:

• Number of times pregnant
• Plasma glucose concentration
• Diastolic blood pressure
• Triceps skin fold thickness
• 2-Hour serum insulin
• Body mass index
• Diabetes pedigree function
• Age
• Outcome

For the dataset, we apply algorithms to detect whether a patient is diabetic
or not. The dataset consists of nine attributes, including the class variable
called Outcome.

IV. ALGORITHMS USED IN PROPOSED SOLUTION

A. Gaussian Naïve Bayes
Naïve Bayes classifiers are simple probabilistic classifiers based on
applying Bayes' theorem with strong independence assumptions between the
features.

Why do we use Naïve Bayes? It is simple and easy to implement, does not
require much training data, is highly scalable, is fast enough to make
real-time predictions, and is not sensitive to irrelevant features. The Naïve
Bayes classifier works on the principle of conditional probability given by
Bayes' theorem. Bayes' theorem allows the estimated probability of an event
to be updated as new information is included.

B. OPTICS Algorithm
OPTICS stands for Ordering Points To Identify the Clustering Structure. It
builds on the DBSCAN clustering algorithm and adds two more terms to the
concepts of DBSCAN clustering. They are:

1) Core Distance:
The core distance is the minimum radius required to classify a given point as
a core point. If the given point is not a core point, then its core distance
is undefined.
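The core-distance definition used by OPTICS can be illustrated with a minimal Python sketch. It assumes Euclidean distance and that a point counts as a member of its own ε-neighbourhood; the function name `core_distance` and the parameters `eps` and `min_pts` are illustrative choices, not taken from the paper, and the four sample points are toy data.

```python
import math

def core_distance(points, idx, eps, min_pts):
    """Return the core distance of points[idx], or None if it is not a core point.

    Assumption: a point counts as a member of its own eps-neighbourhood.
    """
    p = points[idx]
    # Distances from p to every point (including itself, at distance 0).
    dists = sorted(math.dist(p, q) for q in points)
    # Keep only the neighbours that lie within the search radius eps.
    within = [d for d in dists if d <= eps]
    if len(within) < min_pts:
        return None  # not a core point: the core distance is undefined
    # Smallest radius that still encloses min_pts points.
    return within[min_pts - 1]

# Toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
print(core_distance(points, 0, eps=3.0, min_pts=3))  # dense region
print(core_distance(points, 3, eps=3.0, min_pts=3))  # isolated point
```

In this sketch the isolated point has fewer than `min_pts` neighbours within `eps`, so the function returns `None` to signal an undefined core distance.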
Fig. 1: Core distance

2) Reachability Distance:
The reachability distance is defined with respect to another data point q:
the reachability distance of a point p from q is the larger of the core
distance of q and the distance between q and p.

This clustering technique differs from other techniques in that it does not
explicitly partition the data into clusters. Instead, a visualization of the
reachability distances is produced and is used to cluster the data.

Fig. 2: Reachability distance

3) Algorithm steps:
Step 1: Initially, ε and MinPts have to be specified.
Step 2: All the data points in the dataset are marked as unprocessed.
Step 3: Neighbors are found for each unprocessed point p.
Step 4: Mark the data point as processed.
Step 5: Set the core distance of the data point p.
Step 6: Create an order file and include data point p in the file.
Step 7: If core distance initialization is unsuccessful, return to Step 3;
otherwise go to Step 8.
Step 8: Calculate the reachability distance for each of the neighbors and
update the order seeds with the latest values.
Step 9: Find the neighbors of each data point in the order seeds and mark the
point as processed.
Step 10: Set the core distance of the point and append it to the order file.
Step 11: If the core distance is undefined, go to Step 9; otherwise continue
with Step 12.
Step 12: Repeat Step 8 until there is no change in the order seeds.
Step 13: End.

C. BIRCH Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a
clustering algorithm that can cluster large datasets by first generating a
small and compact summary of the large dataset that retains as much
information as possible. The smaller summary is then clustered instead of the
larger dataset. BIRCH is often used to complement other clustering algorithms
by creating a summary of the dataset that the other clustering algorithm can
then use. BIRCH has one major drawback: it can process only metric
attributes. A metric attribute is an attribute whose values can be
represented in Euclidean space.
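The compact summary that BIRCH builds is, in the standard formulation of the algorithm, a clustering feature (CF): the triple of point count, per-dimension linear sum, and sum of squared norms. The following Python sketch illustrates that idea only; the class name `ClusteringFeature` and its methods are hypothetical names chosen for this example, and the three points are toy data rather than records from the diabetes dataset.

```python
import math
from dataclasses import dataclass

@dataclass
class ClusteringFeature:
    """Summary of a group of points: count, linear sum, sum of squares."""
    n: int
    ls: tuple   # per-dimension sum of the points
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        # CFs are additive, so two summaries combine without the raw points.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return ClusteringFeature(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root-mean-square distance of the summarised points from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

# Summarise three toy points one at a time.
cf = ClusteringFeature.from_point((1.0, 2.0))
for p in [(3.0, 2.0), (2.0, 5.0)]:
    cf = cf.merge(ClusteringFeature.from_point(p))
print(cf.n, cf.centroid(), round(cf.radius(), 3))
```

Because the three components simply add up under `merge`, a CF tree can absorb new points or combine sub-clusters while keeping only these summaries in memory, which is what allows the smaller summary to be clustered in place of the full dataset.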
Fig. 3: Phases of BIRCH Algorithm

Phase 1: Scan the dataset and construct an initial in-memory CF tree.
Phase 2: Scan all the leaf entries of the CF tree and build a new CF tree
that is smaller in size. Remove the outliers and form the clusters.
Phase 3: Use a clustering algorithm to cluster all the leaf entries. This
phase produces a set of clusters.
Phase 4: The cluster centroids obtained in Phase 3 are used as seeds, and the
data points are redistributed to their closest seeds to form new cluster
representations. Finally, each leaf entry represents a cluster class.

1) Algorithm steps:
Step 1: Set an initial threshold value and insert data points into the CF
tree according to the insertion algorithm.
Step 2: Increase the threshold value if the size of the tree exceeds the
memory limit assigned to it.
Step 3: Reconstruct the partially built tree according to the newly set
threshold value and memory limit.
Step 4: Repeat the above steps until all the data objects are scanned, which
forms a complete tree.
Step 5: Build smaller CF trees by varying the threshold value and eliminating
the outliers.
Step 6: Considering the leaf entries of the CF tree, improve the cluster
quality by applying a global clustering algorithm.
Step 7: Redistribute the data objects and label each point in the completely
built CF tree.

V. COMPARISON BETWEEN PERFORMANCE OF ALGORITHMS

The performance of the algorithms is evaluated using precision, recall and F1
scores.

Precision: Precision is a good measure to use when the cost of a False
Positive is high. For instance, in email spam detection, a false positive
means that an email which is non-spam (actual negative) has been identified
as spam (predicted spam). The user might lose important emails if the
precision of the spam detection model is not high.

Recall: Recall calculates what fraction of the Actual Positives our model
captures by labelling them as Positive (True Positive). By the same
reasoning, Recall is the metric to use for selecting the best model when
there is a high cost associated with a False Negative.

F1 Score: The F1 Score is needed when you want to seek a balance between
Precision and Recall. What, then, is the difference between the F1 Score and
Accuracy? Accuracy can be largely driven by a large number of True Negatives,
which in most business circumstances we do not focus on much, whereas False
Negatives and False Positives usually carry business costs.
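The precision, recall and F1 metrics used in this comparison follow the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). A minimal Python sketch of these formulas is given below. The function name `precision_recall_f1` and the label lists are illustrative toy values, not the paper's data or results.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels (1 = diabetic, 0 = non-diabetic); not taken from the dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here there are three true positives, one false positive and one false negative, so all three metrics come out at 0.75. For clustering output, cluster labels would first have to be mapped to the two outcome classes before these formulas apply; how that mapping was done is not stated in the text.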
Table 2: Comparison between OPTICS and BIRCH

ALGORITHM   PRECISION   RECALL   F1 SCORE
OPTICS      0.59        0.59     0.59
BIRCH       0.42        0.415    0.40

Based on the above performance metrics for the two clustering algorithms,
OPTICS and BIRCH, the algorithm most suitable for diabetes detection is
OPTICS. Across all of the parameters considered in this comparison, OPTICS
performs best.

Fig. 4: Accuracy of BIRCH
Fig. 5: Accuracy of OPTICS

VI. CONCLUSION AND FUTURE SCOPE

The usefulness of data mining algorithms such as Gaussian Naïve Bayes, BIRCH
and OPTICS for the prediction of diabetes is demonstrated. Data mining
techniques are useful in diagnosing and clustering the reports of diabetic
patients. BIRCH and OPTICS are used to cluster similar kinds of people, where
BIRCH operates on the CF tree and OPTICS operates on the ordering of the
points. The clustering algorithms are analyzed and compared using several
performance metrics. It is observed that, for the same number of clusters
obtained by the different clustering techniques, OPTICS is the more efficient
and is suitable for the diagnosis of diabetes. This work helps doctors to
diagnose the disease and supply the recommended medicine to the patient at an
early stage. The main aim is to reduce cost and provide better treatment. In
future, this work can be extended with additional classification algorithms,
whose accuracy can be compared to find the optimal one.
