
2020 International Conference for Emerging Technology (INCET)

Belgaum, India. Jun 5-7, 2020

Analysis of Anemia Using Data Mining Techniques with Risk Factors Specification

Mohammed Sami MOHAMMED
Department of Electronics and Communications, Faculty of Electrical and Electronic Engineering, Yildiz Technical University, Istanbul, 34220, Turkey
mohammed_sami9@yahoo.com

Arshed A. AHMAD
Department of Mathematics, Yildiz Technical University, Istanbul, 34220, Turkey; Computer Department, College of Education for Pure Science, Diyala University, Diyala, Iraq
arshed980@gmail.com

Murat SARI
Department of Mathematics, Yildiz Technical University, Istanbul, 34220, Turkey
sarim@yildiz.edu.tr

Abstract— A deficiency in healthy Red Blood Cells (RBC) leads to insufficient oxygen being carried to the body's tissues. This condition, known as anemia, has many causes, such as iron or vitamin deficiency. Pregnant women, children under the age of 6, people with a low-vitamin diet and people losing blood due to surgery or injury are at risk of developing anemia. The disease can be diagnosed by a blood test called the Complete Blood Count (CBC), which evaluates the hemoglobin level of the patient’s blood. Left undiagnosed or untreated, anemia can cause health problems such as severe fatigue and pregnancy complications. Different types of anemia, particularly those associated with iron or vitamin deficiency, can be ameliorated, especially when detected at an early stage. In this paper, four techniques, Bayesian Network (BN), Naive Bayes (NB), Logistic Regression (LR) and Multilayer Perceptron (MLP), have been applied to predict anemia based on 539 records with 10 attributes collected from laboratories. The LR has given better results than the other techniques considered. In addition, attribute evaluators such as information gain have been applied to demonstrate the high performance of the system with a minimum number of attributes.

Keywords— Anemia, Bayes Network, Naive Bayes, Logistic Regression, Multi-Layer Perceptron

I. INTRODUCTION

Recently, the most difficult challenge facing healthcare and health institutes has been the early detection of dangerous disorders that lead to complicated health problems. Medical data can be assembled from different sources such as images, laboratories or other source types. Working with such unstructured data requires techniques that mine the data carefully and extract patterns, which is an essential part of gathering knowledge. These patterns are difficult to analyze or even discover by humans alone. Data mining techniques try to build the model that is closest to the actual patterns underlying the data under examination. Detection of RBC shape or count disorders is important when data or even images are available. Changes in RBC shapes, which occur for different reasons, can help physicians detect anemic patients if related images are supplied, as in [1]. A programmable architecture was proposed in [2] to build a simple anemia predictor based on the color of RBCs. In addition, an automatic design was proposed in [3] to detect anemia patients. In that article, two techniques were used to recognize the differences between normal and abnormal forms of RBC, and a framework was used to compare them with the testing samples using the Euclidean distance as a classification metric. In the least developed countries, such as Malawi, anemia is widespread, especially in children and pregnant women. Given the need for an anemia prediction system and the cost of such systems, researchers suggested a low-cost predictor [4]. The testing cost was $1.00 per patient, but in that paper the researchers demonstrated a spectral method that reduced the price per patient. Such disease prediction devices have attracted researchers' interest in designing an early detector [5]. By using impedance analysis and relying on hematocrit analysis, this device works with whole blood samples. Numerical techniques based on radial basis functions were presented as a solution for anemia treatment [6].

Rare forms and counts of some red blood cells are essential for recognizing iron deficiency anemia, which was done with three different classifiers in [7]. According to some researchers [8], data mining approaches are able to classify two types of anemia but fail to predict the cause of the condition. Artificial Neural Networks (ANN) have also been used in anemia prediction to classify visual RBC samples [9], [10] and [11]. Similar approaches with different processing techniques were also suggested in [12], using Laplacian of Gaussian filters to identify anemia at an early stage, again depending on counts and shapes. A lower hemoglobin intensity in blood samples is one indicator of anemia. Three classes of eye and tongue image samples were analyzed to present a classification model based on image feature extraction [13]. Anemia symptoms may remain unknown without detecting or identifying the RBC count in blood samples. The occurrence of anemia is difficult to establish from a single parameter; therefore, studies such as [14] depend not only on the RBC count but also on other parameters such as MCV, WBC and TIBC. In that paper, a fuzzy logic system is presented to extract the pattern among nine different parameters for anemia prediction.

Depending on the different levels of anemia development, a decision system was suggested to classify its activity [15]. In order to extract the patterns between different RBC-related parameters for undernourishment conditions such as iron deficiency, which causes anemia in pregnant women, researchers presented an algorithm in [16]. In the present work, a dataset of 539 patients was collected in Iraq for different parameters to map the cause of iron deficiency in that region. Such a medical field should be carefully examined in order to detect the disease at an early stage. With the help of this dataset, and especially for such under-monitored regions, the pattern between the different parameters can provide a clear view of the main reason for these types of

978-1-7281-6221-8/20/$31.00 ©2020 IEEE 1

Authorized licensed use limited to: Carleton University. Downloaded on August 06,2020 at 20:03:39 UTC from IEEE Xplore. Restrictions apply.
disorders. This article is expected to help guide health institutes and the related disease-control activities in society in dealing appropriately with this type of data when predicting anemia types.

II. THE PROPOSED METHODS

In this article, four algorithms have been used to predict anemic patients based on 10 attributes for 6 different classes. In this section, a brief description of these algorithms, their principles and the reasons for their selection are discussed. The proposed structure used for this paper is shown in Fig. 4. As shown in the figure, the dataset has been prepared to be applied to four different mining techniques with and without an attribute evaluator. In addition, the techniques are compared after applying the feature selector in order to examine which parameters affect the overall prediction system. This paper also demonstrates the limitations of these techniques for the dataset with different attribute values.

A. Naive Bayes

Based on Bayes' theorem, the Naive Bayes classifier is a family of algorithms that share a common principle: the features are assumed to be independent of each other and to contribute equally to the classes. The main rule is given by

$$P(C \mid F) = \frac{P(F \mid C)\, P(C)}{P(F)} \tag{1}$$

where C denotes the anemia class, F denotes the dataset features to be classified, P(C|F) is the probability of class C given the features F, and P(F|C) is the probability of the features F given the class C.

Under the assumption of independence between the dataset features, the probability factorizes as

$$P(C \mid f_1, f_2, \ldots, f_n) = \frac{P(f_1 \mid C)\, P(f_2 \mid C) \cdots P(f_n \mid C)\, P(C)}{P(f_1)\, P(f_2) \cdots P(f_n)} \tag{2}$$

The final classification is obtained by selecting the class with the maximum value, as expressed by

$$C = \arg\max_{C}\; P(C) \prod_{i=1}^{n} P(f_i \mid C) \tag{3}$$

where P(C) is the probability of the data class and P(fi|C) is the probability of the conditionally independent feature fi given the class C.

The Naive Bayes can also be used for unbalanced data, as in [17]. In that article, the NB is fused with an ANN to solve the problem of unbalanced data, where the dataset is a collection of different types and some classes are significantly larger than others. Various applications of this model have been presented in the literature, for example text categorization [18].

Researchers [19] suggested a model using the KNN and NB to detect text and image email spam. In another article [20], the NB was applied to disease prediction, specifically asthma. Business information was taken as part of a study focusing on predicting business profitability with business intelligence, as seen in [21]. A new prediction system was introduced in [22] to detect hypervisor attacks, which mainly occur when hosts are connected to resources for increasing capacity and machine efficiency. A data mining technique based on the NB was used to evaluate business improvement [23]. The main structure of the Naive Bayes, representing the posterior and likelihood probabilities, is shown in Fig. 1.

Fig. 1. The main structure of Naive Bayes algorithm applied to the anemia data

B. Bayes Network

The Bayesian Network is a probabilistic graphical method that uses Bayesian inference for probability assessment. In this method, the conditional dependences between features, represented by the edges, are factors of the joint probability. Link-based dataset classification was analyzed and predicted using a multi-relational BN, as seen in [24]. A crude Fourier approach for image classification was applied using different data mining techniques [25]. A suitable application of the BN for human activity recognition based on a large amount of data was also carried out, as described in [26]. A new recognition system based on the BN for abnormal human activity in surveillance video was presented in [27].

C. Logistic Function

Logistic regression (LR) is also a machine learning approach based on the basics of probability; it classifies different types of data using a sigmoid function instead of a linear one. A protein structure prediction system was introduced using the LR [28], [29]. By modeling phase distribution parameters, the LR method was used to predict the extracted theoretical function with those distributions [30]. Various other applications of the LR have also been presented, such as a placement prediction system, risk prediction for mobile users and customer churn prediction, as given in [31], [32] and [33].

III. COLLECTED DATASET

Data was collected for 539 patients with 10 attributes and one column for the 6 different class types. Fig. 2 shows some of the information collected about anemia in the laboratory, as described in detail in [34].
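As a minimal sketch of the Naive Bayes decision rule in equations (1)-(3) of Section II-A, the following Python code estimates P(C) and P(fi|C) from categorical data and returns the class with the largest product. The toy rows, the feature levels ("low", "normal"), the class names and the Laplace smoothing constant `alpha` are illustrative assumptions and are not taken from the collected anemia data or from the WEKA implementation used in this paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class and feature-value frequencies (alpha is an assumed smoothing term)."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[i][c][v] = number of class-c rows whose i-th feature equals v
    value_counts = [defaultdict(Counter) for _ in range(n_features)]
    vocab = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            vocab[i].add(v)
    return class_counts, value_counts, vocab, alpha


def predict(model, row):
    """Return argmax_C P(C) * prod_i P(f_i | C), evaluated in log space."""
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log P(C)
        for i, v in enumerate(row):
            numerator = value_counts[i][c][v] + alpha
            denominator = n_c + alpha * len(vocab[i])
            score += math.log(numerator / denominator)  # log P(f_i | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: (HB level, MCV level) -> class label.
rows = [("low", "low"), ("low", "low"), ("low", "normal"),
        ("normal", "normal"), ("normal", "normal"), ("normal", "low")]
labels = ["anemic", "anemic", "anemic", "normal", "normal", "normal"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("low", "low")))
print(predict(model, ("normal", "normal")))
```

The denominator of equation (2) is the same for every class, so it is omitted here, as in equation (3). Working with sums of logarithms instead of products is a common way to avoid numerical underflow when many features are multiplied.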

Fig. 2. Some features of the anemia data: (a) Hemoglobin (HB), (b) Mean Corpuscular Hemoglobin (MCH) and (c) Hematocrit (HCT).

Table 1 gives a brief description of the characteristics of the collected anemia data. It shows the minimum, maximum and mean values of each feature and the standard deviation of each column.

TABLE I. CHARACTERISTICS OF THE COLLECTED ANEMIA DATA

Feature   Minimum   Maximum   Mean    STD*
HB        1.4       18.2      9.5     4.3
RBC       0.9       11.9      5.1     2.2
MCH       11.7      77        26.6    4.5
WBC       1.6       146.1     13.1    16.2
MCV       38.6      117       82.6    8.7
HCT       7.7       51.7      33.1    9
MCHC      22.6      60.5      32      3.1
PLT       2         1892      363.2   210.1
AGE       6         56        21.4    10.6

* STD indicates the column standard deviation.

Fig. 3 shows the class relations with respect to parameters such as HB and PLT, which reveals the overlap between the 6 classes.

Fig. 3. Relation of PLT and HB for 6 Anemia Classes

Fig. 4. Proposed Algorithm for Anemia Prediction system

In Fig. 4, the algorithm proposed for the various classes, with 539 rows and 10 attributes, is shown in more detail.

IV. EXPERIMENTAL RESULTS

The results have been simulated using WEKA to apply the data mining techniques and MATLAB 2017b to apply the attribute evaluator methods. The results are divided into two steps: first without feature selection methods, and second after applying them. Ten-fold cross-validation was used for the data classification system. The simulation results for Bayes Net, Naive Bayes, Logistic Regression and Multi-Layer Perceptron are presented in Table 2.

TABLE II. THE SIMULATED RESULTS OF THE PROPOSED ALGORITHMS BEFORE APPLYING FEATURE SELECTION

Technique   Elapsed Time (s)   Accuracy (%)   Mean Absolute Error   Root Mean Square Error
BN          0.06               85.1           0.056                 0.198
NB          0.01               83.6           0.064                 0.209
LR          0.33               87.3           0.062                 0.183
MLP         1.61               87.1           0.054                 0.19

As shown in Table 2, the time taken by the NB method to predict such biomedical behaviour before applying the feature selector is less than for the other techniques. Easy implementation, small data requirements and the independence assumption on the predictors make the NB faster to compute, but not more accurate. In the LR, the independent features need not be normally distributed. In addition, feature identities are not required in
various assumptions. The LR provides not only a measure of the relevance of the features to the different classes, but also the direction of the association between features and classes. Table 3 shows the confusion matrix generated using the LR based on the collected anemia data.

TABLE III. THE CONFUSION MATRIX GENERATED USING THE LR BASED ON THE ANEMIA DATA

Class No.   1     2     3     4     5     6
1           198   8     0     4     0     1
2           5     68    0     10    0     0
3           1     0     3     5     0     0
4           9     5     0     200   1     2
5           0     3     0     5     2     0
6           0     5     1     3     0     0

Inadequate disease data forced researchers to seek an improvement in the disease prediction system. Therefore, attribute selector methods have been used to minimize the number of features and to examine the effect of this reduction on the proposed techniques. The Information Gain, Correlation, OneR and Symmetrical Uncertainty attribute selectors are the main evaluators applied to the data of this article before applying the same prediction techniques. Table 5 shows the rank value of each attribute according to the attribute evaluation techniques, to clarify its utility for class prediction. Based on that table, three attributes (WBC, MCV and gender) are removed for the next step of the prediction system due to their low influence on the model. The simulated results after the attribute selector has been applied are shown in more detail in Table 4.

TABLE IV. THE SIMULATED RESULTS OF THE PROPOSED ALGORITHMS AFTER APPLYING THE FEATURE SELECTION

Technique   Elapsed Time (s)   Accuracy (%)   Mean Absolute Error   Root Mean Square Error
BN          0.02               85.3           0.057                 0.205
NB          0.01               84.6           0.062                 0.199
LR          0.17               86.1           0.068                 0.183
MLP         0.84               86.1           0.068                 0.189

According to Table 4, the NB and BN have a better accuracy after the attribute reduction, due to their robust assumption on the data distribution. Moreover, training with the NB or BN is faster because it does not have to process the whole dataset at once, or even store it in memory. The NB and BN have fast execution times after combining the selectors with the mining techniques.

The performances of the LR and MLP methods decrease after attribute reduction, unlike those of the other techniques. The main reason for this is the linearity assumption between dependent and independent features in the LR. On the other hand, the MLP has a training limitation: it can get stuck in a local minimum region without reaching the global one. The MLP also needs to be trained several times from various starting points, which is a major reason for its high execution time. In addition, the MLP has underfitting and overfitting issues when the number of hidden layers is selected manually. Fig. 5 shows the system performances before and after applying the attribute evaluators to the same prediction techniques.

Fig. 5. Anemia prediction system performances (a) before and (b) after applying attribute evaluators

V. CONCLUSION

A reduction of Red Blood Cells (RBC) causes an insufficiency of oxygen, which forces the human body to collapse if it is left untreated or not detected at an early stage. In this paper, four different methods (BN, NB, LR and MLP) have been applied to detect the anemia types, considering 10 attributes over 539 samples. The LR and MLP showed better performances than the other proposed techniques, with 87.3% and 87.1%, respectively. Then, four different attribute evaluators have been utilized to find the risk factors that affect the prediction system the most. It has been concluded that the LR and MLP have difficulty keeping their high performance, but still have the best results compared to the other proposed algorithms. It has been found

TABLE V. RANK ACCORDING TO THE PROPOSED ATTRIBUTE SELECTOR

Parameters                HB     RBC    MCH    WBC    MCV    HCT    MCHC   PLT    Age    Gender
Info. Gain                1.09   0.77   0.37   0.18   0.13   0.81   0.37   0.22   0.24   0.01
Correlation               0.58   0.25   0.33   0.22   0.19   0.60   0.32   0.29   0.31   0.09
OneR                      0.62   0.49   0.35   0.26   0.18   0.57   0.35   0.26   0.34   0.04
Symmetrical Uncertainty   0.53   0.44   0.20   0.14   0.10   0.45   0.22   0.15   0.17   0.02
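
The scores in Table V can be turned into a single ranking in several ways. The sketch below assumes, purely for illustration, that the four evaluator scores of each attribute are averaged and the three lowest-ranked attributes are dropped; the paper does not state how the evaluators were combined, so the averaging rule and the function name `mean_score_ranking` are hypothetical.

```python
# Scores copied from Table V (one list per evaluator, one entry per attribute).
attributes = ["HB", "RBC", "MCH", "WBC", "MCV", "HCT", "MCHC", "PLT", "Age", "Gender"]
scores = {
    "Info. Gain":              [1.09, 0.77, 0.37, 0.18, 0.13, 0.81, 0.37, 0.22, 0.24, 0.01],
    "Correlation":             [0.58, 0.25, 0.33, 0.22, 0.19, 0.60, 0.32, 0.29, 0.31, 0.09],
    "OneR":                    [0.62, 0.49, 0.35, 0.26, 0.18, 0.57, 0.35, 0.26, 0.34, 0.04],
    "Symmetrical Uncertainty": [0.53, 0.44, 0.20, 0.14, 0.10, 0.45, 0.22, 0.15, 0.17, 0.02],
}


def mean_score_ranking(attributes, scores):
    """Rank attributes from least to most informative by their mean evaluator score.

    Averaging across evaluators is an assumption made for this sketch.
    """
    n_evaluators = len(scores)
    means = {
        name: sum(row[i] for row in scores.values()) / n_evaluators
        for i, name in enumerate(attributes)
    }
    return sorted(means.items(), key=lambda item: item[1])


ranking = mean_score_ranking(attributes, scores)
dropped = [name for name, _ in ranking[:3]]   # three weakest attributes
kept = [name for name, _ in ranking[3:]]
print("dropped:", dropped)
print("kept:", kept)
```

With these numbers, the three attributes with the lowest mean score are Gender, MCV and WBC, which matches the three attributes removed in the text.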

that the linearity assumption between dependent and independent features leads to lower performance for the LR. On the other hand, it has been found that the MLP needs several training runs in order not to remain in a local minimum region and to maintain its high performance. The robust assumption and fast training have been seen as the main reasons for the improved performance of the Bayesian methods. As a further study, an optimization technique can be applied to this data, taking into account the relevant features, to improve the prediction accuracy of the system.
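
The LR accuracy reported in this paper can be cross-checked against the confusion matrix of Table III. The following sketch computes the overall accuracy and the per-class recall from that matrix. It assumes that rows are actual classes and columns are predicted classes, which is the usual WEKA layout but is not stated in the text; the function names are illustrative.

```python
# Confusion matrix copied from Table III (LR, before feature selection).
# Assumption: rows are actual classes 1-6, columns are predicted classes 1-6.
confusion = [
    [198, 8, 0, 4, 0, 1],
    [5, 68, 0, 10, 0, 0],
    [1, 0, 3, 5, 0, 0],
    [9, 5, 0, 200, 1, 2],
    [0, 3, 0, 5, 2, 0],
    [0, 5, 1, 3, 0, 0],
]


def accuracy(matrix):
    """Fraction of samples on the main diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recall_per_class(matrix):
    """Per-class recall: diagonal entry divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


total_samples = sum(sum(row) for row in confusion)
overall = accuracy(confusion)
recalls = recall_per_class(confusion)

print(f"samples: {total_samples}, accuracy: {100 * overall:.1f}%")
for cls, value in enumerate(recalls, start=1):
    print(f"class {cls}: recall = {value:.2f}")
```

Under that reading, the matrix sums to the 539 collected samples and gives 471 correct predictions, i.e. an accuracy of about 87.4%, in line with the value of roughly 87.3% reported for the LR. The recall is much lower for the small classes 3, 5 and 6 than for classes 1 and 4, which a single accuracy figure does not show.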

REFERENCES

[1] P. Rakshit, “Detection of Abnormal Findings in Human RBC in Diagnosing G-6-P-D Deficiency Haemolytic Anaemia Using Image Processing,” 2013 IEEE 1st International Conference on Condition Assessment Techniques in Electrical Systems (CATCON), pp. 297–302, 2013.

[2] Khan, R. K. Mondol, M. A. Zamee, and T. A. Tarique, “Regression Model Based On FPGA,” 2014 International Conference on Informatics, Electronics & Vision (ICIEV), pp. 1–5, 2014.

[3] S. Chakraborty, “A noble technique for detecting anemia through classification of red blood cells in blood smear,” International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), pp. 1–9, 2014.

[4] M. Bond, J. Mvula, E. Molyneux, and R. Richards-kortum, “Design and Performance of a Low-Cost, Handheld Reader for Diagnosing Anemia in Blantyre, Malawi,” 2014 IEEE Healthcare Innovation Conference (HIC), no. 0940902, pp. 267–270, 2014.

[5] J. Punter-villagrasa, J. Cid, and J. Colomer-farrarons, “Toward an Anemia Early Detection Device Based on 50-μL Whole Blood Sample,” vol. 62, no. 2, pp. 708–716, 2015.

[6] H. Mirinejad, T. Inanc, and A. A. M. Protocols, “Individualized Anemia Management using a Radial Basis Function Method,” 2015 IEEE Great Lakes Biomedical Conference (GLBC), pp. 1–4, 2015.

[7] M. Lotfi, B. Nazari, S. Sadri, and N. K. Sichani, “The Detection Of Dacrocyte, Schistocyte and Elliptocyte cells in Iron Deficiency Anemia,” 2015 2nd International Conference on Pattern Recognition and Image Analysis (IPRIA), no. Ipria, pp. 1–5, 2015.

[8] C. Bellinger, A. Amid, N. Japkowicz, and H. Victor, “Multi-label Classification of Anemia Patients,” 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 825–830, 2015.

[9] M. Tyagi, L. M. Saini, and N. Dahyia, “Detection of Poikilocyte Cells in Iron Deficiency Anaemia Using Artificial Neural Network,” 2016 International Conference on Computation of Power, Energy Information and Communication (ICCPEIC), pp. 108–112, 2016.

[10] E. Engineering and A. V. Vidyapeetham, “Simulation Model for Anemia Detection using RBC counting algorithms and Watershed transform,” pp. 284–291, 2017.

[11] T. S. Chy, “A Comparative Analysis by KNN, SVM & ELM Classification to Detect Sickle Cell Anemia,” pp. 455–459, 2019.

[12] S. Mohamad, N. Syahirah, A. Halim, M. N. Nordin, R. Hamzah, and J. Sathar, “Automated Detection of Human RBC in Diagnosing Sickle Cell Anemia with Laplacian of Gaussian Filter,” 2018 IEEE Conference on Systems, Process and Control (ICSPC), vol. 2, no. December, pp. 214–217, 2018.

[13] S. Roychowdhury, D. Sun, M. Bihis, J. Ren, P. Hage, and H. H. Rahman, “Computer Aided Detection of Anemia-like Pallor,” 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 461–464, 2017.

[14] M. F. Shaik, “Anemia Diagnosis by Fuzzy Logic Using LabVIEW,” 2017.

[15] S. Belginova and I. Uvaliyeva, “Decision Support System for Diagnosing Anemia,” 2018 4th International Conference on Computer and Technology Applications (ICCTA), no. Mcv, pp. 211–215, 2018.

[16] Sotomayor-beltran and D. Tarazona, “geographic information system study,” 2018 IEEE 38th Central America and Panama Convention (CONCAPAN XXXVIII), pp. 1–5, 2018.

[17] Adam, M. I. Shapiai, Z. Ibrahim, and M. Khalid, “Artificial Neural Network - Naïve Bayes Fusion for Solving Classification Problem of Imbalanced Dataset Universiti Teknologi Malaysia,” 2011 Fourth International Conference on Modeling, Simulation and Applied Optimization, pp. 1–5, 2011.

[18] M. Rafi, S. Hassan, and M. S. Shaikh, “Comparing SVM and Naïve Bayes Classifiers for Text Categorization with Wikitology as knowledge enrichment,” 2011 IEEE 14th International Multitopic Conference, pp. 31–34, 2011.

[19] Harisinghaney, A. Dixit, S. Gupta, and A. Arora, “Text and Image Based Spam Email Classification using KNN, Naïve Bayes and Reverse DBSCAN Algorithm,” 2014 International Conference on Reliability Optimization and Information Technology (ICROIT), pp. 153–155, 2014.

[20] S. Aneja and S. Lal, “Effective Asthma Disease Prediction Using Naive Bayes - Neural Network fusion technique,” 2014 International Conference on Parallel, Distributed and Grid Computing, pp. 137–140, 2014.

[21] M. T. Mishan, A. L. Kushan, and U. T. Mara, “An Analysis On Business Intelligence Predicting Business Profitability Model Using Naive Bayes Neural Network Algorithm,” vol. 5, no. October, pp. 2–3, 2017.

[22] S. Ansari and K. Hans, “A Naive Bayes Classifier Approach for Detecting Hypervisor Attacks in Virtual Machines,” 2017.

[23] M. Günay, “Makine Öğrenmesi Yöntemleri ile Kayıp Müşteri Analizi Predictive Churn Analysis with Machine Learning Methods,” 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4, 2018.

[24] O. Schulte, B. Bina, B. Crawford, D. Bingham, and Y. Xiong, “A Hierarchy of Independence Assumptions for Multi-relational Bayes Net Classifiers,” 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 150–159, 2013.

[25] M. Sundermeyer, H. Ney, and R. Schlüter, “From Feedforward to Recurrent LSTM Neural Networks for Language Modeling,” vol. 23, no. 3, pp. 517–529, 2015.

[26] L. M. Rodrigues, “Classification Methods based on Bayes and Neural Networks for Human Activity Recognition,” pp. 1141–1146, 2016.

[27] Liu, J. Ying, F. Han, and M. Ruan, “Abnormal Human Activity Recognition using Bayes Classifier and Convolutional Neural Network,” 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pp. 33–37, 2018.

[28] P. J. Munson, V. Di Francesco, and R. Porrelli, “Protein Secondary Structure Prediction using Periodic-Quadratic-Logistic Models: Statistical and Theoretical Issues,” 1994.

[29] Q. Ni, Z. Wang, Q. Han, G. Li, X. Wang, and G. Wang, “Using logistic regression method to predict protein function from protein-protein interaction data,” no. 60835005, pp. 1–4, 2009.

[30] P. Lv, H. Wang, and H. Wang, “Phase Distribution Parameter Prediction using Logistic Model in the Analysis of Two-phase Flow,” 2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs, pp. 23–30, 2014.

[31] S. Sharma, S. Prince, and S. Kapoor, “PPS - Placement Prediction System using Logistic Regression,” 2014 IEEE International Conference on MOOC, Innovation and Technology in Education (MITE), pp. 337–341, 2014.

[32] H. Kong, S. Lin, J. Wu, and H. Shi, “The Risk Prediction of Mobile User Tricking Account Overdraft Limit based on Fusion Model of Logistic and GBDT,” 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), no. Itnec, pp. 1012–1016, 2019.

[33] S. Bharadwaj, “Customer Churn Prediction in Mobile Networks using Logistic Regression and Multilayer Perceptron (MLP),” 2018 Second International Conference on Green Computing and Internet of Things (ICGCIoT), pp. 436–438, 2018.

[34] M. Sari and A. A. Ahmad, “ANEMIA MODELLING USING THE MULTIPLE REGRESSION ANALYSIS,” International Journal of Analysis and Applications (IJAA), vol. 17, no. 5, pp. 838–849, 2019.
