IPASJ International Journal of Information Technology (IIJIT)
Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm
A Publisher for Research Motivation ........ Email: editoriijit@ipasj.org
Volume 2, Issue 4, April 2014 ISSN 2321-5976



Abstract
Data mining comprises the analysis of large data sets from various perspectives to obtain a summary of useful information.
This information can be converted into knowledge about historical patterns and future trends. Data mining plays a very
important role in the information technology domain. The health care sector today generates huge amounts of complex data,
including details about diseases, patients, diagnosis methods, electronic patient records, hospital resources, etc. Data
mining methods are very helpful in making medical decisions for curing diseases. The vast data collected by the healthcare
industry are not mined, so useful information remains hidden and decision making is less effective. The knowledge
discovered can be used by healthcare administrators to enhance service quality. In this paper, we propose a method for
identifying the frequency of diseases in a particular geographical location over a given period of time using the Apriori
data mining technique, which is based on association rules.

Keywords: KDD, Bayesian classification, Genetic Algorithm.

1. INTRODUCTION
Data mining is the extraction of knowledge from large amounts of data; it is also called knowledge mining from data.
Several other terms carry a similar meaning, such as knowledge extraction, data archaeology and data/pattern analysis.
Another popular term is knowledge discovery from data, or KDD. Decision making is supported by converting mined data
into knowledge, and this process is called knowledge discovery. Knowledge discovery consists of the following iterative
sequence of steps:
1. Data cleaning [inconsistent data and noise are removed]
2. Data integration [multiple data sources are combined]
3. Data selection [the data relevant to the analysis task are retrieved from the database]
4. Data transformation [the data are transformed into forms appropriate for mining]
5. Data mining [data patterns are extracted by applying intelligent methods]
6. Pattern evaluation [the interesting patterns representing knowledge are identified based on some measures]
7. Knowledge presentation [visualization and knowledge representation techniques are used to present the knowledge]

Data mining includes functions such as classification, association, clustering and prediction. Advanced data mining
techniques are used to discover relationships and hidden patterns. Mining association rules is one of the important data
mining applications. Association rules, introduced in 1993, identify relationships among item sets in databases; these
relationships are not based on inherent properties of the data but on the co-occurrence of the data items. In the
medical field, association rule mining can be used to find the most frequently occurring diseases in different
geographical locations during a given time period. This research work therefore analyzes medical data.

2. LITERATURE REVIEW
Jyothi Soni et al. [1] provided a survey of current techniques for predicting heart disease using data mining techniques
of knowledge discovery. Many experiments were conducted to compare the performance of the techniques and to determine
the outcomes. The survey reveals that, in terms of accuracy, Bayesian classification gives results similar to those of
decision trees, and both perform well compared with other methods such as neural networks and classification based on
clustering. Decision tree and Bayesian classification are further improved by applying a genetic algorithm, which
reduces the actual data size to obtain an optimal subset of data that is useful in predicting heart disease. Carlos
Ordonez [2] studied how to constrain association rules in order to predict heart disease. He proposed three measures to
decrease the number of patterns: first, require that attributes appear on only one side of the rule; second, divide the
attributes into uninteresting groups; third, reduce the number of rules applied. Maria-Luiza Antonie et al. [3]
investigated data mining techniques such as association rule mining and neural networks for the detection of tumors in
digital mammography. Both techniques performed well in terms of accuracy, reaching 70%.

APRIORI algorithm based medical data mining for frequent disease identification

Gitanjali J¹, C. Ranichandra², M. Pounambal³


School of Information Technology and Engineering,
VIT UNIVERSITY, Vellore-632014, Tamil Nadu, India

3. APRIORI ALGORITHM
The Apriori algorithm is used for finding frequent item sets. It was proposed by R. Agrawal and R. Srikant in 1994. The
algorithm is named Apriori because it uses prior knowledge of frequent item sets. The input is the data set D together
with min_sup, the minimum support count threshold. The output is L, the frequent item sets in D. The procedure for the
algorithm is as follows:

Step 1: Scan D to count each candidate item and generate C1, the list of all candidate 1-item sets together with their
corresponding support counts.
Step 2: Compare each candidate support count with the minimum support count to obtain L1. Generate the candidate set C2
from L1.
Step 3: Scan D to count each candidate in C2. Compare each candidate support count with the minimum support count; the
remaining candidates form L2. This process is continued for larger item sets until no further frequent item sets are found.

4. PROPOSED WORK
The Apriori algorithm can be used to mine disease occurrence details for a specific time range.

Algorithm
M_n : Medical data item set of size n
F_n : frequent item set of size n
F_1 = {frequent items};
for (n = 1; F_n != ∅; n++) do begin
    M_{n+1} = medical data derived from F_n;
    for each transaction t in the database do
        increment the count of all the medical data in M_{n+1} that are in t;
    F_{n+1} = medical data in M_{n+1} with min_support;
end
return ∪_n F_n;
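The level-wise procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `apriori`, the parameter `min_support_count` and the toy transactions are hypothetical, and each transaction is assumed to be a set of disease names.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(items): count} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]

    def count_support(candidates):
        # One database scan: count the candidates contained in each transaction,
        # then keep only those that reach the minimum support count.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    # F_1: frequent single items.
    items = {item for t in transactions for item in t}
    frequent = count_support({frozenset([i]) for i in items})
    all_frequent = dict(frequent)

    n = 1
    while frequent:
        # M_{n+1}: join pairs of frequent n-item sets that differ in one item ...
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == n + 1}
        # ... and drop candidates that contain an infrequent n-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, n))}
        frequent = count_support(candidates)  # F_{n+1}
        all_frequent.update(frequent)
        n += 1
    return all_frequent

# Hypothetical toy data: the set of diseases observed in each of four months.
months = [
    {"Heart disease", "smoking", "Overweight"},
    {"Heart disease", "smoking"},
    {"smoking", "Insomnia"},
    {"Heart disease", "smoking", "Insomnia"},
]
result = apriori(months, min_support_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step relies on the property that every subset of a frequent item set must itself be frequent, which is what keeps the number of candidates counted in each scan small.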

Result of the Research
The proposed approach is very helpful in identifying the frequently occurring diseases in large medical data sets. As a
result, practitioners can make accurate medical conclusions and decisions regarding frequent diseases. Data for
analysis are obtained from different geographical areas during various time ranges.

5. EXPERIMENTAL RESULT
This research utilizes a data set containing the electronic medical details of different patients. These include the
patient's name, disease name, age, sex, date, address, etc., for a particular year. Fig. 1 shows a bar graph of the number
of diseases affecting the patients each month. Fig. 2 depicts the number of patients affected by various diseases each
month. It reveals that in a particular month several patients are affected by the same disease.
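As an illustration of how such records might be prepared, the sketch below groups patient records by month, giving the set of diseases seen and the patients affected by each disease. The record layout, the sample values and the function name `monthly_summary` are all hypothetical; treating each month as one transaction is an assumption suggested by the 12 monthly instances reported in the results.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (patient name, disease name, visit date).
records = [
    ("Patient A", "Asthma", date(2013, 1, 5)),
    ("Patient B", "Asthma", date(2013, 1, 20)),
    ("Patient C", "Jaundice", date(2013, 1, 22)),
    ("Patient A", "Pain", date(2013, 2, 3)),
]

def monthly_summary(records):
    """Group records by month: diseases seen, and patients per disease."""
    diseases = defaultdict(set)                       # month -> diseases seen
    patients = defaultdict(lambda: defaultdict(set))  # month -> disease -> patients
    for name, disease, visit in records:
        diseases[visit.month].add(disease)
        patients[visit.month][disease].add(name)
    return diseases, patients

diseases, patients = monthly_summary(records)
for month in sorted(diseases):
    # Distinct patients seen in this month, across all diseases.
    n_patients = len({p for names in patients[month].values() for p in names})
    print(month, len(diseases[month]), n_patients)
```

The per-month disease sets correspond to the kind of count plotted in Fig. 1, and the per-month patient counts to Fig. 2; the disease sets can also serve as transactions for frequent item set mining.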

[Figure 1: bar chart; x-axis: month (tick labels Jan, Mar, May, Jul, Sep, Nov); y-axis: number of diseases (0 to 14)]

Fig. 1. Number of diseases affecting the patients monthly

[Figure 2: bar chart; x-axis: month (tick labels Jan, Mar, May, Jul, Sep, Nov); y-axis: number of patients (0 to 160)]

Fig. 2. Number of patients affected by various diseases monthly

Relation: Disease
Instances: 12
January
February
March
April
May
June
July
August
September
October
November
December

Attributes: 29
AIDS
Allergies
Heart disease
Asthma
HIV
human papilloma virus
hypertension
Impotence
Insomnia
Jaundice
Kidney Disease
Leukemia
Liver cancer
Liver Disease
Lung Cancer
Lupus
Overweight
Eye Disease
Pain
Pertussis
Pregnancy
Raynauds Phenomenon
sexually transmitted diseases
sleep disorders
smoking
stroke
Thrush
Thyroid disorders
Whooping Cough

Apriori

Minimum support: 0.35 (4 instances)
Minimum metric: 0.9
Number of cycles performed: 13

Large item sets L(1): 20


Allergies 4
Heart disease 7
Jaundice 5
HIV 4
Hypertension 5
Impotence 6
Thrush 4
Whooping Cough 7
Insomnia 6
sexually transmitted diseases 5
sleep disorders 5
Smoking 9
Raynauds Phenomenon 4
Pregnancy 4
Pain 7
Overweight 5
Lung Cancer 4
Liver Disease 4

Large item sets L(2): 21

Heart disease, hypertension 4
Heart disease, Insomnia 5
Heart disease, Kidney Disease 4
Heart disease, Liver Disease 4
Heart disease, Overweight 5
Heart disease, Pain 5
Heart disease, smoking 6
Asthma, Thrush 4
HIV, smoking 4
Hypertension, smoking 4
Impotence, Raynauds Phenomenon 4
Impotence, smoking 4
Impotence, Whooping Cough 5
Insomnia, smoking 5
Liver Disease, Overweight 4
Liver Disease, smoking 4
Pain, sexually transmitted diseases 5
Overweight, smoking 4
Pain, smoking 5
sexually transmitted diseases, smoking 4
Smoking, Whooping Cough 5

Large item sets L(3): 7

Heart disease, Insomnia, smoking 4
Heart disease, Liver Disease, Overweight 4
Heart disease, Liver Disease, smoking 4
Heart disease, Overweight, smoking 4
Impotence, smoking, Whooping Cough 4
Liver Disease, Overweight, smoking 4
Pain, sexually transmitted diseases, smoking 4

Large item sets L(4): 1

Heart disease, Liver Disease, Overweight, smoking 4
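
The reported support counts can be turned into rule confidences, as in the short sketch below. It assumes that the "minimum metric" of 0.9 refers to rule confidence, which the output does not state explicitly; the helper `confidence` and the two example rules are chosen for illustration only, while the counts are copied from the item sets listed above.

```python
# Support counts copied from the large item sets reported above (12 instances).
support = {
    frozenset({"Heart disease"}): 7,
    frozenset({"Heart disease", "smoking"}): 6,
    frozenset({"Liver Disease", "Overweight", "smoking"}): 4,
    frozenset({"Heart disease", "Liver Disease", "Overweight", "smoking"}): 4,
}

def confidence(antecedent, consequent):
    """confidence(X => Y) = count(X and Y together) / count(X)."""
    x = frozenset(antecedent)
    return support[x | frozenset(consequent)] / support[x]

MIN_CONFIDENCE = 0.9  # assumed meaning of "Minimum metric: 0.9"

rules = [
    ({"Liver Disease", "Overweight", "smoking"}, {"Heart disease"}),
    ({"Heart disease"}, {"smoking"}),
]
for lhs, rhs in rules:
    c = confidence(lhs, rhs)
    print(sorted(lhs), "=>", sorted(rhs), round(c, 3), c >= MIN_CONFIDENCE)
```

Under this assumption the first rule has confidence 4/4 = 1.0 and passes the threshold, while the second has confidence 6/7 (about 0.857) and does not.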

6. CONCLUSION
This research work applies Apriori data mining based on association rules to determine the frequency of diseases
affecting patients and the number of patients affected by these diseases. The study is based on data from various
geographical areas and time periods. Existing electronic medical details obtained from hospitals are used as the
training data set for the analysis. The analysis concluded that patients are frequently affected by four different
diseases in different geographical areas during a particular year.

References
[1] Jyothi Soni, et al., “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”
[2] Carlos Ordonez, “Improving Heart Disease Prediction Using Constrained Association Rules”
[3] Maria-Luiza Antonie et al., “Application of Data Mining Techniques for Medical Image Classification”
[4] M. Ilayaraja, T. Meyyappan, “Mining Medical Data to Identify Frequent Diseases using Apriori
[5] Murugesan K., Md. Rukunuddin Ghalib, Gitanjali J., Indumathi J., Manjula D. (2009), “A pioneering Cryptic
Random Projection based approach for Privacy Preserving Data Mining”, in Proceedings of the IEEE
International Conference on Information Reuse and Integration (IEEE IRI-09), July 10-12, Las Vegas, USA, pp.
437-439.
[6] “Sprouting Modus Operandi for Selection of the Best PPDM Technique for Health Care Domain”, International
Journal Conference in Recent Trends in Computer Science. Vol. 1, No. 1, pp. 627-629.


AUTHORS
GITANJALI J received her M.Tech in IT Networking from Vellore Institute of Technology, India, in 2008. She works at
Vellore Institute of Technology as an Assistant Professor (Senior). She is currently pursuing her PhD at VIT
University, Vellore. Her research interests include Security for Data Mining, Networks, Software Engineering and
Ontology.

C. RANICHANDRA is working as Assistant Professor (Selection Grade) at VIT University, Vellore, Tamil Nadu, India.
She has fourteen years of teaching experience at VIT. Ranichandra was born in 1975 in Madurai District. She
graduated with a B.Tech (CSE) from Vellore Engineering College in 1997 and received her M.Tech (CSE) from VIT
University in 2008. She started her research work in 2009 in Grid Databases and is currently working on database
issues in the Cloud.

M. POUNAMBAL is an Assistant Professor (Selection Grade) in the School of Information Technology and Engineering
at VIT University, Vellore, India. She received her B.E. and M.Tech degrees in Computer Science. Her area of
interest includes Wireless Networks.
