
Computer Science Section

Disordered Metabolic Evaluation in Renal Stone Recurrence: A Data Mining Approach

1 Seyyed TAGHI ADL, 2 Arash GIVCHI, 3 Mohamad SARAEE, 4 Amid ESHRAGHI
1,2,3 Intelligent Databases, Data Mining and Bioinformatics Research Lab, Isfahan University of Technology, Isfahan, Iran
4 Medical University of Isfahan, Isfahan, Iran
1 t.ad@ec.iut.ac.ir, 2 a.givchi@ec.iut.ac.ir, 3 saraee@cc.iut.ac.ir, 4 A_Eshraghi@resident.mui.ac.ir

Abstract–Nephrolithiasis is a disease with a high and even rising incidence. It has a high morbidity, generates high costs and has a high recurrence rate. Metabolic evaluation in renal stone formers allows the identification and quantification of risk factors and the establishment of individual risk profiles. Based on these individual risk profiles, rational therapy for metaphylaxis of renal stones lowers the stone recurrence rate significantly.
The purpose of this article is a metabolic investigation of patients with nephrolithiasis in the city of Isfahan, Iran. Different data mining algorithms, such as clustering and classification, were employed to extract knowledge in the form of decision rules. These results evaluate the risk of morbidity and recurrence of the disease.
The medical attributes were gathered based on their medical importance. The data mining tasks in this research were applied and tested on 406 observed samples collected at different clinics in the city of Isfahan.

Keywords: Renal Stone Recurrence, Nephrolithiasis, Association Rules, Clustering, Classification.

I. INTRODUCTION

Metabolic evaluation in renal stone formers allows the identification and quantification of risk factors and the establishment of individual risk profiles. Based on these individual risk profiles, rational therapy for metaphylaxis of renal stones lowers the stone recurrence rate significantly.
This article focuses on ways to extract useful knowledge that can help in grouping patients and in predicting the future status of new patients. The proposed methods are based on different data mining approaches that try to extract high-quality patterns. These patterns can be used as a prediction and analysis tool in the studied region.
In this paper, association mining, clustering and classification techniques, each with its own uses and analyses, have been used to tackle the mining task from different aspects.
These data mining tasks and the corresponding analyses are important because many different attributes may contribute to the formation of renal stones.
Twenty-two key features, chosen for their medical importance, were derived from personal questionnaire forms and medical examinations such as CT scans.
The probability of recurrence in each patient is a challenging factor for further treatment and analysis by the medical community. Predicting it is difficult because a large number of attributes contribute to the formation of renal stones. This problem leads the medical community to make use of data mining techniques to enhance the quality and confidence of recurrence predictions.
In this article we follow three steps: extracting association rules, clustering the data into an appropriate number of clusters to find meaningful groups of patients, and finally designing a classifier based on the recurrence event attribute in the observed population and ranking the contributing features. Preprocessing the data and relevance analysis, as the first step, is an important phase for cleaning the data and removing impurities.
Association rule mining, a technique for extracting hidden rules that are readable and easily interpretable by medical experts, is widely regarded as an important technique in the data mining community. In our study, we first conducted association rule mining, which can be considered the most important way for a medical expert to find the important relations among the different features and properties of the data. Rules can reveal the associations and correlations among the various factors that are important for the recurrence of renal stones and the effects caused by other derangements in nephrolithiasis patients, such as hypocitraturia, hyperoxaluria, low urinary volume, hyperuricosuria, hypercalciuria and cystinuria. This study reports the support and confidence measurements, which are considered two of the best metrics of the quality of each rule.
The second method considered in our study is clustering. After preprocessing the data, we conducted hierarchical and partitional clustering techniques and compared them in terms of some popular quality measurements.
In the hierarchical clustering study we apply the single linkage, average linkage and complete linkage clustering algorithms and report their cophenetic distance as the measure of how the linkage algorithm affects the Euclidean distance matrix of the data. Dendrograms, visual plots that can be easily examined by medical experts, are provided for further analysis on the number of

Journal of Applied Computer Science & Mathematics, no. 11 (5) /2011, Suceava

appropriate clusters and additional discussions.
In the partitional clustering technique, we use the K-Means algorithm with repeated executions over the dataset to find the best number of clusters. In this study, the mean of the silhouette measurements has been applied as the quality factor.
Feature ranking and classification of the dataset have been applied to find the factors that most affect the recurrence of renal stones. Features are ranked by a class separability criterion, and the corresponding evaluation metric is discussed further.
The classification task is conducted with Support Vector Machines (SVM), using the “holdout” cross-validation technique to enhance the quality of classification. The K-Nearest Neighbor classification approach has also been applied to the dataset, and the corresponding accuracies are provided.
Section 2 is devoted to data mining with association rules to find elements contributing to the recurrence of renal stones. Section 3 clarifies the methods for clustering the dataset and for finding the best clusters in the data. Classification and feature ranking are covered in Section 4. The last section is devoted to the discussion and conclusion.

II. ASSOCIATION RULE MINING

Association rule mining is a popular method that gives clear and understandable rules which can be easily interpreted by the medical community. This technique has been applied to the collected dataset with the pre-specified “recurrence event” attribute as the consequent of the extracted rules. This attribute plays an important role in predicting the future recurrence of renal stones in patients who are at risk of recurrence.
An algorithm called Generalized Rule Induction (GRI) has been applied to the dataset. GRI extracts a set of rules from the data, pulling out the rules with the highest information content. Information content is measured using an index that takes both the generality (support) and the accuracy (confidence) of the rules into account [1]. This method of association rule mining is considered another version of the well-known Apriori algorithm [2]. The generated rules fall into two groups: those with the recurrence attribute equal to yes and those with the recurrence attribute equal to no. The following table provides some of the most important rules generated; it also lists rules whose consequent is another attribute (job, gender or smoking).

TABLE 1: ASSOCIATION RULES

Consequent | Antecedent | Support % | Confidence %
recurrence = YES | job = self employed and volume.urine < 1355.000 and age > 32.500 | 5.95 | 96.15
recurrence = YES | job = self employed and smoking and volume.urine < 1355.0 and calcium.urine > 66.500 | 5.72 | 96
recurrence = YES | job = self employed and smoking and volume.urine < 1355.000 | 7.32 | 87.5
recurrence = NO | cictein < 38.500 and weight < 80.500 and aciduric.serum < 6.150 and aciduric.urine < 620.500 and calcium.urine < 168.500 | 20.14 | 80.68
recurrence = NO | cictein < 38.500 and weight < 80.500 and sodium.urine < 220.500 and calcium.serum < 9.650 and calcium.urine < 144.500 | 18.54 | 80.25
recurrence = NO | cictein < 38.500 and weight < 80.500 and sodium.urine < 220.500 and calcium.urine < 143.500 and age < 58.500 | 17.16 | 84
job = homemaker | recurrence = NO and Gender = Female and citrat > 321.500 and oxalat > 28.700 and oxalat < 55.500 | 9.84 | 97.67
job = homemaker | recurrence = NO and smoking and Gender = Female and sodium.urine > 167.500 and aciduric.urine > 438.500 | 11.21 | 93.88
job = homemaker | recurrence = YES and smoking and Gender = Female and aciduric.urine < 568.500 | 12.13 | 94.34
Gender = Male | job = Employee and aciduric > 4.550 and sodium.urine > 119.500 and keratinin.serum > 0.885 | 6.41 | 100
Gender = Male | job = Employee and aciduric > 4.550 and age > 38.500 and volume.urine > 1390.000 | 7.09 | 96.77
smoking | sodium < 138.500 and citrat > 273.500 and oxalat > 25.600 and aciduric > 4.350 and citrat < 457.500 | 9.15 | 85
smoking | sodium < 138.500 and citrat > 273.500 and aciduric > 4.350 and citrat < 457.500 and weight < 83.500 | 8.92 | 84.62
smoking | sodium < 138.500 and oxalat > 25.600 and citrat > 273.500 and citrat < 457.500 and calcium < 9.950 | 8.7 | 84.21

III. CLUSTERING

The first clustering method we applied to the data is hierarchical clustering.
Single, complete and average linkage clustering techniques, three popular agglomerative algorithms, have been used in our study [3].
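The three linkage criteria differ only in how the distance between two clusters is measured: the smallest, the largest or the mean pairwise distance between their members. The following is a minimal pure-Python sketch of that idea, not the MATLAB routine used in the study. The function names and the toy points are illustrative assumptions. It also computes a cophenetic correlation, assuming the usual definition: the Pearson correlation between the original pairwise distances and the heights at which each pair of samples is first merged.

```python
from itertools import combinations
from math import dist, sqrt


def pairwise_distances(points):
    """Euclidean distance for every pair of sample indices (i, j), i < j."""
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}


def cluster_distance(a, b, d, linkage):
    """Distance between clusters a and b (lists of sample indices)."""
    gaps = [d[min(i, j), max(i, j)] for i in a for j in b]
    if linkage == "single":
        return min(gaps)
    if linkage == "complete":
        return max(gaps)
    return sum(gaps) / len(gaps)  # "average"


def cophenetic_distances(points, linkage="average"):
    """Naive agglomerative clustering; records the merge height of each pair."""
    d = pairwise_distances(points)
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Pick the two closest clusters under the chosen linkage.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], d, linkage))
        height = cluster_distance(a, b, d, linkage)
        for i in a:
            for j in b:
                coph[min(i, j), max(i, j)] = height
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return d, coph


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def cophenetic_correlation(points, linkage="average"):
    d, coph = cophenetic_distances(points, linkage)
    pairs = sorted(d)
    return pearson([d[p] for p in pairs], [coph[p] for p in pairs])


# Hypothetical toy data: two tight pairs of points far apart from each other.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
for method in ("single", "complete", "average"):
    print(method, round(cophenetic_correlation(toy, method), 4))
```

This search over all cluster pairs is slow on large datasets and is meant only to make the definitions concrete. A library routine would normally be used for a dataset of several hundred samples.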


Figure 1: Single Linkage Dendrogram
Figure 2: Complete Linkage Dendrogram
Figure 3: Average Linkage Dendrogram

The dendrograms produced by each technique are shown in Figures 1, 2 and 3, respectively.
The most common way to evaluate the quality of a hierarchical clustering algorithm is the cophenetic distance (more precisely, the cophenetic correlation coefficient). This metric is defined as the correlation between the distance matrix of the data and the cophenetic matrix produced by the linkage process. The cophenetic matrix is a square matrix that gives, for each pair of samples, the distance level at which the two samples are first joined in the dendrogram [described in 5].

TABLE 2: COPHENETIC DISTANCE OF HIERARCHICAL CLUSTERING ALGORITHMS

Single | Complete | Average
0.8584 | 0.8438 | 0.9280

The cophenetic distance of each of the studied methods is given in Table 2. This table shows that the average linkage method fits the characteristics of the dataset best. One advantage of hierarchical clustering is that the dendrograms in Figures 1, 2 and 3 are visual tools in the hands of the expert, who can define the appropriate cutting level for clustering based on his or her experience. This process leads to different numbers and sizes of clusters, which can be used to compare the characteristics and attributes that individuals within each cluster have in common. The height at which two branches of the dendrogram join shows how dissimilar the intermediate clusters merged at that step are.
As an example, in the average linkage dendrogram in Figure 3, if the expert chooses a cutting level of 0.5, two clusters are generated: one that contains only patient number 19 and another that consists of all other patients. This shows that patient number 19 is far from the other cluster, by the height at which these two clusters join, and this difference should be examined in the attribute values of the two clusters.
The second clustering method we applied to the dataset is a partitional one. K-means, one of the most popular clustering techniques, generates a number of clusters that is fixed in advance [4]. The most important question in this method is what the best number of clusters is. Silhouette plots are usually used to inspect visually how the clusters are distributed. We used the mean of the silhouette measurements to assess the quality of the clustering for different numbers of clusters. Table 3 provides these measurements. This study has demonstrated that the best number of clusters for this specific dataset is 2. It is important to note that, like many other types of numerical optimization, the solution that K-means reaches often depends on the starting points, so we ran the K-means algorithm 10 times and report the average of the measurements in this table.
Clustering in our study was conducted using the MATLAB toolbox [5].

IV. CLASSIFICATION

A. Feature Rank
The absolute value of the two-sample t-test statistic with pooled variance estimate [described in 5] has been used as the class separability criterion for the key features. The class label for this task is the “recurrence event”. The ranked features with their corresponding t-test values are gathered in Table 4.
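The feature-ranking criterion of Section IV.A can be made concrete with a short sketch. It assumes the standard pooled-variance form of the two-sample t statistic, |mean1 - mean2| / sqrt(sp^2 (1/n1 + 1/n2)), where sp^2 is the pooled sample variance. The helper names `pooled_t` and `rank_features` and the toy values are hypothetical and are not taken from the study's data or from the MATLAB toolbox.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Absolute two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(a), len(b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return abs(mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def rank_features(columns, labels):
    """Score each feature column by how well it separates the two classes.

    columns: dict mapping feature name -> list of numeric values
    labels:  list of booleans (True = recurrence), same length as each column
    """
    scores = {}
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y]
        neg = [v for v, y in zip(values, labels) if not y]
        scores[name] = pooled_t(pos, neg)
    # Highest score first: the most class-separating feature leads the list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, for illustration only.
toy_columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_b": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
toy_labels = [False, False, False, True, True, True]
for name, score in rank_features(toy_columns, toy_labels):
    print(name, round(score, 4))
```

A larger score means the class means lie further apart relative to the within-class spread, which is why such a statistic can serve as a class separability criterion.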


TABLE 3: SILHOUETTE METRICS WITH DIFFERENT CLUSTERING NUMBERS WITH K-MEANS ALGORITHM

2 clusters | 3 clusters | 4 clusters | 5 clusters | 6 clusters
0.5378 | 0.4615 | 0.3553 | 0.4601 | 0.4400

TABLE 4: FEATURE RANK TABLE

Key Features | Absolute value two-sample t-test with pooled variance estimate
Gender | 2.4788
Weight | 3.3838
Age | 3.3931
Job | 0.6240
Smoking | 1.5915
other disease | 0.1290
Family history | 2.7153
Albumin | 0.0028
Calcium | 0.4053
Citrate | 0.3910
ceratinin.serum | 1.2886
Cyctin | 0.2663
Oxalate | 4.3632
Potassium | 2.5473
Sodium | 0.9588
Uric acid | 0.7768
volume.urine | 0.1807
ceratinin.urine | 1.4302
calcium.urine | 1.9340
sodium.urine | 0.8003
Uric acid.urine | 1.2606

B. Classification
In this task, the support vector machine and K-nearest neighbor, two well-known classification approaches [6, 7], have been applied to the dataset with the “holdout” cross-validation technique [described in 5].
In the k-nearest neighbor method we computed the appropriate number of neighbors as a root of the number of samples, where the degree of the root is the number of attributes plus one; this gives 1.3.
In both methods we used the “recurrence event” as the class label.
Classification matters in our study because the expert needs to categorize new patients based on the samples we already have.
The accuracy of each classification method is provided in Table 5. As Table 5 shows, the SVM method gives better accuracy than the K-nearest neighbor methods, although the accuracies are not far from each other.
Recently, data mining has been widely adopted in the medical community as a vital approach alongside expert experience.

TABLE 5: CLASSIFICATION ACCURACY

Classification method | SVM | 1-NN | 2-NN | 3-NN
Accuracy | 0.5883 ± 7.0798e-004 | 0.5540 ± 3.0116e-005 | 0.5583 ± 3.6908e-007 | 0.5600 ± 7.1895e-007

V. DISCUSSION AND CONCLUSION

In this article, we have considered several approaches to mining data gathered from nephrolithiasis patients in the city of Isfahan. The mining approaches were divided into three categories. In the first approach we extracted some important association rules, which can be easily interpreted by experts and can be used to predict whether the renal stone in a specific patient will recur. We also provided the confidence and support of each rule as a measure of its reliability. The second category was devoted to clustering approaches. Hierarchical clustering algorithms were used to cluster the data, and the cophenetic distance, as a criterion for the quality of the clustering, showed that average linkage provided the better clustering. The cutting level in these methods can be defined by the expert, who can then categorize the patients based on this criterion. Among the clustering algorithms we also tested the K-means algorithm, which demonstrated that the best number of clusters, the one that most increases the inter-class separability and decreases the intra-class difference among individuals, is 2.
The last category was devoted to the classification task. Here we tested SVM and KNN, two well-known approaches that can be very useful to experts when they see new patients. New patients can be easily categorized by the expert with the constructed classifier. In our study we found SVM to be a better classifier than KNN.

REFERENCES

[1]. Khabaza, T.; Shearer, C.; "Data mining with Clementine," Knowledge Discovery in Databases, IEE Colloquium on, pp. 1/1-1/5, 2 Feb 1995.
[2]. Agrawal R, “Mining association rules between sets of items in large databases”, Proceedings of the 1993 ACM SIGMOD Conference, Washington, pp. 207-216, November 1993.
[3]. Julio F. Navarro, Carlos S. Frenk and Simon D. M. White, “A Universal Density Profile from Hierarchical Clustering”, The Astrophysical Journal, 490:493-508, 1997 December 1.
[4]. MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, University of California Press, 1, 281-297.
[5]. http://www.mathworks.com/matlabcentral/linkexchange/links/1304-data-mining-in-matlab 2010-08-23


[6]. K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181-201, 2001.
[7]. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.

Taghi Adl received the B.Sc. degree in Computer Engineering-Hardware (2009) from the Department of Electrical and Computer Engineering, Shahid Beheshti University, and the M.Sc. degree in Computer Architecture (2011) from the Department of Electrical and Computer Engineering, Isfahan University of Technology. In 2010, he joined the Data Mining Lab at Isfahan University of Technology. His current research interests include data mining.

Arash Givchi received the B.Sc. degree in Computer Engineering-Software (2009) from the Department of Electrical and Computer Engineering, Isfahan University, and the M.Sc. degree in Artificial Intelligence and Robotics (2011) from the Department of Electrical and Computer Engineering, Isfahan University of Technology. In 2010, he joined the Data Mining Lab at Isfahan University of Technology. His current research interests include data mining and robotics.

Mohamad Saraee received his PhD in Computation from the University of Manchester. His main areas of research are intelligent databases, mining of advanced and complex data including medical and biological data, text mining and e-commerce. He has published extensively in each of these areas and has served on the scientific and organizing committees of a number of journals and conferences.

Amid Eshraghi received the Doctorate degree as a general physician (2002) from Medical Science of University of Mashhad and qualified as a specialist in internal medicine (2009) at Medical Science of University of Isfahan. He is studying for the gastroenterology subspecialty at Medical Science of University of Tehran.
