Information Sciences 576 (2021) 609–641


Identification of adverse disease agents and risk analysis using frequent pattern mining
Shafiul Alom Ahmed *, Bhabesh Nath
Department of Computer Science and Engineering, Tezpur University, Tezpur, Assam, India

a r t i c l e  i n f o

Article history:
Received 2 December 2020
Received in revised form 16 July 2021
Accepted 18 July 2021
Available online 21 July 2021

Keywords: FP-tree, FP-growth, Frequent pattern, Pattern mining, Data mining, Frequent itemset, Itemset mining, Pattern analysis

a b s t r a c t

Life-threatening illnesses such as cancer, cirrhosis of the liver, and hepatitis have become crucial problems for humanity. The risk of mortality can be reduced by the early detection of symptoms and providing the best possible diagnosis. This critical role of detection and/or diagnosis can be enhanced using data mining techniques such as periodic pattern mining, association rule mining, and classification. Analyzing the commonly occurring patterns or signs, followed by correlation analysis among those patterns, can be practiced for early detection and improved diagnosis. Towards the adoption of association rule mining, devising a cost-effective and time-saving algorithm for mining frequent patterns plays an important role. In this paper, we propose an approach to pattern mining called Improved Frequent Pattern Growth (Improved FP-Growth). Firstly, it constructs an improved frequent pattern tree data structure called the Improved FP-tree. Moreover, Improved FP-Growth introduces a conditional FP-tree data structure layout called the Improved Conditional Frequent Pattern Tree (Improved Conditional FP-tree). Unlike the traditional FP-Growth method, it uses both top-down and bottom-up approaches to efficiently generate frequent patterns without recursively constructing the improved conditional FP-tree. The experimental results emphasize the significance of the proposed Improved FP-Growth algorithm over a few traditional frequent itemset mining algorithms that adopt the approach of recursive conditional FP-tree construction.

© 2021 Elsevier Inc. All rights reserved.

1. Introduction

In the year 2018, the International Agency for Research on Cancer conducted a study on cancer in India. The agency reported the incidence and mortality rates of the top ten cancers in India [1], as shown in Figs. 1 and 2. From the statistics reported in Fig. 1, it can be observed that breast, cervix uteri, and ovary cancers are the top three cancers with the highest mortality ratios among females in India. Similarly, tongue, oral cavity, and lung are the top three cancers with high mortality rates among males. The agency also reported that breast cancer (among women of all ages) and lip and oral cavity cancer (among men of all ages) accounted for the highest numbers of new cases in 2018 among all cancers, as shown in Figs. 3a and 3b respectively. The statistics published in 2018 by the International Agency for Research on Cancer reported a total of 1,157,294 new cancer cases and 784,821 cancer deaths over the past five years in India, including all forms of cancer across

* Corresponding author.
E-mail addresses: tezu.shafiul@gmail.com (S.A. Ahmed), bnath@tezu.ernet.in (B. Nath).

https://doi.org/10.1016/j.ins.2021.07.061
0020-0255/© 2021 Elsevier Inc. All rights reserved.

Fig. 1. Age-standardized (World) incidence rates per sex, top 10 cancers (data collected from [1]).

Fig. 2. Age-standardized (World) incidence and mortality rates, top 10 cancers (data collected from [1]).

all age groups and genders. Among the total registered cancer cases, about 67.81% resulted in death, and this figure is projected to rise in the near future.
Like cancer, hepatitis is also a predominant disease that impairs the liver and leads to cirrhosis. The WHO's Global Hepatitis Report [2] recorded 1.34 million deaths worldwide due to viral hepatitis in 2015. Among them, 96% of deaths were due to chronic HBV (66%) and HCV (30%) infections; a further 0.8% of deaths were due to hepatitis A and 3.3% to hepatitis E. Given these growing fatalities, it is therefore essential to detect the causal factors and to examine and understand their behavior and trends for early detection, prevention, and appropriate diagnosis.
To achieve rapid recovery and reduce the risk of death, it is crucial to recognize the disease or its symptoms at an early stage. The diagnostic procedures for these adverse diseases are expensive, cumbersome, and prone to error. Without an in-depth evaluation of the possible causes, or insightful knowledge about the cause of the disease, delivering treatment based on the intuition or guidance of medical practitioners may often lead to a compromised diagnosis. In this


Fig. 3. Number of new cases among.

age of computers, health care networks store and update vast quantities of patient data every day. Processing these large quantities of data and extracting the useful information with traditional data analysis methods is not a simple job. In 1997, Walter Gulbinat of the Department of Mental Health and Drug Abuse (MSD) of the World Health Organization (WHO) first recognized the applicability of information retrieval and computational intelligence techniques to medical problems in many domains [3]. The studies by Koh et al. [4], Niak'su et al. [5] and Durairaj et al. [6] state that data mining techniques are well suited to medical data analysis. In order to improve the standard of medical diagnosis, WHO researchers primarily concentrate on surveillance, clinical management, and disease prevention by analyzing medical databases with data mining techniques to extract valuable knowledge or trends and to explain the correlations between them [7]. Of all the data mining techniques, researchers have most often used association rule mining to evaluate medical datasets [8-10]. Association rule mining operates on support/confidence thresholds defined by the user. It can be divided roughly into two phases: in the first phase, it analyzes the dataset and produces frequent patterns based on the minimum support threshold; in the second phase, it measures the relations between these frequent patterns. The frequent pattern mining techniques adopted in several fields of data analysis have revealed that mining frequent patterns from an extensive database is a computationally costly task. If the dimensionality of the database is vast and the minimum support threshold is low, an immense number of patterns will be generated, and it takes a considerable amount of time and space to compute and maintain these patterns appropriately.
Researchers have paid tremendous attention from the beginning to overcoming the aforementioned obstacles of frequent itemset mining. Agrawal et al. first drew attention to frequent itemset mining with the Apriori algorithm in 1993 [11]. The Apriori algorithm is a level-wise computation that performs several database scans and produces a large number of candidate itemsets. Besides producing the full collection of frequent itemsets, it exercises a costly test-and-prune approach to discard redundant and infrequent candidate itemsets. Later on, several attempts were made by researchers to devise efficient methods based on the Apriori approach to mine frequent itemsets from large datasets. However, most of the proposed methods suffer from multiple database scans, high computation time (for candidate itemsets), and huge space requirements. In 2000, a group of researchers, Han et al., developed a prefix-path tree-based method called FP-Growth [12] to mitigate the issues of multiple database scans and massive candidate itemset generation. FP-Growth handles the multiple-scan problem by restricting the number of scans to two. Moreover, it can generate the full set of frequent itemsets without generating any candidate itemsets. Frequent itemset mining with the FP-Growth approach can also be represented in two steps, as detailed below.
Step 1: FP-tree Construction:
In a tree data structure called the FP-tree, the tree construction algorithm materializes the database transactions in such a way that more frequently occurring items are located closer to the root node, so that prefix paths can be shared as often as possible. This compressed tree data structure is built using only two scans of the database. The initial scan is performed over the database to determine the total number of occurrences (support count or frequency) of the individual items. After that, the infrequent items, i.e., those with a support count less than the user's support threshold, are discarded. The remaining frequent items are sorted in descending order of their frequency counts and placed into a list structure called the header table along with their corresponding frequency counts. Afterwards, the transactions are fetched one at a time during the second scan of the database. The transaction items, except the infrequent ones, are rearranged according to the pre-assigned order of the header table items. The rearranged frequent transaction items are then inserted into the FP-tree as tree nodes. For each entry or item in the header table, the header table maintains a link pointing to the first occurrence of that item in the FP-tree. The tree nodes are maintained by links among parent, child, sibling, and same-item nodes. During tree creation and in later stages, the header table and the node links are used to ease tree traversal.
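The two-scan construction described above can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's implementation; the class and variable names are our own, and the dictionary-based child lookup is one of several possible representations:

```python
from collections import Counter

class Node:
    """An FP-tree node: item label, support count, children, parent link."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item label -> child Node

def build_fp_tree(transactions, min_count):
    # Scan 1: count item frequencies and keep only the frequent items.
    freq = Counter(i for t in transactions for i in t)
    header = {i: c for i, c in freq.items() if c >= min_count}
    # Header-table order: frequency-descending (ties broken alphabetically).
    order = sorted(header, key=lambda i: (-header[i], i))
    rank = {i: r for r, i in enumerate(order)}

    root = Node(None)
    # Scan 2: insert each transaction, sharing common prefixes.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for i in items:
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, parent=node)
            node = node.children[i]
    return root, order
```

Because every transaction is re-sorted into the same global order before insertion, transactions that share frequent items also share tree prefixes, which is what compresses the database.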
Step 2: Mining the Frequent Itemsets:
The next step is to extract the critical information, i.e., the frequent itemsets, from the FP-tree after the database has been successfully materialized into it. Itemset generation begins with the lowest header table item and traverses


upward until the top-positioned item. For example, to mine the frequent patterns from the FP-tree for an item 'x' (called the suffix pattern), all the prefix paths (called the conditional pattern base) appearing in the tree beginning with the nodes containing the suffix pattern, i.e., item 'x', are discovered. The support count of each node within the conditional pattern base of Node(x) is set to the support count of Node(x). After the conditional pattern bases are discovered, they are inserted into a special kind of projected data structure called the conditional FP-tree. If two pattern bases contain a common prefix, the prefixes share the same path in the conditional FP-tree, and the support counts of the shared prefixes are combined to compute the total support count. After a conditional FP-tree is created, it is mined by invoking a pattern growth method called FP-Growth. FP-Growth carries out the mining by recurrently creating a conditional frequent pattern tree for each frequent itemset or pattern, considering it to be a new suffix. It continues to invoke itself repeatedly until all the frequent itemsets are determined for the base suffix pattern. For each frequent item of the FP-tree header table, this process of repeated conditional FP-tree construction and mining by recursive invocation of FP-Growth continues until the entire set of frequent itemsets is produced. The best property of the FP-Growth algorithm is that it can discover the entire range of frequent patterns without any data loss.
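The prefix-path discovery at the heart of this step can be illustrated with a small sketch. This is a simplification under stated assumptions: we walk the whole tree instead of following the header-table node-links as a real implementation would, and the class and function names are ours:

```python
class Node:
    """Minimal FP-tree node for illustration."""
    def __init__(self, item, count=0, parent=None):
        self.item, self.count, self.parent = item, count, parent
        self.children = {}

    def add(self, item, count):
        child = Node(item, count, parent=self)
        self.children[item] = child
        return child

def conditional_pattern_bases(root, suffix):
    """Collect (prefix_path, count) pairs for every node labelled `suffix`.

    Each prefix path is read bottom-up from the suffix node's parent to the
    root; its count is set to the suffix node's own count, as the pattern
    growth method prescribes.
    """
    bases = []

    def walk(node):
        for child in node.children.values():
            if child.item == suffix:
                path, p = [], child.parent
                while p is not None and p.item is not None:
                    path.append(p.item)
                    p = p.parent
                if path:
                    bases.append((list(reversed(path)), child.count))
            walk(child)

    walk(root)
    return bases
```

For example, in a tree with branches root-d:8-b:5-e:3 and root-b:2-e:2, the suffix 'e' yields the pattern bases (d b : 3) and (b : 2), which would then be merged into a conditional FP-tree.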
Despite some shortcomings, the research community has well acknowledged the use of the FP-Growth algorithm to mine frequent patterns effectively. FP-Growth's key drawback is the recursive creation and repeated mining of conditional FP-trees. The algorithm constructs multiple sub-conditional FP-tree structures for a single suffix pattern and holds them in main memory consecutively. Therefore, keeping multiple sub-trees in main memory along with the FP-tree itself can often lead to a colossal search space and primary memory scarcity for large databases [13]. Researchers have tried several improvements over the FP-tree construction algorithm alone in order to improve the performance of overall FP-Growth; however, the academic community has given less attention to the mining problem. In this paper, a novel pattern growth approach called Improved FP-Growth (IFP-Growth) is proposed to alleviate the above-mentioned issues of existing frequent pattern mining approaches. IFP-Growth can boost the efficiency of both the FP-tree construction and conditional FP-tree mining algorithms. The main contributions of this work can be summed up as follows.

* An enhanced FP-tree construction algorithm that minimizes the construction time of the Improved FP-tree (IFP-tree).
* An efficient conditional FP-tree data structure called the Improved Conditional FP-tree (ICFP-tree).
* An efficient pattern growth algorithm called IFP-Growth to discover the frequent itemsets from the ICFP-tree.
* An approach to efficiently generating the frequent itemsets from different medical databases and identifying the culpable agents.

2. Preliminaries

The main objective of frequent itemset mining (FIM) is to generate the frequently occurring patterns or itemsets from transactional databases that are useful and meet the users' criteria in decision making. The usefulness and interestingness of the generated patterns are assessed by some of the most widely used measures, as discussed below.
Let $D$ be a transactional database with items $I = \{i_1, i_2, \ldots, i_m\}$ and set of all transactions $T = \{t_1, t_2, \ldots, t_n\}$. Each transaction $t_j$ is a subset of $I$. A transaction $t_j$ is said to contain an itemset $P$ if $P \subseteq t_j$.

Definition 1 (Support Count ($\sigma$)). The support count is the total number of transactions in the database that contain an itemset. Mathematically, the support count $\sigma$ of an itemset $P$ can be represented as:

$$\sigma(P) = |\{t_j \mid P \subseteq t_j,\ t_j \in T\}| \quad (1)$$

Definition 2 (Support (Supp)). The support of an itemset $P$ is the fraction of transactions in the transactional database $D$ that contain $P$. Mathematically, the support of an itemset $P$ can be represented as:

$$Supp(P) = \frac{\sigma(P)}{|T|} \quad (2)$$

Definition 3 (Frequent Pattern or Frequent Itemset). An itemset is said to be frequent if its support is greater than or equal to the user-specified minimum support threshold $minSupp$. Formally, an itemset $P$ is frequent if it satisfies the following constraint:

$$Supp(P) \geq minSupp \quad (3)$$
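Definitions 1-3 translate directly into code. The following is a minimal sketch (the function names are our own) treating itemsets and transactions as Python sets:

```python
def support_count(P, T):
    """sigma(P): number of transactions in T that contain itemset P (Eq. (1))."""
    return sum(1 for t in T if P <= t)     # P <= t tests the subset relation

def support(P, T):
    """Supp(P) = sigma(P) / |T| (Eq. (2))."""
    return support_count(P, T) / len(T)

def is_frequent(P, T, min_supp):
    """P is frequent iff Supp(P) >= minSupp (Eq. (3))."""
    return support(P, T) >= min_supp
```

For example, with T = [{'a','b'}, {'a','c'}, {'a','b','c'}], the itemset {'a','b'} has support count 2 and support 2/3, so it is frequent for any minSupp up to 2/3.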


3. Related Work

3.1. Data mining techniques in disease analysis

The scientific community has used many computerized intelligence systems to classify and evaluate the agents responsible for diseases. In the literature, we find that data analysis techniques such as decision trees, neural networks, logistic regression, and support vector machines (SVMs) are being used for the early detection of diseases and their risk factors. The strategies used by decision trees to assess disease risk require additional selection procedures, which are ineffective for managing inconsistent data [13]. While complex non-linear correlations within itemsets can be easily analyzed using neural networks, their usefulness is restricted by their high execution-resource requirements [14]. The SVM works well in predicting a disease's agents, but its computation is costly in terms of time and space [15]. Identification of disease agents using the

Table 1
Data analysis techniques for predicting disease agents.

- Decision Tree, ANN and Logistic Regression, 2005 [16]: Predicts breast cancer survivability. The experimental results showed that the decision tree induction method (C5) performed the best, with a higher classification accuracy than the ANN model and the logistic regression model.
- Fuzzy Decision Tree, 2008 [17]: Also predicts breast cancer survivability. Performance comparisons suggest that, for cancer prognosis, hybrid fuzzy decision tree classification is more robust and balanced than independently applied crisp classification.
- Association Rule (AR) and Neural Network (NN), 2009 [18]: Proposed to detect breast cancer. This research demonstrated that the AR can reduce the feature vector's dimension, and the AR + NN model can be used to obtain efficient automatic diagnostic systems for other diseases.
- Self-organizing map (SOM), radial basis function network (RBF), general regression neural network (GRNN) and probabilistic neural network (PNN), 2010 [19]: Predict breast cancer survivability. RBF and PNN proved to be the best classifiers on the training set; however, the PNN gives the best classification accuracy when the test set is considered.
- Bayesian Neural Network, 2011 [20]: Detects breast cancer, with a correct classification rate of 98.1%. This research showed that preprocessing is necessary on these data, and that a combination of ReliefF and a Bayesian network can be used to obtain fast, automatic diagnostic systems for breast cancer.
- Support Vector Machine Classifier and Rough Set-based Feature Selection, 2011 [21]: Used for breast cancer diagnosis. Experimental results demonstrate that the proposed RS SVM can achieve very high classification accuracy and detect a combination of five informative features, which can give an essential clue to physicians for breast diagnosis.
- Hybrid Method Based on SVM and Simulated Annealing (SVM-SA), 2012 [22]: Helps in hepatitis disease diagnosis. SVM-SA classification accuracy via 10-fold cross-validation is claimed to be 96.25%, which is very promising with respect to the other classification methods in the literature.
- Rough Set and Extreme Learning Machine, 2013 [23]: Diagnoses hepatitis disease. The experimental results of this method claim the highest classification accuracy of 100.00%.
- Feature Extraction Using a Hybrid of K-means and SVM, 2014 [24]: Proposed for breast cancer diagnosis. A hybrid of K-means and support vector machine (K-SVM) algorithms is utilized to recognize the hidden patterns of benign and malignant tumors separately and to obtain a new classifier to differentiate incoming tumors.
- Feature Extraction by a Hybrid of K-means and Extreme Learning Machine Algorithms, 2016 [25]: Diagnoses breast cancer. A hybrid of K-means and support vector machine (K-SVM) algorithms is developed to extract useful information and diagnose the tumor.
- Back Propagation Neural Network (BPNN) and Radial Basis Neural Networks (RBFN), 2017 [26]: Classifies breast cancer images. This method performs the classification of the images using BPNN and RBFN.
- Naive Bayes, RBF Network, J48, 2018 [27]: Predicts benign and malignant breast cancer. The results indicate that Naive Bayes is the best predictor with 97.36% accuracy, better than RBF Network and J48.
- Naive Bayes, K-nearest neighbor and Random Forest classifier, 2019 [28]: Detects hepatitis (A, B, C, and E) viruses. Naive Bayes achieved an accuracy of 93.2%, the Random Forest classifier 98.6% using 10-fold cross-validation, and K-nearest neighbor 95.8% using 10-fold cross-validation.
- Grey Wolf Optimization (GWO) algorithm and SVM classifier, 2020 [29]: Breast and colon cancer classification. The methodology uses information gain (IG) to select the essential features of the input patterns. The selected features (genes) are then reduced by applying the GWO algorithm. Finally, an SVM classifier is employed for cancer type classification.


frequently occurring itemsets (contributing agents) and the study of associations between these agents in medical databases has received less attention. However, the work of Ordonez et al. [8], which is based on frequent itemset or association rule mining, is recognized as more trustworthy than decision tree classification rules for predicting heart disease. Table 1 shows a few medical data analysis techniques that evolved over the last two decades to predict the survivability and diagnosis of breast cancer and hepatitis.
Apart from breast cancer and hepatitis, the literature contains many other works, such as [30-34], that use different data mining techniques to analyze several other diseases. Over the past decade, deep learning has demonstrated superior performance in solving many problems in various fields of medicine compared with other machine learning methods. Several deep learning studies have been reported in recent years that focus on the diagnosis and treatment of cardiovascular disease [35], diagnostic and prognostic analysis of COVID-19 [36], and disease inference from health-related questions [37]. However, deep learning is about learning through deep neural networks on large datasets. By contrast, frequent pattern mining is an algorithm or procedure that finds patterns among the features or items of a dataset and describes the relationships among the items in those patterns. "The frequent patterns can not be mined by deep learning, but they can be used as features for training a deep network" [38].

3.2. Frequent itemset mining

The Apriori algorithm [11] is known to be the basis of frequent itemset mining algorithms. Apriori and FP-Growth [12] are recognized as the best-known and most widely accepted algorithms for frequent pattern mining among all current approaches. However, the Apriori algorithm uses a level-wise approach to candidate itemset generation, so obtaining the range of all frequent itemsets requires several repetitive database scans. Han et al. [39] subsequently suggested a tree data structure to deal with the issues of the Apriori algorithm. It reduced the number of database scans to only two and uses a candidate-generation-free approach to discover the collection of all frequent itemsets. Although FP-Growth's output is noteworthy compared with candidate generation approaches, it has some compelling disadvantages. The recursive construction of conditional FP-trees leads to the need for a large amount of memory and processing time. Notably, when the minimum support threshold is low for high-dimensionality datasets, its performance degrades significantly; at some point, its performance becomes almost identical to that of Apriori. FP-tree construction approaches using a single database scan consume a significant amount of time restructuring the tree to maintain the FP-tree prefix paths in frequency-descending order (the FP-tree property) [40]. Table 2 provides a summary of the FP-tree-based frequent pattern mining algorithms and their advantages and disadvantages.

Table 2
Summary of FP-Growth based frequent itemset generation algorithms in the literature.

- H-mine [41] (2 database scans): H-mine generates the patterns from the hyper-structure by recursively invoking the FP-Growth algorithm. Therefore, it suffers from the same recursive conditional FP-tree construction problem.
- Opportunistic Projection [42] (2 scans): Frequent itemset mining using Opportunistic Projection uses the concepts of both FP-Growth and H-mine. Since FP-Growth works better with dense datasets only, for sparse datasets the Opportunistic Projection method, based on the characteristics of the sub-databases, intelligently switches to an array-based technique to save tree construction time.
- COFI-tree [43] (2 scans): COFI-tree intelligently mines the frequent patterns without recursively constructing the conditional FP-trees. However, it generates the candidate patterns by considering one branch or prefix path of the conditional FP-tree. Therefore, merging the frequency counts of the redundant candidate patterns and then pruning out the infrequent patterns consumes a notable amount of time.
- Inverted Matrix [44] (2 scans): It stores the transactional database in a relational database model consisting of a 2-dimensional array of size n x m, where n is the number of features or items and m is the total number of transactions in the database. However, executing the different procedures to generate the candidate itemsets from this table is very expensive, and for extensive-scale databases it sometimes becomes practically infeasible.
- FP-Growth* [45] (2 scans): This algorithm uses the concept of closed frequent itemsets to minimize the overhead of computing the frequencies of sub-itemsets. It uses many data structures to implement the algorithm.
- CT-PRO [46] (N scans): It reduces the number of nodes in the tree to up to half of the corresponding FP-tree. The main disadvantage of the CT-PRO algorithm is its worst-case complexity of O(2^(2n)).
- nonordfp [47] (2 scans): Since the frequency of each node of the trie is stored in an array, traversal becomes easier, requiring just a sequential read of the array. However, the array needs sequential memory space, and for massive or dynamic datasets this becomes infeasible.
- CFP-Growth [48] (2 scans): It reduces the tree size and performs best when both the CFP-tree and CFP-array data structures fit into main memory. The CFP-array is static, i.e., once built it cannot be changed or updated, so it is inapplicable for mining frequent patterns from incremental datasets.
- Improved FP-Growth [49] (at most N scans): The algorithm reduces the tree reconstruction time and also saves memory. It maintains an address table for the nodes of the FP-tree to perform searching efficiently. However, this algorithm requires an appropriate tree-level value to divide the header table items, which affects the algorithm's mining performance.

(N denotes the length of the longest frequent itemset.)


4. Proposed Method

Mining frequent itemsets using the proposed tree-based method can be detailed in two phases. In the initial phase, the database is materialized into a prefix tree-based data structure called the Improved FP-tree, which is an improvement over our previously proposed frequent pattern tree [50]. In the second phase, frequent itemsets are identified from the tree data structure by invoking a pattern growth approach. For any frequent item in the prefix tree header table, the mining process performs two steps.

- Step 1: Discover the conditional pattern bases from the prefix tree and construct a conditional pattern tree called the Improved Conditional FP-tree using those pattern bases.
- Step 2: Generate the frequent itemsets from the Improved Conditional FP-tree by invoking the proposed pattern growth mechanism called Improved FP-Growth.

Overall, the Improved FP-Growth method for mining frequent patterns can be described in three phases. A flowchart of the proposed Improved FP-Growth method is depicted in Fig. 4.

4.1. Improved FP-tree Construction

The proposed Improved FP-Growth algorithm customizes the traditional FP-tree construction algorithm to enhance tree-building efficiency. Unlike the FP-tree, the Improved FP-tree maintains the nodes' same-item links in reverse order. The node-link from the header table to an item, say 'x', always points to the most recently inserted Node(x) in the Improved FP-tree instead of pointing to the very first Node(x) inserted in the tree. For each new node inserted into the Improved FP-tree during construction,

Fig. 4. Flowchart of the proposed Improved FP-Growth approach.


two major elements are taken care of: (i) reverse maintenance of the same-item node links, and (ii) the header table node-link pointing to the newly inserted node. Accordingly, the same-item node traversal is removed, saving considerable time. Whenever a new node Node(y) is inserted, the header table's node-link gives direct access to the last, i.e., most recently inserted, node Node(y') in the same-item node list, without traversing the entire list for item 'y'. The same-item link of Node(y) is then set to Node(y'), and the header table node-link is updated to the newly added Node(y). The step-by-step process of Improved FP-tree construction is outlined in Algorithms 1 and 2.

Algorithm 1: Improved_FP_Tree-Construction(minsupp, D)

Algorithm 2: insertInto-Improved_FP-tree(I, Root)
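The node-link maintenance the two algorithms describe can be sketched as follows. This is our own simplification, not the authors' pseudocode: each new node's same-item link points back to the previously inserted node with the same label, and the header-table link always points to the newest node, so linking a new node costs O(1) instead of walking the whole same-item list as in the classical FP-tree:

```python
class Node:
    """Improved FP-tree node with a reverse same-item link."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}
        # Points to the previously inserted node with the same label
        # (i.e., the same-item list is kept in reverse insertion order).
        self.same_item_link = None

def insert_transaction(root, sorted_items, header_link, header_count):
    """Insert one header-table-sorted transaction into the Improved FP-tree.

    header_link[x] always references the most recently inserted Node(x),
    so attaching a new node to the same-item list is a constant-time update.
    """
    node = root
    for item in sorted_items:
        if item in node.children:
            node.children[item].count += 1
        else:
            new = Node(item, parent=node)
            new.same_item_link = header_link.get(item)  # reverse link
            header_link[item] = new                     # header points to newest
            node.children[item] = new
        header_count[item] = header_count.get(item, 0) + 1
        node = node.children[item]
```

After inserting, say, the sorted transactions [d, b] and then [b, e, a], the header-table link for 'b' points at the node created last (the child of the root), whose same-item link leads back to the earlier 'b' node under 'd'.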

The working principle of Improved FP-tree construction is illustrated in this section by considering a small transactional dataset D [Table 3] and a minimum support threshold of 20%. Like the FP-tree, the Improved FP-tree construction algorithm requires two database scans to construct the tree. Initially, the whole transactional database D is scanned once to fetch the frequency of each item. The items with their corresponding frequencies are shown in Table 4. The items are then sorted in descending order of their frequency counts. Since the minimum support is 20% and there are 10 transactions in the database, the items with a frequency count greater than or equal to the minimum support count, i.e., 2, are considered frequent and inserted into the header table. The remaining infrequent items are discarded, as illustrated in Table 5.
After inserting the frequent items into the header table in frequency-descending order, their frequency counts are set to zero. After that, the root node of the Improved FP-tree is created. In the second database scan, the database transactions are read and inserted into the Improved FP-tree one by one. The steps required to insert the transactions of D into the tree are as follows.

Table 3
Database (D).

Tid Transaction
T1 {b,d}
T2 {a,b,e}
T3 {a,c,d,e}
T4 {a,d,c}
T5 {b,d,e}
T6 {a,b,d,e}
T7 {d,f}
T8 {b,d,e}
T9 {a,b,d}
T10 {b,c,e}

Table 4
Item Frequencies.

Item Frequency Count


‘a’ 5
‘b’ 7
‘c’ 3
‘d’ 8
‘e’ 6
‘f’ 1

Table 5
Frequent Items in Descending Sorted Order.

Item Frequency Count


‘d’ 8
‘b’ 7
‘e’ 6
‘a’ 5
‘c’ 3
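The first scan described above can be reproduced directly. The small sketch below (variable names are ours) recomputes the item frequencies of Table 4 and the header-table order of Table 5 from the database in Table 3:

```python
from collections import Counter

# Database D from Table 3.
D = [{'b','d'}, {'a','b','e'}, {'a','c','d','e'}, {'a','d','c'},
     {'b','d','e'}, {'a','b','d','e'}, {'d','f'}, {'b','d','e'},
     {'a','b','d'}, {'b','c','e'}]

min_supp = 0.20                      # 20% of 10 transactions
min_count = min_supp * len(D)        # -> a count of at least 2 is required

freq = Counter(item for t in D for item in t)       # Table 4
# Frequent items in frequency-descending order.     # Table 5
header = sorted((i for i in freq if freq[i] >= min_count),
                key=lambda i: -freq[i])
```

This yields the header-table order d, b, e, a, c and drops the infrequent item 'f', matching Tables 4 and 5.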

(a) Since all the first transaction items {b,d} are frequent; therefore, they are just sorted according to the header table’s
item order, and the resulting sorted transaction is d,b. As the root node has no child branches yet, the sorted transaction
will be inserted as the Improved FP-tree’s first branch. While creating the tree nodes for each transaction item, their
counts are set to be 1. The frequency counts of the corresponding items in the header table are incremented by one,
and the header table node links are set accordingly. The resultant Improved FP-tree after inserting the first transaction
is shown in Fig. 5a.
(b) For the second transaction a,b,e, all the items are frequent. The sorted transaction is b,e,a. Since the root has no child_-
node with item label ‘b’, a new node (Node(b:1)) with item=‘b’ and count = 1 is created and inserted as a child_node of the
root. The frequency count of item ‘b’ in the header table is incremented by 1. The same item node link is set to the already
existing node pointed by the header table node link. The header table node link is then updated to the newly created node
Node(b:1). The rest of the items ‘e’ and ‘a’ are inserted as a child branch of Node(b:1), and their header table node links are
set. The Improved FP-tree after inserting the second transaction is shown in Fig. 5b.
(c) All the frequent items of the third transaction are sorted according to the header table’s item order, and the resulting
sorted transaction is {d,e,a,c}. In Fig. 5b, we can see that the Improved FP-tree’s root node has a child_node Node(d:1).
Thus, for the first item ‘d’ of the sorted transaction, the count of Node(d:1) and the corresponding frequency count in
the header table are incremented by 1. Since Node(d:2) has no child_node with the item label ‘e’, a Node
(e:1) is created and inserted as a child_node of Node(d:2). The same item link of the newly created Node(e:1) is set to
the already existing node pointed by the header table node-link, and the header table node link is then updated to
Node(e:1). Similarly, Node(a:1) and Node(c:1) are created and inserted as a child of Node(e:1). Their header table fre-
quency counts, header table node links, and same item node links are updated accordingly. The resultant Improved
FP-tree is portrayed in Fig. 5c.
(d) For the first item ‘d’ of the sorted fourth transaction {d,a,c}, the root node of Improved FP-tree in Fig. 5c already has a
child_node Node(d:2). Therefore, the count of the node and the corresponding header table frequency count are incremented
by 1. However, Node(d:3) does not have any child_node with the item label ‘a’. A new Node(a:1) is created
and inserted as a child_node of Node(d:3). Their header table frequency counts, header table node links, and same item


Fig. 5. Improved FP-tree construction for database D.

node links are updated accordingly. Similarly, item ‘c’ is also handled, and the resultant Improved FP-tree after inserting
the fourth transaction is shown in Fig. 5d.
(e) After sorting the fifth transaction as {d,b,e}, it can be seen in Fig. 5d that an existing prefix path already shares the items
‘d’ and ‘b’. Therefore, the node counts and the corresponding header table frequency counts are incremented by 1. Since
Node(b:2) has no child_node for item ‘e’, a Node(e:1) is created and inserted as a child_node of Node(b:2). After that, the header
table frequency count, header table node-link, and same item node-link of newly created Node(e:1) are updated. The
resultant Improved FP-tree is illustrated in Fig. 5e.
(f) Likewise, for the items ‘d’, ‘b’, and ‘e’ of the sorted sixth transaction {d,b,e,a}, the counts of the prefix-path-shared nodes in
Fig. 5e are simply incremented by 1. The header table frequency counts of the corresponding items are also incremented by
1. Then Node(a:1) is created and inserted as a child_node of Node(e:2), and the corresponding header table frequency
count, header table node-link, and same item node-link are updated. The resultant Improved FP-tree after inserting
the sixth transaction is illustrated in Fig. 5f.
(g) Following the same procedure, the seventh transaction {d,f} increments the support of ‘d’ only, as ‘f’ is infrequent. The eighth
transaction increases the support of each node along the existing tree path {d,b,e}. Insertion of the ninth transaction creates a Node(a:1)


after incrementing the support of the nodes in the prefix path {d,b}. The tenth transaction introduces a new node Node(c:1)
into the prefix path {b,e} after incrementing the support counts of its nodes. The final resultant Improved FP-tree is shown in Fig. 5g.

4.2. Improved FP-Growth

As mentioned above, mining the frequent items from the Improved FP-tree is performed in two steps. The steps are illustrated
in Section 4.2.1 and Section 4.2.2 by considering the Improved FP-tree (Fig. 5g) constructed from the transactional
database D with a minimum support threshold (minsupp) of 20%.

4.2.1. Construction of Improved Conditional FP-tree


After successfully constructing the Improved FP-tree, the next step is to construct a conditional pattern tree and mine the
frequent patterns from it by considering one item of the Improved FP-tree’s header table at a time. Our proposed approach
constructs an efficient conditional-pattern-tree-like data structure called ‘‘Improved Conditional
FP-tree”. Along with the identical information (item and count) maintained in FP-tree nodes, the Improved Conditional FP-
tree nodes maintain a supplementary piece of information, named relative item count (RelCount), to improve the performance of the
mining process. To construct the Improved Conditional FP-tree for an item, say ‘X’, the first step is to create the root node of the
Improved Conditional FP-tree with item = ‘X’ and count equal to the global frequency count of ‘X’ in the header table of the Improved FP-tree.
Then the item ‘X’ with its global frequency count is inserted into the first (zeroth index) position of the header table of
Improved Conditional FP-tree. The corresponding header table node link is also set to the root node. The second step is to
discover all the pattern bases of item ‘X’ from the Improved FP-tree. Here the pattern base is the path or list of tree nodes
starting from the parent node of Node(X) to the top node of the path, excluding the root node. While discovering the pattern
bases Pb of item ‘X’, if the count of Node(Xi) is Pb(Xi), then the counts of all the nodes throughout the pattern base are set to
Pb(Xi). Throughout the discovery of the Pb of item ‘X’, the frequency count of each item appearing in the pattern bases is
computed and stored in a temporary list. The list is then sorted in descending order based on the item frequencies. The items
having frequency count ≥ minsupp are inserted into the header table of the Improved Conditional FP-tree from the second
position onwards, in the same frequency-descending order as the sorted list. Thereafter, the pattern bases are consecutively sorted
with respect to the header table item order and inserted into the Improved Conditional FP-tree. The detailed procedure of
constructing the Improved Conditional FP-tree is illustrated in Algorithms 3 and 4.
To gain better insight into Algorithms 3 and 4, in this section we illustrate how the Improved Conditional FP-tree
is constructed for the item ‘a’ from the Improved FP-tree (Fig. 5g) of database D.

Algorithm 3: Improved-Conditional-FP_Tree-Construction(minsupp, Improved FP-Tree, X)


Algorithm 4: Insert-Into-Improved-Conditional_FP-tree(I, Count, Root)

(a) All the pattern bases for item ‘a’ are extracted from the Improved FP-tree of Fig. 5g. Simultaneously, the total counts of
each item present in the pattern bases are computed and maintained in a list. After that, the infrequent items are dis-
carded, and the list is sorted in descending order with respect to the item frequency counts, as described in Table 6.
(b) Thereafter, an empty header table of size equal to the length of the sorted item list, i.e., 4 (indices 0 to 3), is created. The items
of the sorted list are inserted into the header table accordingly.
(c) Subsequently, the root node with item = ‘a’, count = 5 and RelCount = 0 (root(a:5:0)) is created; the node link from
the 0th index of the header table is set to the root node, and the header table frequency count is set to 5. The resultant
tree with only the root node is shown in Fig. 6a.
(d) After successful construction of the root node and header table, the first pattern base {d,b,a} is sorted according to the
header table item order, excluding item ‘a’. Item ‘a’ is discarded because it has already been considered and inserted as the
root node of the Improved Conditional FP-tree. Then the remaining sorted items, i.e., {d,b}, are inserted as a sub-branch
of root. Initially, a node Node(d:1:0) is created and inserted as the child_node of root. The header table frequency count
and the node link for item ‘d’ are set accordingly. Similarly for item ‘b’, Node(b:1:0) is created and inserted as child_node
of node Node(d:1:0), resulting in the first prefix path of the Improved Conditional FP-tree. The Improved Conditional FP-tree
after inserting the first pattern base is shown in Fig. 6b.
(e) For the second sorted pattern base {d,b,e}, items ‘d’ and ‘b’ already exist in the first branch of Fig. 6b. Therefore, the
count of the nodes and the corresponding header table frequency counts are simply incremented. Then for the last item, a
node with item=‘e’, count = 1 and RelCount = 0 is created and inserted as the child_node of Node(b:2:0). The header table
count of item ‘e’ is incremented and the node link is set to Node(e:1:0). The resulting Improved Conditional FP-tree is
shown in Fig. 6c.
(f) By following the same Improved Conditional FP-tree node insertion procedure, the remaining pattern bases are
inserted into the tree, and the resulting Improved Conditional FP-trees are shown in Fig. 6d, Fig. 6e and Fig. 6f.
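The pattern-base extraction and conditional header-table construction of steps (a)–(b) can be sketched as follows. This is a hedged Python illustration: in this particular example every ‘a’-terminated tree path covers exactly one transaction, so the bases can be derived directly from D rather than from tree pointers:

```python
from collections import Counter

# Database D (Table 3) and the global header-table order (Table 5).
D = [["b","d"], ["a","b","e"], ["a","c","d","e"], ["a","d","c"],
     ["b","d","e"], ["a","b","d","e"], ["d","f"], ["b","d","e"],
     ["a","b","d"], ["b","c","e"]]
order = ["d", "b", "e", "a", "c"]
rank = {item: r for r, item in enumerate(order)}
MINSUPP = 2

# Pattern bases of item 'a': sort each transaction containing 'a' by the
# header-table order, then cut the path at 'a' (Table 6, left columns).
paths = [sorted((i for i in t if i in rank), key=rank.get)
         for t in D if "a" in t]
bases = [p[: p.index("a") + 1] for p in paths]

# Item frequencies inside the bases; the frequent ones form the header
# table of the Improved Conditional FP-tree, with 'a' (the root) first.
counts = Counter(i for b in bases for i in b)
cond_header = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
               if c >= MINSUPP]
print(cond_header)   # ['a', 'd', 'b', 'e']
```

The resulting bases and sorted item order coincide with Table 6.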

Table 6
Pattern Base Processing.

Pattern Base   Count   Item Count   Sorted Item Order
{d,b,a}        1       a = 5        ‘a’
{d,b,e,a}      1       b = 3        ‘d’
{d,a}          1       d = 4        ‘b’
{d,e,a}        1       e = 3        ‘e’
{b,e,a}        1

Fig. 6. Improved Conditional FP-tree construction for item ‘a’.

Fig. 7. Conventional conditional FP-tree for the item ‘a’.

The corresponding Conditional FP-tree for the item ‘a’, constructed by the conventional FP-Growth algorithm, is shown in
Fig. 7. The conventional Conditional FP-tree is a prefix tree in nature, while the proposed conditional tree is a suffix tree.
By comparing the conditional pattern trees constructed using our proposed approach and the conventional Conditional
FP-tree construction approach, represented in Fig. 6f and Fig. 7, it can be observed that our proposed approach is capable
of reducing the size of the Improved Conditional FP-tree compared to the conventional Conditional FP-tree. Moreover, a single


Improved Conditional FP-tree (Fig. 6f) is enough to mine the set of all frequent itemsets for an item (‘a’). Whenever an Improved
Conditional FP-tree has been mined, it is deleted from main memory to free up space for constructing the
Improved Conditional FP-tree of the next item in the header table of the Improved FP-tree. On the other hand, for each base
pattern of the conventional Conditional FP-tree (Fig. 7) (such as {a,b}, {a,e}, {a,b,d}, {a,e,b}, . . .), the FP-Growth algorithm
recursively constructs multiple Conditional FP-trees. Therefore, multiple Conditional FP-trees need to be maintained
in main memory at the same time to mine the set of all frequent itemsets for a single item. Recursive construction of Con-
ditional FP-tree leads to the creation of multiple nodes for a single item, which consumes a significant amount of time and
space.

4.2.2. Mining of Improved Conditional FP-tree


The prime objective of constructing the compressed Improved Conditional FP-tree is to obtain maximum scalability and
enhance the performance of the frequent itemset mining algorithm. To mine the frequent items from the Improved Conditional
FP-tree, an efficient and cost-effective pattern growth approach called Improved FP-Growth has been introduced. Along with
the Improved Conditional FP-tree structure, Improved FP-Growth algorithm additionally uses a stack to keep track of the
conditional base patterns, for which the frequent itemsets are yet to be mined from the Improved Conditional FP-tree. Each
entry in the stack maintains an array[] of items to store a base pattern, total count of the base pattern and the header table
index of the first item in the base pattern. The benefits of using the additional information ‘‘RelCount” and the stack are that they: (i)
restrict recursive conditional pattern base extraction from the conditional FP-tree, (ii) eliminate the recursive conditional
FP-tree construction for each conditional base pattern, (iii) do not generate any redundant or infrequent itemsets, and (iv)
require less space and computation time.
To generate the frequent itemsets from the Improved Conditional FP-tree, the Improved FP-Growth procedure is called for
each item, starting from the top to the lowermost item of the header table. Before calling the Improved FP-Growth procedure to
mine the frequent itemsets, the main function creates the stack with the header table item as the base pattern, along with the
corresponding header table index and its header table frequency count. Together with the minimum support threshold, the header
table and the stack, a flag value of zero is also passed to the called procedure Improved FP-Growth. A flag value of zero together with
header table index zero indicates that the procedure has been called from the main function and should generate the frequent itemset
consisting of the single item for which it was called, i.e., the item of the root node. A flag value other than zero indicates
that the procedure has been called recursively from itself. The step-by-step procedure for generating the frequent itemsets
from the Improved Conditional FP-tree is illustrated in Algorithm 5.
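The stack-driven enumeration that Algorithm 5 enables can be sketched in simplified form. The illustration below is not the authors’ Algorithm 5: it keeps the stack-entry layout (base pattern plus header-table index) but, for brevity, replaces the RelCount bookkeeping with direct support counting over the pattern bases of item ‘a’:

```python
# Pattern bases of item 'a' from Table 6 (each occurs once).
BASES = [["d","b","a"], ["d","b","e","a"], ["d","a"], ["d","e","a"], ["b","e","a"]]
ORDER = ["a", "d", "b", "e"]   # conditional header-table order
MINSUPP = 2

def support(items):
    """Number of pattern bases containing every item of the candidate."""
    return sum(1 for base in BASES if all(i in base for i in items))

# Each stack entry mirrors the paper's layout: a base pattern plus the
# header-table index from which the pattern may still be extended.
stack = [(("a",), 1)]
frequent = {("a",): support(("a",))}
while stack:                              # non-recursive pattern growth
    pattern, start = stack.pop()
    for idx in range(start, len(ORDER)):
        candidate = pattern + (ORDER[idx],)
        count = support(candidate)
        if count >= MINSUPP:              # infrequent extensions are pruned
            frequent[candidate] = count
            stack.append((candidate, idx + 1))
print(frequent)
```

The loop yields exactly [a:5], [a,d:4], [a,b:3], [a,e:3], [a,d,b:2], [a,d,e:2] and [a,b,e:2], with no redundant itemsets and no recursion.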

Fig. 8. Frequent itemset mining for item=‘a’ with respect to base pattern [‘d’].

Fig. 9. Frequent itemset mining for item=‘a’ with respect to base pattern [‘b’].


Algorithm 5: Improved_FP-Growth (minSupp, HTable, top, flag)


To better understand how Improved FP-Growth utilizes the header table, the stack and ‘‘RelCount” to efficiently generate
the frequent itemsets from the Improved Conditional tree without recursive construction of conditional pattern trees, in
this section we discuss the working principle of Improved FP-Growth when generating the frequent itemsets for item ‘a’ by
considering the Improved Conditional FP-tree (of item ‘a’) of Fig. 6f.

i. Initially, for item ‘a’, the Improved FP-Growth procedure is called with top (array[a], count = 5, index = 0) and flag = 0.
Since both the top→index and flag values are zero, the pattern [a:5] is generated from top→array[a], and then top is deleted
from main memory to free up the space, using steps 3 to 5 of Algorithm 5.
ii. For the second item in the header table, i.e., ‘d’, the procedure is called with top (array[d], count = 4, index = 1) and flag = 0.
Since index = 1, the RelCount of all the nodes along the same item paths above header table index = 1 are set to 0 by
using steps 7 to 13, as shown in Fig. 8a. Then, the RelCounts of all the nodes along all the prefix paths are updated using
steps 14 to 23. Thereafter, the header table count of index = 0 is incremented by accumulating the RelCount of each node in
the same item node list by using steps 24 to 29, as shown in Fig. 8b. Then, the top of the stack is popped out to tempTop
to generate the base pattern. Since the header table frequency count at index = 0 is 4 ≥ minSupp, a stack node with
(array[a,d], count = 4, index = 0) is created and inserted into the stack using steps 30 to 39, as shown in Fig. 8c. As the
stack contains only one entry with index = 0, therefore it is popped out and the frequent itemset [a,d:4] is generated.
iii. Next, the procedure is invoked with top (array[b], count = 3, index = 2) and flag = 0. Using steps 7 to 13, the RelCount of
all the nodes along all the prefix paths are initialized to 0, as shown in Fig. 9a. The RelCount of all the nodes along all the
prefix paths are updated using steps 14 to 23. Subsequently, the header table frequency counts of each index less than
top→index are incremented by accumulating the RelCount of each node in the same item lists of the corresponding header

Fig. 10. Frequent itemset mining for item=‘a’ with respect to base pattern [‘d’,‘b’].

Fig. 11. Frequent itemset mining for item=‘a’ with respect to base pattern [‘e’].

Fig. 12. Frequent itemset mining for item=‘a’ with respect to base pattern [‘d’,‘e’].


Fig. 13. Frequent itemset mining for item=‘a’ with respect to base pattern [‘b’,‘e’].

Table 7
Frequent Itemsets for item ‘a’ (minSupp = 20%).

Item   Frequent Itemset   Support (%)
‘d’    [a,d]              40%
‘b’    [a,b]              30%
       [a,d,b]            20%
‘e’    [a,e]              30%
       [a,d,e]            20%
       [a,b,e]            20%
table node links by using steps 24 to 29, as shown in Fig. 9b. As the header table frequency counts of both the items are
greater than or equal to minSupp, stack nodes with (array[d,b], count = 2, index = 1) and (array[a,b], count = 3, index = 0)
are created and inserted into the stack by using steps 30 to 39, as shown in Fig. 9c.
• Now the top of the stack is popped out to tempTop. Since tempTop→index = 0 and tempTop→count = 3 ≥ minSupp, a
pattern [a,b:3] is generated, and afterwards tempTop is freed.
• Since the stack is not empty, the top is again popped out to tempTop. But tempTop→index = 1, and therefore the procedure
is recursively called with tempTop (array[d,b], count = 2, index = 1) and flag = 1. Similarly, the RelCount of all
the nodes along all the prefix paths are initialised with 0, using steps 7 to 13, as shown in Fig. 10a. The RelCount of all
the nodes along all the prefix paths are updated using steps 14 to 23. Subsequently, the header table frequency counts
are updated by accumulating the RelCount of the root by using steps 24 to 29, as shown in Fig. 10b. Since the header table
frequency count of item ‘a’ is 2 ≥ minSupp, a stack node with (array[a,d,b], count = 2, index = 0) is created and inserted into
the stack by using steps 30 to 39, as shown in Fig. 10c. As the stack contains only one entry with index = 0, therefore it
is popped out and the frequent itemset [a,d,b:2] is generated.
Similarly, by applying Improved FP-Growth to the last item ‘e’ of the header table, the frequent itemsets [a,e:3], [a,d,e:2] and
[a,b,e:2] are found, as shown in Figs. 11a–11c, Figs. 12a–12c, and Figs. 13a–13c, respectively.

When the relative counts of each item above top→index (i.e., 2) of the header table are accumulated, the total count of
the index-1 item ‘d’ is 1, which is less than the minSupp value. That means item ‘d’ is infrequent with respect to the base pattern
[b,e]. Therefore, item ‘d’ does not take part in generating the candidate base pattern [d,b,e]. All the frequent itemsets for item
‘a’, mined from the Improved Conditional FP-tree (Fig. 6f) by invoking Improved FP-Growth, are presented in Table 7.
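The mined results can be cross-checked by brute force against database D. Below is a hypothetical Python check (not part of the paper's method): only the items above ‘a’ in the header-table order are considered, since itemsets pairing ‘a’ with an item below it, such as {a,c}, are mined later from that item’s own conditional tree:

```python
from itertools import combinations

# Database D (Table 3).
D = [["b","d"], ["a","b","e"], ["a","c","d","e"], ["a","d","c"],
     ["b","d","e"], ["a","b","d","e"], ["d","f"], ["b","d","e"],
     ["a","b","d"], ["b","c","e"]]
MINSUPP = 2  # 20% of 10 transactions

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in D if set(itemset) <= set(t))

# Extensions of 'a' are drawn only from the items above it in the
# header-table order: {d, b, e}.
above = ["d", "b", "e"]
table7 = {}
for r in range(1, len(above) + 1):
    for combo in combinations(above, r):
        itemset = ("a",) + combo
        count = support(itemset)
        if count >= MINSUPP:
            table7[itemset] = count * 100 // len(D)   # support in percent
print(table7)
```

The brute-force result contains exactly the six itemsets of Table 7 with the same support percentages; the only candidate pruned is {a,d,b,e}, whose support is 10%.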

5. Experimental Results Evaluation

To evaluate the performance of the proposed method, this section demonstrates the experimental results and performance
of (a) the Improved FP-tree construction method discussed in Section 4.1, and (b) the Improved Conditional FP-tree construc-
tion and its mining using the proposed Improved FP-Growth method discussed in Section 4.2.1 and Section 4.2.2, respec-
tively. The experimental results of our proposed methods are evaluated in terms of the total time taken to execute the
algorithm and the space requirement, in comparison with state-of-the-art algorithms.

5.1. Experimental Environment and Datasets

All the tree construction and pattern growth algorithms are coded in C and run on Ubuntu-18.04.2 with 2.67 GHz CPU and
8 GB main memory. The Improved FP-tree construction algorithm’s performance is evaluated by comparing its execution
time with the time taken by the conventional FP-tree construction algorithm of FP-Growth [12,41], because most of the
algorithms without a candidate generation approach use the conventional FP-tree data structure to store the database
efficiently using two database scans. The analysis of the memory requirement of both the proposed and the conventional FP-tree construction

Fig. 14. Execution time (Connect-4).

algorithms is not considered because, for a database, the proposed Improved FP-tree structure and conventional FP-tree
structure generate the same number of tree nodes for identical minimum support thresholds. The comparative performance
analysis of the proposed Improved FP-Growth has been performed with respect to both execution time and main memory
requirement based on different criteria. We have considered only the pattern mining phase of all the pattern growth algo-
rithms for an unbiased evaluation. Each algorithm is executed five times for each minimum support threshold, and the aver-
age time of all the runs is considered for objective performance analysis. To assess the significance of the Improved FP-tree
construction algorithm over other alternative tree construction algorithms, we have conducted experiments on both real and
synthetic datasets, as well as dense and sparse datasets. The datasets are retrieved from the UCI Machine Learning Repository
and FIMI Repository.

5.2. Execution Time of Improved FP-tree Construction

5.2.1. Effect of Database Size


To assess the Improved FP-tree construction algorithm’s performance with respect to database size, we have initially
conducted experiments on one dense real-life database and two sparse databases. The dense database ‘‘Connect-4”,
consists of 67557 transactions and 129 attributes. Each transaction is of average size 43, i.e., each transaction contains 43
attributes. On the other hand, the synthetic sparse database ‘‘T40I10D100K” consists of 100 K transactions and 1 K attributes.
The average length of each transaction is 40. The sparse real-life database named ‘‘Kosarak” contains 990002 transactions.
The total number of attributes in the database is 41270, and 8.1 is the average length of each transaction.
For the ‘‘Connect-4” database, to begin with, we have considered the first 15 K transactions as input and executed the tree
construction algorithm. After that, each time, the input database is extended with the next set of 15 K transactions. Finally,
the whole database is taken as input. For a fair comparison, we have considered the minimum support to be
1, i.e., if an item occurs at least once in the database, it will be taken into account to construct the tree. The time taken by the
proposed Improved FP-tree construction algorithm and the conventional FP-tree construction algorithm of FP-Growth for a
different-sized set of transactions is illustrated in Fig. 14.
The transactions of a dense database contain a small number of distinct attribute values or items. Moreover, the transactions
contain largely similar items. Therefore, there is a high possibility that a tree prefix path, or a sub-part of it, is shared by multiple
database transactions. Since most of the tree nodes are shared by multiple transactions, fewer tree nodes are created and the
same item node lists remain relatively short. A same item node list is traversed whenever a new node is created in the tree.
Although the same item node lists are relatively short for dense databases, the conventional FP-tree is still not spared from
traversing them. On the contrary, in our proposed Improved FP-tree construction algorithm, it is not required to traverse the
whole same item node list every time a new node is inserted into

Table 8
Same item node list traversals by the FP-tree construction algorithm (Connect-4).

Sl. No.   Database Size   No. of Same Item List Traversals
1         15 K            83543
2         30 K            162544
3         45 K            255182
4         60 K            324536
5         67 K            359291

Fig. 15. Execution time (T40I10D100K).

the Improved FP-tree. Maintaining the same item node links from the header table in the reverse order of the conventional FP-tree
structure prevents the traversal of the whole same item node list. As the link points to the most recently inserted, i.e., the last,
node of a same item node list, the last list node can be accessed directly with the help of the header table node link. This saves
a significant amount of time compared to the conventional FP-tree. Table 8 reports the number of same item node list
traversals performed by the FP-tree construction algorithms.
A sparse database can be considered the opposite of a dense database: the transactions contain a relatively large number
of distinct items. Therefore, the possibility of sharing a prefix path is significantly lower than in a dense database, which
increases the size (breadth) of the FP-tree and leads to relatively longer same item node lists. Consequently, in the case of
a sparse database, the performance of the FP-tree construction algorithm degrades drastically. Similarly, for the sparse
database ‘‘T40I10D100K”, we have considered the minimum support to be 1, and the input database slot size is incremented
by 20 K transactions each time. The time taken by the proposed Improved FP-tree construction algorithm and
the conventional FP-tree construction algorithm of FP-Growth for the different-sized sets of transactions of the
‘‘T40I10D100K” database is illustrated in Fig. 15. Moreover, the number of same item node list traversals performed by the
FP-tree construction algorithms is highlighted in Table 9.
Similarly, for the sparse database ‘‘Kosarak”, the minimum support is considered to be 1, and the input database slot size
is incremented by 200 K transactions consecutively. The time taken by the proposed Improved FP-tree construction algo-
rithm and the conventional FP-tree construction algorithm of FP-Growth for the different sized sets of transactions for
the ‘‘Kosarak” database is illustrated in Fig. 16. Moreover, the number of same item node list traversals performed by the
FP-tree construction algorithms is highlighted in Table 10.
From Figs. 14, 15 and 16, it can be observed that the proposed Improved FP-tree construction algorithm
outperforms the conventional FP-tree construction algorithm for both dense and sparse databases. The Improved FP-tree achieves
a substantial execution-time advantage over the conventional FP-tree construction algorithm, as it avoids the repetitive
traversal of the same item node lists every time a new node is inserted into the Improved FP-tree.
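The cost difference can be illustrated with a toy model of the same item node lists (an illustrative Python sketch, not the benchmarked C code): the conventional construction appends at the tail after a full list traversal, while the improved construction links in constant time through the header pointer:

```python
class ListNode:
    """Minimal stand-in for a tree node on a same-item node list."""
    def __init__(self):
        self.link = None

def conventional_insert(head, node):
    """Conventional FP-tree: walk to the tail before linking.
    Returns the number of link hops performed."""
    hops, cur = 0, head
    while cur.link is not None:
        cur, hops = cur.link, hops + 1
    cur.link = node
    return hops

def improved_insert(last, node):
    """Improved FP-tree: the header link already points at the most
    recently inserted node, so linking is a constant-time operation."""
    node.link = last[0]
    last[0] = node
    return 0

n = 1000
head = ListNode()
conventional_hops = sum(conventional_insert(head, ListNode()) for _ in range(n))
last = [None]
improved_hops = sum(improved_insert(last, ListNode()) for _ in range(n))
print(conventional_hops, improved_hops)  # 499500 0
```

Inserting n nodes into one list costs about n²/2 hops conventionally and none with the header-link scheme, which is why the traversal counts of Tables 8–10 vanish entirely for the Improved FP-tree.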

5.2.2. Effect of Minimum Support Threshold


In this section, we describe the results of the next set of observations, i.e., how changes in the minimum support threshold
affect the tree construction algorithms. Figs. 17–19 depict the effect of support threshold changes on execution times for the
databases ‘‘Connect-4”, ‘‘T40I10D100K” and ‘‘Kosarak”, respectively. For the ‘‘Connect-4” database, the experiments are
conducted with the support thresholds 10%, 20%, . . ., 50%. Since the ‘‘T40I10D100K” and ‘‘Kosarak” databases are sparse, we
have considered the support thresholds 1%, 2%, . . ., 5%. From Figs. 17–19, it can be seen that the proposed Improved FP-tree
construction algorithm remarkably outperforms the conventional FP-tree construction algorithm for varying support
thresholds. The lower the support threshold, the higher the number of frequent items,

Table 9
Same item node list traversals by the FP-tree construction algorithm (T40I10D100K).

Sl. No.   Database Size   No. of Same Item List Traversals
1         20 K            724562
2         40 K            1439489
3         60 K            2149355
4         80 K            2858308
5         100 K           3562960

Fig. 16. Execution time (Kosarak).

Table 10
Same item node list traversals by the FP-tree construction algorithm (Kosarak).

Sl. No.   Database Size   No. of Same Item List Traversals
1         200 K           1045427
2         400 K           2088278
3         600 K           3087006
4         800 K           4097851
5         990 K           5029590
and thus the larger the tree. Therefore, for sparse databases with smaller support threshold values, the performance of the
conventional FP-tree construction algorithm degrades severely, whereas even for very low support threshold values the
advantage of the Improved FP-tree construction algorithm remains prominent. Tables 11–13 report the number of same item
list traversals performed by the FP-tree construction algorithm of the conventional FP-Growth algorithm. In contrast, our
proposed Improved FP-tree construction algorithm does not need to traverse the same item lists at all; this is achieved by
efficiently utilizing the header table same item node links. From the analysis, it can be observed that the proposed algorithm
performs better on both dense and sparse databases.

5.3. Evaluation of Improved FP-Growth

The real and synthetic databases used to evaluate the performance of the proposed pattern growth algorithm are presented
in Table 14.

Fig. 17. Execution time (Connect-4).


Fig. 18. Execution time (T40I10D100K).

Fig. 19. Execution time (Kosarak).

5.3.1. Execution Time Evaluation


To assess the effectiveness of our proposed Improved FP-Growth approach, we have performed experiments on several
databases and compared the performance with three well-known approaches, namely Apriori [11], FP-Growth [12] and
COFI-tree mining [43]. From the experimental analysis discussed in Section 5.2, it is observed that the proposed Improved
FP-tree construction algorithm consumes a negligible amount of time compared to FP-Growth’s FP-tree construction
algorithm. Both the FP-Growth and COFI-tree mining algorithms construct the frequent pattern tree using the same
tree construction algorithm. Therefore, to perform a fair comparison and to get a better view, only the execution times consumed
by the pattern growth phases of all the approaches are considered for comparison. Indeed, in most cases the time taken by the
FP-tree construction algorithm alone of FP-Growth and COFI-tree mining is much higher
than the overall time taken by the Improved FP-tree construction and Improved FP-Growth phases of the proposed method.
To endorse the analysis, we have conducted experiments on the real and synthetic databases described in Table 14 with
varying minimum support threshold settings. The execution time observations of each algorithm for mining the frequent
patterns from each database with different minimum support thresholds are illustrated through Fig. 20a to Fig. 20c, respec-
tively. From these figures, it can be observed that our proposed Improved FP-Growth method outperforms all the competing
algorithms for all the databases. Due to the requirement of multiple database scans, the generation of an enormous number of
candidate itemsets and their test/prune mechanism, the Apriori algorithm consumes a huge amount of time. For a very low
minimum support, the performance of the Apriori algorithm degrades drastically. On the other hand, the recursive construction
of conditional FP-trees and their repetitive mining by FP-Growth to obtain the frequent patterns of a single item consume a
significant amount of time. Moreover, merging of redundant itemsets is also a costly task. Although COFI-tree mining avoids
the recursive construction of conditional FP-trees, it processes one prefix path of the COFI-tree at a time. For
each prefix path, the nodes are maintained in a temporary list; thereafter, all possible itemsets are generated. Like FP-Growth,
COFI-tree mining also suffers from redundant itemset generation. In addition to eliminating the recursive conditional tree construc-


Table 11
Same item node list traversals by the FP-tree construction algorithm (Connect-4).

Sl. No.   Support Threshold   No. of Same Item List Traversals
1         10%                 249981
2         20%                 137239
3         30%                 37005
4         40%                 13054
5         50%                 5715
Table 12
Same item node list traversals by the FP-tree construction algorithm (T40I10D100K).

Sl. No.   Support Threshold   No. of Same Item List Traversals
1         1%                  3473940
2         2%                  3259406
3         3%                  2955170
4         4%                  2574171
5         5%                  2240743
Table 13
Same item node list traversals by the FP-tree construction algorithm (Kosarak).

Sl. No.   Support Threshold   No. of Same Item List Traversals
1         1%                  293355
2         2%                  72165
3         3%                  9498
4         4%                  3423
5         5%                  804
Table 14
Databases used.

Database No. of Transactions No. of Items Average tran. length Type


T10I4D100K 100000 870 10 Sparse
Retail 88162 21387 10.3 Sparse
Chess 3196 75 37 Dense

tion, our method has intelligently managed to generate the frequent itemsets only. Improved FP-Growth does not generate
any redundant itemset thus eliminates the merging phase.
As mentioned above, Apriori algorithm suffers from the huge number of candidate itemset generation. Therefore, the
maintenance of candidate itemsets consumes huge amount of memory. For dense database ‘‘Chess” with higher dimension-
ality, the Apriori algorithm could not mine all the frequent patterns for lower minimum support values.
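The generate-and-test cycle that makes Apriori costly can be illustrated with a minimal Python sketch (not the authors' implementation; the data and names here are hypothetical). Note how every level requires another full pass over the database:

```python
def apriori(transactions, min_count):
    """Level-wise Apriori: generate candidate itemsets, then re-scan the
    whole database at every level to count and prune them."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = []
    for i in items:  # level 1: frequent single items (first full DB scan)
        c = sum(1 for t in transactions if i in t)
        if c >= min_count:
            frequent[frozenset([i])] = c
            level.append(frozenset([i]))
    k = 2
    while level:
        # Join step: build size-k candidates from the frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for cand in candidates:  # test step: one more full DB scan per level
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                frequent[cand] = c
                level.append(cand)
        k += 1
    return frequent
```

The repeated `sum(... for t in transactions ...)` scans and the cross-join candidate set are exactly the costs the comparison above attributes to Apriori.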

5.3.2. Memory Requirement Evaluation


In this section, we demonstrate the experiments conducted to assess the memory consumed by Improved FP-Growth relative to FP-Growth and COFI-tree. The comparison of memory consumption among the different methods for the sparse and dense databases of Table 14 is presented graphically in Fig. 21a, Fig. 21b and Fig. 21c, respectively.
The memory consumption of pattern tree based approaches depends on the structure of the conditional pattern tree used to generate the frequent patterns. Most FP-Growth approaches recursively construct the conditional pattern trees in frequency-independent order, which reduces the possibility of prefix path sharing and inflates the tree size. Moreover, the COFI-tree mining approach additionally creates a list for each prefix path of the COFI-tree to generate the itemsets, and maintaining the redundant itemsets also requires a considerable amount of space. On the other hand, our proposed approach neither constructs the conditional FP-trees recursively nor uses any additional list to generate the itemsets, and it generates no redundant itemsets. Therefore, from Fig. 21a, Fig. 21b and Fig. 21c it can be seen that our Improved FP-Growth consumes relatively less memory for databases of any size, whether sparse or dense.
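The effect of frequency-ordered insertion on prefix sharing can be sketched with a small prefix-tree node count (a simplified illustration, not the paper's Improved FP-tree; the transactions are hypothetical):

```python
def count_trie_nodes(transactions, order):
    """Insert each transaction (items sorted by the given ranking) into a
    prefix tree and count the nodes created, excluding the root."""
    root = {}
    nodes = 0
    for t in transactions:
        cur = root
        for item in sorted(t, key=order):
            if item not in cur:
                cur[item] = {}
                nodes += 1
            cur = cur[item]
    return nodes

transactions = [['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'b', 'c']]
freq = {}
for t in transactions:
    for i in t:
        freq[i] = freq.get(i, 0) + 1
# Frequency-descending order puts the common item 'a' first, so prefixes share.
desc = count_trie_nodes(transactions, order=lambda i: (-freq[i], i))
# Frequency-ascending order puts rare items first, inflating the tree.
asc = count_trie_nodes(transactions, order=lambda i: (freq[i], i))
print(desc, asc)
```

On this toy database the frequency-descending tree needs 5 nodes while the ascending one needs 8, which is the inflation the paragraph above describes for frequency-independent conditional trees.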

5.3.3. Correctness and Completeness Analysis


Whenever a new approach is developed to solve a problem, it is important to verify that it produces complete and correct output. To establish the correctness and completeness of the proposed

Fig. 20. Execution time analysis for databases T10I4D100K, Retail, and Chess.

approach, we have conducted an experiment comparing the set of frequent patterns generated by our proposed Improved FP-Growth approach with the frequent patterns generated by FP-Growth and COFI-tree for different support threshold values. FP-Growth is considered the benchmark algorithm, as it generates the complete and correct set of frequent patterns. To perform the analysis, we have experimented on the databases listed in Table 14. While conducting the experiment described in Section 5.3.1, the total number of frequent patterns generated by the different algorithms for varying support threshold settings was recorded in Table 15. From the experiment, as well as from Table 15, it can be seen that our proposed Improved FP-Growth algorithm generates the same number of frequent patterns as the FP-Growth and COFI-tree algorithms.
To assess the correctness of the Improved FP-Growth, we have to check whether the set of frequent patterns generated by our proposed approach is identical to the set generated by FP-Growth. That is, each frequent pattern generated by Improved FP-Growth must be present in the set of frequent patterns generated by FP-Growth, and, when found, the support counts of the two patterns must be identical. However, because the algorithms use different tree traversal approaches and itemset generation mechanisms, the itemsets are generated in different orders, and the order of items within a frequent pattern also differs. Therefore, we have implemented a program in C which takes the complete sets of frequent patterns generated by both approaches, sequentially considers one pattern from the Improved FP-Growth output at a time, and searches for it in the set of frequent patterns generated by FP-Growth. If the pattern is found, the support counts of the two patterns are compared to check whether they are identical. The program is executed for each pair of frequent pattern sets generated by the two algorithms
for the same support threshold values. From the experiment, it is observed that all the frequent patterns generated by the
proposed Improved FP-Growth approach are present in the set of frequent patterns generated by the FP-Growth algorithm.
Hence, it proves that the proposed Improved FP-Growth generates the set of frequent patterns which is complete and
correct.
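The paper's checker is implemented in C; the same order-insensitive comparison can be sketched in Python (function names are illustrative):

```python
def canonical(patterns):
    """Normalize an algorithm's output: the order of items inside an itemset
    and the order in which itemsets were emitted are both irrelevant.
    `patterns` is an iterable of (items, support) pairs."""
    return {frozenset(items): support for items, support in patterns}

def same_output(patterns_a, patterns_b):
    """True iff both pattern sets contain identical itemsets with
    identical support counts."""
    return canonical(patterns_a) == canonical(patterns_b)
```

Mapping each itemset to a `frozenset` keyed dictionary makes the check independent of traversal order, mirroring the correctness test described above.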

6. Case study: analysis of adverse disease agents using Improved FP-Growth

It is extremely important to identify life-threatening adverse diseases early and to reduce the risk level through proper diagnosis. Therefore, to establish the usefulness of our proposed method in a real-life scenario, we have conducted an
experiment on two adverse disease databases using our proposed method to predict the agents contributing to these dis-

Fig. 21. Memory consumption analysis for databases T10I4D100K, Retail, and Chess.

Table 15
Number of frequent patterns generated by different algorithms for varying support threshold values.

Dataset Algorithms 1% 2% 3% 4% 5%
FP-Growth 385 155 60 26 10
T10I4D100K COFI-tree 385 155 60 26 10
Improved FP-Growth 385 155 60 26 10
Dataset Algorithms .5% 1% 1.5% 2% 2.5%
FP-Growth 580 159 84 55 38
Retail COFI-tree 580 159 84 55 38
Improved FP-Growth 580 159 84 55 38
Dataset Algorithms 50% 55% 60% 65% 70%
FP-Growth 1272932 574998 254944 111239 48731
Chess COFI-tree 1272932 574998 254944 111239 48731
Improved FP-Growth 1272932 574998 254944 111239 48731

eases and also to analyse the survival possibility. This experiment has been conducted to identify the frequently co-occurring
patterns among the patient information/attributes and the disease. To support this experiment, two popular medical databases, viz., Wisconsin Breast Cancer and Hepatitis, archived in the UCI Machine Learning Repository, have been used to generate the frequent patterns using our proposed approach. These frequent patterns might assist medical professionals in providing an effective diagnosis to patients.

6.1. Analysis of breast cancer disease agents and risk identification

Breast cancer is the most common disease diagnosed among women. In this section, we have used our method to analyse
the different agents responsible for breast cancer.

6.1.1. Database description


The Wisconsin Breast Cancer database contains 699 instances, each consisting of 32 attribute values, of which only 10 are considered for the diagnostic experiments. The attributes, together with their descriptions and respective value ranges, are illustrated in Table 16. The last attribute of Table 16, i.e., Class, has been assigned two class labels

Table 16
Information of Wisconsin Breast Cancer database.

Sl. No. Attribute Description Domain

1 Clump Thickness: Assesses if cells are mono- or multi-layered 1–10
2 Uniformity of Cell Size: Evaluates the consistency in size of the cells in the sample. 1–10
3 Uniformity of Cell Shape: Estimates the equality of cell shapes and identifies marginal variances. 1–10
4 Marginal Adhesion: Quantifies how much cells on the outside of the epithelium tend to stick together. 1–10
5 Single Epithelial Cell Size: Relates to cell uniformity, determines if epithelial cells are significantly enlarged. 1–10
6 Bare Nuclei: Calculates the proportion of the number of cells not surrounded by cytoplasm to those that are. 1–10
7 Bland Chromatin: Rates the uniform ‘‘texture” of the nucleus in a range from fine to coarse. 1–10
8 Normal Nucleoli: Determines whether the nucleoli are small and barely visible or larger, more visible, and more plentiful. 1–10
9 Mitoses: Describes the level of mitotic (cell reproduction) activity. 1–10
10 Class Identifies the stage of cancer Benign,
Malignant

Fig. 22. Execution time and memory consumption analysis for Wisconsin Breast Cancer database.

as benign and malignant, identifying the stage of breast cancer: 65.5% of the instances are labelled benign and the remaining 34.5% malignant. Depending on the varying characteristics of the cells, experts have assigned every attribute value except Class within the range 1 to 10. The database contains 16 instances with missing values; the missing value of an attribute in an instance is replaced by the most frequently occurring value of the same attribute in the database.
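The mode-imputation step described above can be sketched as follows (a minimal illustration, not the paper's code; the `'?'` marker for missing values follows the usual UCI file convention and is an assumption here):

```python
from collections import Counter

MISSING = '?'  # assumed marker for missing values, as used in the UCI data files

def impute_mode(rows):
    """Replace every missing value with the most frequent value observed
    for the same attribute (column) across the whole database."""
    columns = list(zip(*rows))
    modes = [Counter(v for v in col if v != MISSING).most_common(1)[0][0]
             for col in columns]
    return [[modes[j] if v == MISSING else v for j, v in enumerate(row)]
            for row in rows]
```

Each column's mode is computed once over the non-missing values, then substituted wherever the marker occurs.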

6.1.2. Experimental Analysis


This section elaborates the performance analysis of our proposed approach, comparing it with the FP-Growth and COFI-tree approaches on the Wisconsin Breast Cancer database. The comparative analysis is based on the time spent by the different approaches to generate the frequent patterns, the memory consumed, and the total number of frequent patterns generated.

Table 17
Performance comparison of different algorithms for Breast Cancer database for different support thresholds.

Algorithm 1% 2% 3% 4% 5%
Tree construction time (in seconds) .003620 .003504 .003439 .003288 .003096
FP-Growth Mining time (in seconds) 1.462497 .941128 .800369 .971905 .390043
Itemsets generated 11434 6286 4133 3381 2584
Tree construction time (in seconds) .003620 .003504 .003439 .003288 .003096
COFI-tree Mining time (in seconds) .411376 .191482 .111939 .093265 .073637
Itemsets generated 11434 6286 4133 3381 2584
Tree construction time (in seconds) .003584 .003455 .003303 .003207 .002955
Improved FP-Growth Mining time (in seconds) .071468 .041421 .027425 .033325 .015945
Itemsets generated 11434 6286 4133 3381 2584


[a] Execution time: Since the database is small, we have assessed the performance of our proposed approach by varying the support threshold. To measure the impact of the support threshold on execution time, we considered all the instances of the Wisconsin Breast Cancer database and executed the algorithms with gradually increasing support thresholds. Based on the support threshold, different sets of frequent patterns are generated. Fig. 22a illustrates the comparison of the algorithms based on execution time for varying support threshold values; from it, it can be observed that Improved FP-Growth outperformed the other approaches for every support threshold.
[b] Memory usage: This experiment assesses the memory consumed by the different algorithms while mining the frequent patterns from the Wisconsin Breast Cancer database for different support thresholds. Fig. 22b shows the amount of space consumed by each algorithm. From the figure, it can be observed that COFI-tree mining outperformed Apriori and FP-Growth for all support thresholds; however, owing to the additional lists it maintains for the individual prefix paths of the COFI-trees, it consumes relatively more space than our proposed approach. Therefore, Improved FP-Growth also has a clear advantage over the other frequent pattern mining approaches with respect to space.
[c] Frequent itemsets generated: The most essential part of the experiment is to analyse the itemsets produced by the different algorithms. The aim of our experiment is to pinpoint the useful patterns with respect to the agents responsible for breast cancer. These informative patterns may help medical professionals to gain deeper insight into breast cancer and to estimate the risk level. To analyse the correctness and completeness of the frequent patterns generated from the Wisconsin Breast Cancer database by our proposed approach, we have compared the results with the FP-Growth and COFI-tree frequent itemset mining algorithms. A comparison of the tree construction time, total mining time and total number of itemsets generated from the Wisconsin Breast Cancer database for varying support thresholds is illustrated in Table 17.
Itemset Analysis: The usefulness or significance of a frequent itemset is estimated with the help of an interestingness measure, i.e., its support. In this experiment, to extract the most interesting itemsets from among the huge number of frequent itemsets, we have considered only the frequent itemsets co-occurring with one of the two class labels, i.e., ‘‘Class = 2” (Benign) or ‘‘Class = 4” (Malignant). Our proposed approach is capable of generating the complete set of frequent itemsets by constructing the Improved FP-tree and mining the patterns with the minimum support threshold set to 1. But, as shown in Table 17, for a minimum support threshold of 1% it generates 11434 itemsets, of which 5081 co-occur with ‘‘Class = 2” and 595 with ‘‘Class = 4”. However, most of these itemsets have very low support (interestingness) values and only a few have very high interestingness values. Therefore, the minimum support threshold must be chosen intelligently so that, as far as possible, only the itemsets with high interestingness are generated. On the other hand, 65.5% of the instances of the Wisconsin Breast Cancer database are classified as benign and the remaining 34.5% as malignant. Thus, if we consider a support threshold value less than 34.5%, the algorithm will unnecessarily generate a huge number of itemsets with low interestingness values co-occurring with ‘‘Class = 2”. On the contrary, if we choose a support threshold value greater than

Table 18
Frequent itemsets generated by Improved FP-Growth from Breast Cancer database with respect to ‘‘Class = 4 (Malignant)”.

Sl. No. Minimum Support Itemset Class Support (%)


1 Class = 4: 241 34.47
2 Class = 4 Mitoses = 1: 134 19.17
3 Class = 4 NorNucleoli = 1: 41 5.86
4 BlChromatin = 3 Class = 4: 36 5.15
5 BNuclei = 10 Class = 4: 129 18.45
6 BNuclei = 10 Class = 4 Mitoses = 1: 68 9.72
7 ClThickness = 5 Class = 4: 45 6.43
8 BlChromatin = 7 Class = 4: 66 9.44
9 BlChromatin = 7 Class = 4 BNuclei = 10: 39 5.57
10 SECellSize = 3 Class = 4: 43 6.15
11 5% ClThickness = 10 Class = 4: 69 9.87
12 ClThickness = 10 Class = 4 BNuclei = 10: 39 Malignant 5.57
13 ClThickness = 10 Class = 4 Mitoses = 1: 35 5
14 UniCellSize = 10 Class = 4: 67 9.58
15 UniCellSize = 10 Class = 4 BNuclei = 10: 35 5
16 NorNucleoli = 10 Class = 4: 61 8.72
17 UniCellShape = 10 Class = 4: 58 8.29
18 UniCellShape = 10 Class = 4 UniCellSize = 10:48 6.86
19 UniCellShape = 10 Class = 4 BNuclei = 10: 36 5.15
20 MarAdhesion = 10 Class = 4: 54 7.72
21 MarAdhesion = 10 Class = 4 BNuclei = 10: 41 5.86
22 SECellSize = 4 Class = 4: 41 5.86
23 ClThickness = 8 Class = 4: 42 6
24 SECellSize = 6 Class = 4: 39 5.57


Table 19
Frequent itemsets generated by Improved FP-Growth from Breast Cancer database with respect to ‘‘Class = 2 (Benign)”.

Sl. No. Minimum Support Itemset Class Support (%)


1 Class = 2: 458 65.52
2 Class = 2 Mitoses = 1: 445 63.66
3 NorNucleoli = 1 Class = 2: 402 57.51
4 NorNucleoli = 1 Mitoses = 1 Class = 2: 394 56.36
5 BNuclei = 1 Class = 2: 401 57.36
6 BNuclei = 1 Mitoses = 1 Class = 2: 394 56.36
7 BNuclei = 1 Class = 2 NorNucleoli = 1: 363 51.93
8 50% BNuclei = 1 Mitoses = 1 Class = 2 NorNucleoli = 1: 357 Benign 51.07
9 MarAdhesion = 1 Class = 2: 375 53.64
10 MarAdhesion = 1 Mitoses = 1 Class = 2: 369 52.78
11 SECellSize = 2 Class = 2: 363 51.93
12 SECellSize = 2 Mitoses = 1 Class = 2: 354 50.64
13 UniCellSize = 1 Class = 2: 380 54.36
14 UniCellSize = 1 Class = 2 Mitoses = 1: 374 53.50
15 UniCellSize = 1 Class = 2 NorNucleoli = 1: 355 50.87
16 UniCellShape = 1 Class = 2: 351 50.21

34.5% and less than 65.5%, it will not generate a single itemset co-occurring with ‘‘Class = 4”. Hence, to deal with this problem, we have executed the algorithm twice, with the two minimum support threshold values 5% and 50%, to generate the frequent itemsets co-occurring with ‘‘Class = 4” and ‘‘Class = 2”, respectively. The frequent itemsets generated by the FP-Growth, COFI-tree mining and Improved FP-Growth algorithms are presented in Tables 18, 19 and 24–27, respectively.
From Tables 18, 24 and 26 and Tables 19, 25 and 27, it can be observed that all three algorithms generate the same identical frequent itemsets for the given minimum support threshold values (5% and 50%). However, owing to their different tree traversal approaches and itemset generation mechanisms, the algorithms generate the itemsets in different orders. For the analysis, we have considered the set of itemsets generated by the Improved FP-Growth algorithm, elucidated in Tables 18 and 19. The support of each itemset is computed with respect to the total number of transactions in the database, i.e., it is the global support. For a minimum support threshold of 5%, all three algorithms have generated the itemsets with
respect to ‘‘Class = 4 (Malignant)”, containing the frequent items ‘‘Mitoses = 1”, ‘‘Normal Nucleoli (NorNucleoli)=1”,
‘‘Bland Chromatin (BlChromatin)=3”, ‘‘Bare Nuclei (BNuclei)=10”, ‘‘Clump Thickness (ClThickness)=5”, ‘‘BlChromatin =
7”, ‘‘Single Epithelial Cell Size (SECellSize)=3”, ‘‘ClThickness = 10”, ‘‘Uniformity of Cell Size (UniCellSize)=10”, ‘‘NorNucle-
oli = 10”, ‘‘Uniformity of Cell Shape (UniCellShape)=10”, ‘‘Marginal Adhesion (MarAdhesion)=10”, ‘‘SECellSize = 4”,
‘‘ClThickness = 8” and ‘‘SECellSize = 6”. Itemset 1 of Table 18 indicates that approximately 34.5% of breast cancer patients suffer from the severe condition. Itemset 2, ‘‘Class = 4 Mitoses = 1: 134”, shows that 19% of patients with Mitoses = 1 have a tumor in the malignant condition. Itemset 3 indicates that if NorNucleoli = 1, there is only a 5.86% chance of a severe tumor condition. Similarly, itemset 4 affirms that only about 5% of patients with BlChromatin = 3 are in the malignant tumor condition. Itemset 5 claims that if BNuclei = 10, there is a comparatively high possibility (18.45%) of a malignant tumor condition. On the other hand, itemset 6 shows that if BNuclei = 10 and Mitoses = 1, the possibility of a severe tumor condition reduces to 9.72% (compared to itemsets 2 and 5). Likewise, itemsets 7 and 8 indicate that merely 6% or 9% of patients have a malignant tumor condition if ClThickness = 5 or BlChromatin = 7, respectively. Itemset 9 indicates that only 5.57% of breast cancer patients suffer from a severe tumor condition when BlChromatin = 7 and BNuclei = 10. Similarly, we can analyse the severity of the tumor and identify the agents or factors of breast cancer by carefully exploring the rest of the itemsets of Table 18.
For a minimum support threshold of 50%, the frequent itemsets extracted from the Wisconsin Breast Cancer database with respect to ‘‘Class = 2 (Benign)” by all three approaches contain the frequent items ‘‘Mitoses = 1”, ‘‘NorNucleoli = 1”, ‘‘BNuclei = 1”, ‘‘MarAdhesion = 1”, ‘‘SECellSize = 2”, ‘‘UniCellSize = 1” and ‘‘UniCellShape = 1”, as illustrated in Table 19. Itemset 1 depicts that 65.5% of patients suffer from breast cancer in the benign or mild condition. Itemset 2 indicates that 64% of patients have a tumor in the benign state if Mitoses = 1. Itemset 3 indicates that if NorNucleoli = 1, there is a 57.5% chance of the tumor being in the benign state. Itemset 4 affirms that 56% of patients have a tumor in the early stage if NorNucleoli = 1 and Mitoses = 1. Itemsets 5 and 6 indicate that more than 56% of patients have a tumor at a low risk level if BNuclei = 1, or BNuclei = 1 and Mitoses = 1. Itemsets 7 and 8 assure that 51% of patients have a tumor in the normal state if BNuclei = 1 and NorNucleoli = 1, or BNuclei = 1, NorNucleoli = 1 and Mitoses = 1. Though the tumor is in the benign state, rigorous analysis of the itemsets may help to identify the agents responsible for boosting the risk level.
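The support percentages in Tables 18 and 19 are global supports: the itemset's occurrence count divided by the total number of instances (the instance count of 699 used below is inferred from Table 19, where ‘‘Class = 2: 458” corresponds to 65.52%):

```python
def global_support(count, n_instances):
    """Global support of an itemset, as a percentage of all instances."""
    return 100.0 * count / n_instances

# Itemset "Class = 2" occurs in 458 instances of the 699-instance database.
print(round(global_support(458, 699), 2))  # 65.52, matching Table 19
```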

6.2. Survivability Analysis of hepatitis disease

Hepatitis is an inflammatory condition of the liver, primarily caused by viral infection. Besides viral infection, hepatitis can also result from autoimmune disease, medications, drugs, toxins and alcohol. In this study, we have applied our proposed method to identify the frequently occurring itemsets/patterns between the

Table 20
Information of Hepatitis database.

Sl.No. Attribute Domain


1 CLASS DIE, LIVE
2 AGE 10–80
3 SEX MALE, FEMALE
4 STEROID NO, YES
5 ANTIVIRALS NO, YES
6 FATIGUE NO, YES
7 MALAISE NO, YES
8 ANOREXIA NO, YES
9 LIVER BIG NO, YES
10 LIVER FIRM NO, YES
11 SPLEEN PALPABLE NO, YES
12 SPIDERS NO, YES
13 ASCITES NO, YES
14 VARICES NO, YES
15 BILIRUBIN 0.39–4.00
16 ALK PHOSPHATE 33–250
17 SGOT 13–500
18 ALBUMIN 2.1–6.0
19 PROTIME 10–90
20 HISTOLOGY NO, YES

Fig. 23. Execution time and memory consumption analysis for Hepatitis database.

patient attributes and the class attribute, which might assist medical personnel in identifying the agents responsible for hepatitis and in predicting the chances of survival.

6.2.1. Database description


The Hepatitis database consists of 155 instances, each with 20 attributes. The instances are classified into two classes, ‘‘Die” and ‘‘Live”. Out of the 155 instances, 123 are classified as ‘‘Live” and the remaining 32 as ‘‘Die”. The database also contains 167 missing values; these are handled in the same way as described in Section 6.1.1. The attributes of the Hepatitis database and their corresponding value domains are illustrated in Table 20.

Table 21
Performance comparison of different algorithms for Hepatitis database.

Algorithm 1% 2% 3% 4% 5%
Tree construction time (in seconds) .096402 .093507 .085246 .009538 .006750
FP-Growth Mining time (in seconds) 6.710334 3.341131 1.139476 1.117934 1.071352
Itemsets generated 3242 1688 1004 1004 279
Tree construction time (in seconds) .096402 .093507 .085246 .009538 .006750
COFI-tree Mining time (in seconds) 1.713526 .130188 .072815 .071459 .027818
Itemsets generated 3242 1688 1004 1004 279
Tree construction time (in seconds) .008771 .005351 .002939 .002787 .002764
Improved FP-Growth Mining time (in seconds) .049558 .024462 .008323 .007657 .003969
Itemsets generated 3242 1688 1004 1004 279


6.2.2. Experimental Analysis

[a] Execution time: Since the database is small, we have assessed the performance of our proposed approach by varying the support threshold. To measure the impact of the support threshold on execution time, we considered all the instances of the Hepatitis database and executed the algorithms with gradually increasing support thresholds. Based on the support threshold, different sets of frequent patterns are generated. Fig. 23a illustrates the comparison of the algorithms based on execution time for varying support threshold values; from it, it can be observed that Improved FP-Growth outperformed the other approaches for every support threshold.
[b] Memory usage: This experiment assesses the memory consumed by the different algorithms while mining the frequent patterns from the Hepatitis database for different support thresholds. Fig. 23b shows the amount of space consumed by each algorithm. From the figure, it can be seen that COFI-tree mining outperformed Apriori and FP-Growth for all support thresholds; however, owing to the additional lists it maintains for the individual prefix paths of the COFI-trees, it consumes relatively more space than our proposed approach. Therefore, Improved FP-Growth also has a clear advantage over the other frequent pattern mining approaches with respect to space.
[c] Frequent itemsets generated: The Improved FP-Growth algorithm extracts the complete set of frequent itemsets without generating redundant itemsets. The aim of this experiment is to extract useful patterns with respect to the agents responsible for hepatitis. To analyse the correctness and completeness of the frequent patterns generated from the Hepatitis database by Improved FP-Growth, we have compared its results with those of the FP-Growth and COFI-tree frequent itemset mining algorithms. A comparison of the tree construction time, total mining time and total number of itemsets generated from the Hepatitis database for varying support thresholds is illustrated in Table 21.
Itemset analysis: This experiment analyses the itemsets produced from the Hepatitis database by the algorithms under consideration, including our proposed approach. Its main aim is to analyse these informative itemsets to gain better insight into hepatitis and to estimate the survivability of hepatitis patients. As in the previous experiment, we have used two minimum support threshold values to avoid generating unnecessary, less interesting itemsets. To analyse the survivability of the hepatitis patients, we have extracted the most interesting frequent itemsets, namely those co-occurring with either of the two class labels, i.e., ‘‘Class = 1” (Die) or ‘‘Class = 2” (Live). Out of the 155 instances of the Hepatitis database, 123 are classified as ‘‘Live” and the remaining 32 as ‘‘Die”. Therefore, using the same strategy as in the previous experiment, we have set the two minimum support thresholds to 15% and 60% to generate the frequent itemsets co-occurring with ‘‘Class = 1” and ‘‘Class = 2”, respectively. The frequent itemsets generated by the FP-Growth, COFI-tree mining and Improved FP-Growth algorithms from the Hepatitis database are presented in Tables 22, 23 and 28–31, respectively. For the analysis, we have considered the set of itemsets generated by the Improved FP-Growth algorithm, elucidated in Tables 22 and 23. The frequent itemsets generated for a minimum support
threshold 15% are composed of frequent items ‘‘SEX = 1 (MALE)”, ‘‘FATIGUE = 1 (NO)”, ‘‘ANTIVIRALS = 2 (YES)”, ‘‘LIVER-

Table 22
Frequent itemsets generated by Improved FP-Growth from Hepatitis database with respect to ‘‘CLASS = 1 (DIE)”.

Sl. No. Minimum Support Itemset Class Support (%)


1 CLASS = 1: 32 20.64
2 CLASS = 1 SEX = 1: 32 20.64
3 CLASS = 1 FATIGUE = 1: 30 19.35
4 CLASS = 1 SEX = 1 FATIGUE = 1: 30 19.35
5 CLASS = 1 ANTIVIRALS = 2: 30 19.35
6 CLASS = 1 SEX = 1 ANTIVIRALS = 2: 30 19.35
7 CLASS = 1 FATIGUE = 1 ANTIVIRALS = 2: 28 18.06
8 CLASS = 1 SEX = 1 FATIGUE = 1 ANTIVIRALS = 2: 28 18.06
9 CLASS = 1 LIVER-BIG = 2: 29 18.70
10 CLASS = 1 SEX = 1 LIVER-BIG = 2: 29 18.70
11 15% CLASS = 1 FATIGUE = 1 LIVER-BIG = 2: 27 Die 17.41
12 CLASS = 1 SEX = 1 FATIGUE = 1 LIVER-BIG = 2: 27 17.41
13 CLASS = 1 ANTIVIRALS = 2 LIVER-BIG = 2: 27 17.41
14 CLASS = 1 SEX = 1 ANTIVIRALS = 2 LIVER-BIG = 2: 27 17.41
15 CLASS = 1 FATIGUE = 1 ANTIVIRALS = 2 LIVER-BIG = 2: 25 16.12
16 CLASS = 1 SEX = 1 FATIGUE = 1 ANTIVIRALS = 2 LIVER-BIG = 2: 25 16.12
17 CLASS = 1 HISTOLOGY = 2: 25 16.12
18 CLASS = 1 SEX = 1 HISTOLOGY = 2: 25 16.12
19 CLASS = 1 ANTIVIRALS = 2 HISTOLOGY = 2: 24 15.48
20 CLASS = 1 SEX = 1 ANTIVIRALS = 2 HISTOLOGY = 2: 24 15.48
21 CLASS = 1 LIVER-BIG = 2 HISTOLOGY = 2: 24 15.48
22 CLASS = 1 SEX = 1 LIVER-BIG = 2 HISTOLOGY = 2: 24 15.48


Table 23
Frequent itemsets generated by Improved FP-Growth from Hepatitis database with respect to ‘‘CLASS = 2 (LIVE)”.

Sl. No. Minimum Support Itemset Class Support (%)


1 CLASS = 2: 123 79.35
2 CLASS = 2 ASCITES = 2: 117 75.48
3 CLASS = 2 VARICES = 2: 116 74.83
4 CLASS = 2 ASCITES = 2 VARICES = 2: 111 71.61
5 CLASS = 2 SEX = 1: 107 69.03
6 CLASS = 2 ASCITES = 2 SEX = 1: 101 65.16
7 CLASS = 2 VARICES = 2 SEX = 1: 101 65.16
8 CLASS = 2 ASCITES = 2 VARICES = 2 SEX = 1: 96 61.93
9 CLASS = 2 SPLEEN-PALPABLE = 2: 105 67.74
10 CLASS = 2 ASCITES = 2 SPLEEN-PALPABLE = 2: 101 65.16
11 CLASS = 2 VARICES = 2 SPLEEN-PALPABLE = 2: 102 65.80
12 CLASS = 2 ASCITES = 2 VARICES = 2 SPLEEN-PALPABLE = 2: 98 63.22
13 60% CLASS = 2 SEX = 1 SPLEEN-PALPABLE = 2: 93 Live 60
14 CLASS = 2 LIVER-BIG = 2: 101 65.16
15 CLASS = 2 ASCITES = 2 LIVER-BIG = 2: 96 61.93
16 CLASS = 2 VARICES = 2 LIVER-BIG = 2: 96 61.93
17 CLASS = 2 ANTIVIRALS = 2: 101 65.16
18 CLASS = 2 ASCITES = 2 ANTIVIRALS = 2: 95 61.29
19 CLASS = 2 VARICES = 2 ANTIVIRALS = 2: 94 60.64
20 ANOREXIA = 2 CLASS = 2: 101 65.16
21 ANOREXIA = 2 VARICES = 2 CLASS = 2: 97 62.58
22 ANOREXIA = 2 ASCITES = 2 CLASS = 2: 98 63.22
23 ANOREXIA = 2 VARICES = 2 ASCITES = 2 CLASS = 2: 95 61.29
24 SPIDERS = 2 CLASS = 2: 94 60.64
25 SPIDERS = 2 VARICES = 2 CLASS = 2: 94 60.64

BIG = 2 (YES)”, and ‘‘HISTOLOGY = 2 (YES)” only. Itemsets 1 and 2 show that approximately 20% of the (male) hepatitis patients died. Itemsets 3 and 4 indicate that even without the fatigue symptom, a male patient still has a 19% chance of fatality. Itemsets 5 and 6 reveal that a patient taking antiviral treatment has a 19% possibility of death. Itemsets 7 and 8 testify that a patient (male/female) without the fatigue symptom but taking antiviral treatment has an 18% probability of death. Itemsets 9 and 10 indicate that patients with an enlarged liver are 18% likely to die. Itemsets 11 and 12 testify that patients without the fatigue symptom but with an enlarged liver have a 17% fatality incidence. Itemsets 13 to 16 indicate that patients without the fatigue symptom but with an enlarged liver and taking antiviral treatment have more than a 16% chance of fatality. Itemsets 17 and 18 indicate that patients with liver histology have a 16% possibility of death. Itemsets 19 and 20 indicate that patients who have liver histology and are also taking antiviral treatment have a 15% chance of death. Similarly, itemsets 21 and 22 indicate that patients with liver histology and an enlarged liver have a 15% possibility of death.
The frequent itemsets with respect to ‘‘CLASS = LIVE” in Table 23 are generated by the Improved FP-Growth algorithm from the Hepatitis database for a minimum support threshold of 60%. The itemsets are composed of the frequent items ‘‘ASCITES = 2 (YES)”, ‘‘VARICES = 2 (YES)”, ‘‘SEX = 1 (MALE)”, ‘‘SPLEEN-PALPABLE = 2 (YES)”, ‘‘LIVER-BIG = 2 (YES)”, ‘‘ANTIVIRALS = 2 (YES)”, ‘‘ANOREXIA = 2 (YES)”, and ‘‘SPIDERS = 2 (YES)”. Itemset 1 indicates that hepatitis patients with mild fibrosis are more likely to survive. In itemset 2, ‘‘ASCITES = YES” indicates that if fluid is present between the abdomen and the organs, there is a 75% chance of survival. Itemset 3, with ‘‘VARICES = YES”, indicates that patients with enlarged veins have a 74% survival possibility. Compared to itemsets 2 and 3, itemset 4 testifies that the presence of fluid between the abdomen and the organs together with enlarged veins reduces the patient's survivability rate by approximately 3%. Itemset 5 indicates that male hepatitis patients with mild fibrosis have a 69% survival possibility. Itemsets 6 and 7 indicate that male hepatitis patients with fluid between the abdomen and the organs, or with enlarged veins, have a 65% probability of survival. However, itemset 8 testifies that male hepatitis patients with both fluid between the abdomen and the organs and enlarged veins have their survival probability reduced to 62%. Itemset 9 indicates that patients with an enlarged spleen have a 67% chance of recovery. Itemsets 10 to 12 indicate that patients with an enlarged spleen and fluid between the abdomen and the organs and/or enlarged veins have survival chances between 63% and 66%. Itemset 13 indicates that male patients with an enlarged spleen have a 60% possibility of survival. Itemset 14 indicates that patients with only a swollen liver have a 65% survival possibility. Likewise, itemsets 17 and 20 indicate that patients taking antiviral treatment, or with abnormally low body weight (‘‘ANOREXIA = YES”) or an eating disorder, have a 65% survival possibility. Similarly, itemset 23 indicates that patients with an eating disorder, enlarged veins, and fluid between the abdomen and the organs have a 61% survival possibility. Itemset 24, with ‘‘SPIDERS = YES”, indicates that patients with swollen blood vessels have a 60% chance of survival. Lastly, itemset 25 indicates that patients with swollen blood vessels and enlarged veins also have a survival possibility of 60%.
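The percentages above are simply the relative supports of the itemsets within one class partition. The following sketch (using hypothetical toy records, not the paper's data or implementation) shows how such a support value is computed under the attribute=value item encoding:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical records restricted to CLASS = LIVE, mirroring the
# attribute=value item encoding of the Hepatitis experiments.
live_records = [
    {"ASCITES=2", "VARICES=2", "SEX=1"},
    {"ASCITES=2", "VARICES=2", "SPIDERS=2"},
    {"ASCITES=2", "SEX=1", "SPIDERS=2"},
    {"VARICES=2", "SEX=1"},
]

print(support({"ASCITES=2"}, live_records))               # 0.75
print(support({"ASCITES=2", "VARICES=2"}, live_records))  # 0.5
```

An itemset is reported only if this value meets the chosen minimum support threshold (60% in the experiment above).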

Shafiul Alom Ahmed and B. Nath Information Sciences 576 (2021) 609–641

7. Conclusion

In this paper, we have introduced an Improved FP-tree construction algorithm and an efficient pattern growth algorithm to intelligently mine the complete set of frequent patterns without generating redundant or infrequent itemsets. The proposed Improved FP-tree construction algorithm substantially reduces tree construction time by resourcefully using the node-links maintained in the header table to manage the same-item node lists in the FP-tree. Every time a new node is inserted into the Improved FP-tree, our algorithm updates the node-link so that it bypasses the traversal of the same-item node list and directly accesses the last node of that list, saving a significant amount of time. Although the proposed tree data structure requires the same amount of space as the conventional FP-tree, its construction is considerably faster than that of the conventional FP-tree construction algorithm. Moreover, the main aim of the improved pattern growth algorithm is to construct an adequate conditional FP-tree structure that allows mining only the frequent itemsets without recursively constructing and mining sub-conditional FP-trees. Improved FP-Growth boosts mining performance by using the additional field ‘‘RelCount” in each node of the Improved Conditional FP-tree. Rather than recursively constructing sub-conditional FP-trees, our algorithm updates the ‘‘RelCount” of each prefix path of the Improved Conditional FP-tree and updates the relative header table counts by traversing the same-item node lists for each item in the header table. Finally, it generates the frequent itemsets from the header table item counts alone.
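The tail-pointer idea behind this speed-up can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the class and field names (Node, HeaderEntry, node_link) are our own assumptions:

```python
class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}
        self.node_link = None  # next node carrying the same item

class HeaderEntry:
    def __init__(self):
        self.head = None  # first node of the same-item node list
        self.tail = None  # last node: enables O(1) appends

def insert(tree_root, header, items):
    """Insert one support-ordered transaction into the tree."""
    cur = tree_root
    for item in items:
        child = cur.children.get(item)
        if child is None:
            child = Node(item, parent=cur)
            cur.children[item] = child
            entry = header.setdefault(item, HeaderEntry())
            if entry.tail is None:       # first node for this item
                entry.head = entry.tail = child
            else:                        # append without traversing the list
                entry.tail.node_link = child
                entry.tail = child
        else:
            child.count += 1
        cur = child

root = Node(None)
header = {}
for txn in (["a", "b"], ["a", "c"], ["a", "b", "c"]):
    insert(root, header, txn)
print(root.children["a"].count)  # 3
```

Keeping a tail reference in each header entry turns every same-item list append into a constant-time pointer update instead of a walk from the head of the list.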
The experimental results show that the Improved FP-tree construction algorithm outperforms the conventional FP-tree construction algorithm in terms of runtime in all cases, for both sparse and dense databases. The results on mining time show that for sparse databases the Improved FP-Growth algorithm outperforms the existing algorithms, and for dense databases its advantage over both FP-Growth and the COFI-tree mining algorithm is even more pronounced. Since it neither recursively constructs sub-conditional FP-trees nor generates candidate itemsets from each prefix path of the conditional FP-tree using additional lists, as the COFI-tree mining algorithm does, it achieves great compactness in terms of space complexity. The experimental results show that for dense databases the Improved FP-Growth algorithm requires significantly less space than COFI-tree or FP-Growth. Thus, our proposed frequent itemset mining algorithm achieves a good trade-off between mining time and the memory requirement of frequent pattern generation.
Life-threatening diseases are raising casualties day by day. Through in-depth exploration of the frequent itemsets produced by our proposed approach, we have established several factors responsible for these diseases and evaluated patients' survival prospects. Experiments on these disease databases with other suitable minimum support thresholds may reveal more useful information, which may help diagnostic clinics and medical experts diagnose breast cancer and hepatitis. In the near future, we will work to address the issues related to scalable and incremental mining. We would also like to use the validated and useful frequent patterns generated by our proposed approach to develop an efficient predictive model based on deep learning for risk identification and survivability analysis of different adverse diseases.

Authors’ Contributions

Shafiul Alom Ahmed: Participated in all experiments, organized the study, designed the research plan, coordinated the
data, result and complexity-analysis and contributed to the writing of the manuscript.
Bhabesh Nath: Designed the research plan, organized the study, and participated in complexity and result-analysis.

Ethics

This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved.

CRediT authorship contribution statement

Bhabesh Nath: Conceptualization, Methodology, Software, Supervision, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to acknowledge Tezpur University for supporting this research. We would also like to thank all friends and colleagues in the Department of Computer Science and Engineering, Tezpur University, for their support. The authors would also like to acknowledge the Maulana Azad National Fellowship (MANF), UGC, Govt. of India for the financial support to successfully conduct the research.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.ins.2021.07.061.


