
Engineering Applications of Artificial Intelligence 91 (2020) 103591


Ensemble-based active learning using fuzzy-rough approach for cancer sample classification✩

Ansuman Kumar 1, Anindya Halder 1,∗
Department of Computer Applications, North-Eastern Hill University, Tura Campus, Meghalaya 794002, India

ARTICLE INFO

Keywords:
Ensemble learning
Active learning
Cancer classification
Gene expression data
Fuzzy set
Rough set

ABSTRACT

Background and Objective: Classification of cancer from gene expression data is one of the major research areas in the field of machine learning and medical science. Generally, conventional supervised methods are not able to produce the desired classification accuracy because gene expression data contain too few training samples to train the system. An ensemble-based active learning technique can be effective in this situation, as each base classifier identifies a few informative samples and the decisions of all the base classifiers are ensembled to determine the most informative samples. The most informative samples are labeled by the subject experts and added to the training set, which can improve the classification accuracy.
Method: We propose a novel ensemble-based active learning method using the fuzzy-rough approach for cancer sample classification from microarray gene expression data. The proposed method is able to deal with the uncertainty, overlap and indiscernibility usually present in the subtype classes of gene expression data, and can improve the accuracy of the individual base classifiers in the presence of limited training samples.
Results: The proposed method is validated using eight microarray gene expression datasets. The performance of the proposed method in terms of classification accuracy, precision, recall, 𝐹1-measures and kappa is compared with six other methods. The improvements in accuracy achieved by the proposed method over its nearest competitive methods are 2.96%, 9.34%, 0.93%, 3.69%, 7.2% and 4.53% respectively for the Colon cancer, Prostate cancer, SRBCT, Ovarian cancer, DLBCL and Central nervous system datasets. Results of the paired 𝑡-test justify the statistical significance of the results in favor of the proposed method for most of the datasets.
Conclusion: The proposed method is an effective general-purpose ensemble-based active learning method adopting the fuzzy-rough concept, and can therefore be applied to other classification problems in future.

1. Introduction

Cancer is one of the most terrible diseases, caused by abnormal and uncontrolled growth of cells. Cancer cells usually behave differently than normal cells and can spread to other parts of the body. It is the second-leading cause of death worldwide, and approximately 9.6 million people die every year from cancer according to the Union for International Cancer Control (UICC), Switzerland (https://www.worldcancerday.org/what-cancer). Therefore, classification of cancer subtype classes at an initial stage has become a vital area of research worldwide for researchers and scientists.

Cancer classification by traditional methods is based on the morphological appearance of the tumor and on clinical tests. Traditional diagnostic processes are often time consuming, usually expensive and sometimes inaccurate. These methods are also often confined to the expert's observation in differentiating distinct cancer subtype classes, as most cancers are molecularly different and follow distinct clinical procedures.

Microarray (Stekel, 2003; Maroulis et al., 2006) data are often adopted to classify cancer samples in order to provide a low cost diagnosis. A microarray measures thousands of gene expression profiles simultaneously (Stekel, 2003). In microarray data the number of genes present is very large compared to the number of samples (Du et al., 2014), which yields the problem of curse-of-dimensionality. Additionally, the subtypes of cancer classes are also often indiscernible, vague, overlapping and ambiguous (Pawlak, 1991). Therefore, traditional computational methods may not achieve the desired accuracy, as the number of training samples is not adequate.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103591.
∗ Corresponding author.
E-mail addresses: ansuman.kumar@nehu.ac.in (A. Kumar), anindya.halder@nehu.ac.in (A. Halder).
1 Both authors contributed equally.

https://doi.org/10.1016/j.engappai.2020.103591
Received 26 December 2019; Received in revised form 5 February 2020; Accepted 26 February 2020; Available online 9 March 2020

Hence, it becomes crucial to design a classifier that can handle the above mentioned challenges and produce high classification accuracy (Lu and Han, 2003).

Several machine learning algorithms have been widely applied to microarray gene expression data analysis using supervised (i.e., classification) (Dettling and Buhlmann, 2003; Rapaport et al., 2007; Tan et al., 2011; Maniruzzaman et al., 2019), unsupervised (i.e., clustering) (Dettling and Buhlmann, 2002; Sturn et al., 2002; Jiang et al., 2004), semi-supervised clustering (Doan et al., 2011; Priscilla and Swamynathan, 2013; Wang and Pan, 2014), semi-supervised classification (Shi and Zhang, 2011; Halder and Misra, 2014; Maulik and Chakraborty, 2014), active learning based classification (Halder et al., 2015; Liu, 2004; Vogiatzis and Tsapatsoulis, 2008; Halder and Kumar, 2019; Kumar and Halder, 2019a), and ensemble based classification (Dettling and Buhlmann, 2003; Yang et al., 2010; Osareh and Shadgar, 2013; Tan and Gilbert, 2003; Xiao et al., 2018) frameworks. In the recent past, researchers have also applied fuzzy sets to uncertainty processing for multi-criteria decision making based on belief entropy (Xiao, 2019b), for workflow scheduling in distributed environments (Xiao et al., 2019), and for pattern classification problems (Xiao, 2019a).

Generally, classical supervised learning algorithms (Bishop, 2006) require a large number of training samples to classify the unlabeled samples. Labeled samples are often expensive, time consuming and difficult to obtain, whereas unlabeled samples are relatively easy to collect. Moreover, in microarray gene expression data the classes present are also often vague, indiscernible, imprecise and overlapping in nature (Maji and Pal, 2012). Therefore, traditional classifiers often fail to achieve the desired accuracy for cancer classification. Hence, researchers have tried to improve the prediction accuracy by using ensemble based classification (Dettling and Buhlmann, 2003; Yang et al., 2010; Osareh and Shadgar, 2013; Tan and Gilbert, 2003), where the decisions of a few base classifiers are combined (by techniques such as majority voting and weighted majority voting (Polikar, 2006)). Alternatively, people have also tried to enhance the classification accuracy by adopting the concept of active learning (Halder et al., 2015; Liu, 2004; Vogiatzis and Tsapatsoulis, 2008; Halder and Kumar, 2019; Kumar and Halder, 2019a). In active learning, the most informative samples are computationally chosen to get their labels from the experts, and these are then added to the limited training set. Thereby, the active learning method iteratively increases the number of training samples, which ultimately helps to improve the prediction accuracy.

Motivated by the individual advantages of the ensemble and active learning techniques, and in order to improve the accuracy further, in this article a novel ensemble-based active learning technique using the fuzzy-rough approach is proposed to classify cancer samples from gene expression data. It can handle the above said challenges, namely (i) ambiguity, overlap, vagueness and indiscernibility, (ii) the low predictive accuracy produced by an individual classifier, and (iii) the scarcity of clinically labeled samples.

In this context, an ensemble-based active learning technique is expected to be useful, as it judiciously combines the advantages of the ensemble learning and active learning strategies: the ensemble learning technique amalgamates the decisions of multiple base classifiers to produce a final decision that is expected to be better than that of any individual base classifier, whereas the active learning technique computationally selects the very few most informative samples with the help of the ensemble of base classifiers; these unlabeled samples are labeled by the subject experts and subsequently added to the small set of training samples to improve the classification accuracy.

Although ensemble-based active learning techniques have been applied, with promising results, in different areas such as gesture recognition (Schumacher et al., 2012), image classification (Beluch et al., 2018) and web document classification (Schnitzer et al., 2014), the technique has not been explored so far for microarray gene expression data analysis. To the authors' best knowledge, the method proposed in this article is the first of its kind to address the cancer classification problem from gene expression data by adopting the concept of ensemble-based active learning using rough-fuzzy theory.

The structure of the remaining article is organized as follows. The background theory related to the proposed method is briefly given in Section 2. Section 3 presents the description of the proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) method. Section 4 describes the experimental evaluation, and Section 5 reports the experimental results and discussions. Finally, the concluding remarks and future directions of work are given in Section 6.

2. Background

The proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) is a merger of fuzzy sets, rough sets, active learning and ensemble learning; brief descriptions of these are presented below.

2.1. Fuzzy set theory

Classical crisp sets are unable to handle imprecise and vague data. Therefore, Zadeh (1965) introduced fuzzy set theory as an extension of crisp sets to deal with imprecise and vague data. Let 𝑋 be a finite non-empty universal set. A fuzzy set 𝐴 is a mapping from the universe 𝑋 to [0, 1], i.e., 𝐴 ∶ 𝑋 → [0, 1]; the value 𝐴(𝑥) for 𝑥 ∈ 𝑋 is the membership degree of 𝑥 in 𝐴.

The t-norm 𝒯 and the t-conorm 𝒮 are extensions of the conjunction ∧ and the disjunction ∨ operations respectively; both map [0, 1]² → [0, 1] and fulfill the following conditions (Radzikowska and Kerre, 2002):
⋄ 𝒯 and 𝒮 are commutative
⋄ 𝒯 and 𝒮 are associative
⋄ 𝒯 and 𝒮 are increasing
⋄ ∀𝑥 ∈ 𝑋 ∶ 𝒯(𝑥, 1) = 𝑥 and ∀𝑥 ∈ 𝑋 ∶ 𝒮(𝑥, 0) = 𝑥.
Three popular t-norm operators, ∀𝑥, 𝑦 ∈ 𝑋, are defined as follows (Radzikowska and Kerre, 2002):
⋄ Minimum operator 𝒯𝑀(𝑥, 𝑦) = min(𝑥, 𝑦)
⋄ Algebraic product 𝒯𝑃(𝑥, 𝑦) = 𝑥 ∗ 𝑦
⋄ Lukasiewicz t-norm 𝒯𝐿(𝑥, 𝑦) = max(0, 𝑥 + 𝑦 − 1)
Three well-known t-conorm operators, ∀𝑥, 𝑦 ∈ 𝑋, are defined as follows (Radzikowska and Kerre, 2002):
⋄ Maximum operator 𝒮𝑀(𝑥, 𝑦) = max(𝑥, 𝑦)
⋄ Probabilistic sum 𝒮𝑃(𝑥, 𝑦) = 𝑥 + 𝑦 − 𝑥 ∗ 𝑦
⋄ Lukasiewicz t-conorm 𝒮𝐿(𝑥, 𝑦) = min(1, 𝑥 + 𝑦)
The implication → is extended by a fuzzy implicator ℐ, which is a mapping ℐ ∶ [0, 1]² → [0, 1] that satisfies ℐ(1, 0) = 0 and ℐ(1, 1) = ℐ(0, 1) = ℐ(0, 0) = 1. The Lukasiewicz implicator (Radzikowska and Kerre, 2002) is used in this article and is defined as:
⋄ Lukasiewicz implicator ℐ(𝑥, 𝑦) = min(1, 1 − 𝑥 + 𝑦), ∀𝑥, 𝑦 ∈ 𝑋.
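For illustration only (this sketch is ours, not part of the original article, and the function names are our own), the three Lukasiewicz connectives can be stated directly in Python:

```python
# Illustrative sketch of the Lukasiewicz connectives of Section 2.1.

def t_norm_lukasiewicz(x: float, y: float) -> float:
    """Lukasiewicz t-norm: T_L(x, y) = max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)

def t_conorm_lukasiewicz(x: float, y: float) -> float:
    """Lukasiewicz t-conorm: S_L(x, y) = min(1, x + y)."""
    return min(1.0, x + y)

def implicator_lukasiewicz(x: float, y: float) -> float:
    """Lukasiewicz implicator: I(x, y) = min(1, 1 - x + y)."""
    return min(1.0, 1.0 - x + y)

# Boundary conditions: I(1, 0) = 0 and I(1, 1) = I(0, 1) = I(0, 0) = 1.
assert implicator_lukasiewicz(1, 0) == 0
assert implicator_lukasiewicz(0, 0) == 1
```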
2.2. Rough set theory

Rough set theory (Pawlak, 1982) is a mathematical framework for dealing with imprecise and insufficient information. It can handle uncertainty, indiscernibility and incompleteness in datasets (Pawlak, 1991). It begins with the notion of an approximation space, which is a pair ⟨𝑋, 𝑅⟩, where 𝑋 is the non-empty universe of discourse and 𝑅 is an equivalence relation defined on 𝑋 (i.e., 𝑅 is reflexive, symmetric and transitive). For each subset 𝐴 ⊆ 𝑋 (Maji and Pal, 2007), the lower approximation is defined as the union of all the equivalence classes which are fully included inside the set 𝐴, and the upper approximation is defined as the union of the equivalence classes which have non-empty intersection with the set 𝐴. The lower and upper approximations are formally defined as follows:

𝑅 ↓ 𝐴 = {𝑥 ∈ 𝑋 | [𝑥]𝑅 ⊆ 𝐴}, (1)

𝑅 ↑ 𝐴 = {𝑥 ∈ 𝑋 | [𝑥]𝑅 ∩ 𝐴 ≠ ∅}, (2)

where [𝑥]𝑅 denotes the equivalence class of 𝑥. The tuple ⟨𝑅 ↓ 𝐴, 𝑅 ↑ 𝐴⟩ represents the rough set.
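As an illustrative sketch (ours, with hypothetical function names), the crisp approximations of Eqs. (1)–(2) can be computed directly when the equivalence relation is given as a partition of the universe into equivalence classes:

```python
# Sketch of Eqs. (1)-(2): lower/upper approximation of a set A under an
# equivalence relation represented by a partition of X.

def rough_approximations(equiv_classes, A):
    """Return the (lower, upper) approximations of set A."""
    A = set(A)
    lower, upper = set(), set()
    for cls in map(set, equiv_classes):
        if cls <= A:            # [x]_R fully inside A -> part of lower approximation
            lower |= cls
        if cls & A:             # [x]_R intersects A   -> part of upper approximation
            upper |= cls
    return lower, upper

# Example: X = {1..6} partitioned into {1,2}, {3,4}, {5,6}; A = {1, 2, 3}.
low, up = rough_approximations([{1, 2}, {3, 4}, {5, 6}], {1, 2, 3})
print(low, up)   # {1, 2} and {1, 2, 3, 4}
```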


2.3. Fuzzy-rough set theory

Fuzzy-rough set theory is a combination of the fuzzy set and rough set concepts. Fuzzy sets can handle vague information, whereas rough sets can handle incomplete information; the two concepts are complementary to each other. A fuzzy-rough set is the pair of lower and upper approximations of a set 𝐴 in a universe 𝑋 on which a fuzzy relation 𝑅 is defined. The fuzzy-rough lower and upper approximations of 𝐴 are defined as follows (Radzikowska and Kerre, 2002):

(𝑅 ↓ 𝐴)(𝑥⃗) = inf_{𝑦⃗∈𝑋} ℐ(𝑅(𝑥⃗, 𝑦⃗), 𝐴(𝑦⃗)), (3)

(𝑅 ↑ 𝐴)(𝑥⃗) = sup_{𝑦⃗∈𝑋} 𝒯(𝑅(𝑥⃗, 𝑦⃗), 𝐴(𝑦⃗)), (4)

where ℐ is the Lukasiewicz implicator, 𝒯 is the Lukasiewicz 𝑡-norm, 𝑅(𝑥⃗, 𝑦⃗) is the valued similarity of patterns 𝑥⃗ and 𝑦⃗, inf is the infimum and sup is the supremum.
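A minimal sketch (ours, not the authors' code) of Eqs. (3)–(4), assuming the fuzzy relation 𝑅 and the fuzzy set 𝐴 are supplied as dictionaries of membership values over a finite universe:

```python
# Sketch of Eqs. (3)-(4) using the Lukasiewicz connectives of Section 2.1.

def fr_lower(R, A, x, universe):
    """(R ↓ A)(x) = inf_y I(R(x, y), A(y)), Lukasiewicz implicator."""
    return min(min(1.0, 1.0 - R[(x, y)] + A[y]) for y in universe)

def fr_upper(R, A, x, universe):
    """(R ↑ A)(x) = sup_y T(R(x, y), A(y)), Lukasiewicz t-norm."""
    return max(max(0.0, R[(x, y)] + A[y] - 1.0) for y in universe)

# Tiny example with a two-element universe.
U = ["a", "b"]
R = {("a", "a"): 1.0, ("a", "b"): 0.4, ("b", "a"): 0.4, ("b", "b"): 1.0}
A = {"a": 0.9, "b": 0.2}
print(fr_lower(R, A, "a", U), fr_upper(R, A, "a", U))   # 0.8 and 0.9
```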

2.4. Active learning

Active learning is a branch of machine learning; it is also known as query learning (Settles, 2010). The active learning technique addresses the scarcity of labeled samples by asking queries in the form of unlabeled samples to be labeled by an expert (e.g., a human annotator) (Settles, 2010).

Active learning is being applied in many recent machine learning problems where unlabeled data are plentiful but labeled data are limited and/or expensive to obtain. A good survey of active learning techniques is given by Settles (2010). Active learning has been an effective research field in machine learning in the recent past, but it is still relatively new in the field of bioinformatics.

2.5. Ensemble learning

Ensemble learning combines the decisions of multiple base classifiers in such a way that the combined result can improve the overall classification accuracy over any individual base classifier (Kuncheva, 2004). Heterogeneity among the base classifiers and diversity in the training dataset are the basic ingredients for the success of an ensemble technique (Kuncheva, 2004). Some popular ensemble methods are Bagging, Boosting and Random Forest (Polikar, 2006). Ensemble methods have the ability to deal with small sample sizes and high dimensionality (Yang et al., 2010), and have therefore been widely applied to microarray gene expression data. A notable review of ensemble methods applied in bioinformatics may be found in Yang et al. (2010).

3. Ensemble-based active learning using fuzzy-rough nearest neighbor classifier

The proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) comprises (i) an ensemble-based active learning step and (ii) an ensemble-based testing step. In the first step, an ensemble-based active learning approach is adopted to search for the 'most informative' unlabeled samples, by taking the consensus (intersection) of the 𝑃 informative sample sets, and to get their labels from the subject experts, so that the labeled 'most informative samples' can be augmented iteratively to the training set to enhance the classification accuracy. In the ensemble-based testing step, the test samples are classified by all the base classifiers based on the final extended training set (obtained from the ensemble-based active learning step). Thereafter, the final ensemble decisions for the test samples are made by the majority voting technique applied on the predictions of the different base classifiers.

The block diagram of the proposed EnALFRNN method is shown in Fig. 1, and detailed descriptions of the methodology are furnished below.

3.1. Ensemble-based active learning step

In this step, the ensemble technique is adopted with the active learning strategy. Let 𝐿 = {⟨𝑙⃗𝑗, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑗⟩ | 𝑗 = 1 𝑡𝑜 |𝐿|} be the finite set of labeled samples distributed in the 𝑁 classes present in the dataset, where 𝑙⃗𝑗 denotes the 𝑗th labeled pattern, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑗 is the class label of pattern 𝑙⃗𝑗, and |𝐿| is the cardinality of 𝐿. In order to keep diversity in the training set, in the proposed ensemble technique the labeled set 𝐿 is divided into subsets 𝐿1, 𝐿2, …, 𝐿𝑃 using the bootstrap strategy. The algorithm then selects the 𝑘-nearest neighbor (𝑘𝑁𝑁) labeled samples of each unlabeled sample in 𝑈 with respect to the labeled subsets 𝐿1, 𝐿2, …, 𝐿𝑃, details of which are given below.

3.1.1. Selection of 𝑘-nearest neighbor labeled samples
Select the 𝑘𝑁𝑁 labeled samples from the labeled subset 𝐿𝑝 (∀𝑝 = 1 𝑡𝑜 𝑃) nearest to each unlabeled sample 𝑢⃗𝑖 ∈ 𝑈, based on the Euclidean distance (Duda et al., 2000):

𝑑𝑖𝑠𝑡_{𝑢⃗𝑖,𝑙⃗𝑗} = ‖𝑢⃗𝑖 − 𝑙⃗𝑗‖, ∀𝑖, ∀𝑗, (5)

where 𝑢⃗𝑖 ∈ 𝑈 is the 𝑖th unlabeled sample and 𝑙⃗𝑗 ∈ 𝐿𝑝 is the 𝑗th labeled sample from the 𝑝th labeled subset 𝐿𝑝.

3.1.2. Computation of fuzzy-rough lower and upper approximations
The values of the fuzzy-rough lower (𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) and upper (𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖) approximations of each unlabeled sample 𝑢⃗𝑖 ∈ 𝑈 for belonging to each class 𝐶𝑗 (∀𝑗 = 1 𝑡𝑜 𝑁) are calculated respectively as follows:

(𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) = inf_{𝑙⃗𝑥∈𝑘𝑁𝑁} ℐ(𝑅(𝑢⃗𝑖, 𝑙⃗𝑥), 𝐶𝑗(𝑙⃗𝑥)) = min_{𝑙⃗𝑥∈𝑘𝑁𝑁} (min(1, 1 − 𝑅(𝑢⃗𝑖, 𝑙⃗𝑥) + 𝐶𝑗(𝑙⃗𝑥))), ∀𝑖, ∀𝐶𝑗, (6)

(𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖) = sup_{𝑙⃗𝑥∈𝑘𝑁𝑁} 𝒯(𝑅(𝑢⃗𝑖, 𝑙⃗𝑥), 𝐶𝑗(𝑙⃗𝑥)) = max_{𝑙⃗𝑥∈𝑘𝑁𝑁} (max(0, 𝑅(𝑢⃗𝑖, 𝑙⃗𝑥) + 𝐶𝑗(𝑙⃗𝑥) − 1)), ∀𝑖, ∀𝐶𝑗, (7)

where ℐ is the Lukasiewicz implicator, 𝒯 is the Lukasiewicz 𝑡-norm, and 𝑅(𝑢⃗𝑖, 𝑙⃗𝑥) is computed as (Jensen and Cornelis, 2011):

𝑅(𝑢⃗𝑖, 𝑙⃗𝑥) = (‖𝑢⃗𝑖 − 𝑙⃗𝑥‖)^{−2∕(𝑚−1)} ∕ Σ_{𝑙⃗𝑗∈𝑘𝑁𝑁} (‖𝑢⃗𝑖 − 𝑙⃗𝑗‖)^{−2∕(𝑚−1)}, (8)

where ‖𝑢⃗𝑖 − 𝑙⃗𝑥‖ is the distance of the 𝑖th unlabeled sample 𝑢⃗𝑖 ∈ 𝑈 from the 𝑥th labeled sample 𝑙⃗𝑥 ∈ 𝑘𝑁𝑁 (the 𝑘-nearest neighbor labeled samples of the 𝑖th unlabeled sample 𝑢⃗𝑖) and 𝑚 (1 < 𝑚 < ∞) is the fuzzifier. 𝐶𝑗(𝑙⃗𝑥) is computed as:

𝐶𝑗(𝑙⃗𝑥) = 1 if 𝑙⃗𝑥 ∈ 𝐶𝑗, and 0 otherwise. (9)

A higher value of (𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) reflects that many of 𝑢⃗𝑖's neighbors belong to class 𝐶𝑗, while a high value of (𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖) indicates that at least one (or some) of 𝑢⃗𝑖's neighbors belong to class 𝐶𝑗.
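The following is an illustrative sketch (ours) of Eqs. (5)–(9) for a single unlabeled sample, assuming Eq. (8) is the normalized inverse-distance similarity with fuzzifier 𝑚 and that samples are stored as NumPy arrays:

```python
import numpy as np

def knn_approximations(u, L, y, n_classes, k=5, m=2.0):
    """Per-class fuzzy-rough (lower, upper) approximations of one unlabeled
    sample u, following Eqs. (5)-(9).

    u: (d,) unlabeled sample; L: (n, d) labeled samples; y: (n,) class indices.
    """
    d = np.linalg.norm(L - u, axis=1)              # Eq. (5): Euclidean distances
    d = np.maximum(d, 1e-12)                       # guard against zero distance
    nn = np.argsort(d)[:k]                         # indices of the kNN labeled samples
    w = d[nn] ** (-2.0 / (m - 1.0))
    R = w / w.sum()                                # Eq. (8): similarity values in [0, 1]
    lower = np.empty(n_classes)
    upper = np.empty(n_classes)
    for j in range(n_classes):
        Cj = (y[nn] == j).astype(float)            # Eq. (9): crisp class membership
        lower[j] = np.min(np.minimum(1.0, 1.0 - R + Cj))   # Eq. (6)
        upper[j] = np.max(np.maximum(0.0, R + Cj - 1.0))   # Eq. (7)
    return lower, upper
```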
3.1.3. Find the informative unlabeled samples
The sets of informative samples 𝐼𝑝𝑛 = {𝑖⃗𝑛𝑗 ∣ 𝑗 = 1 𝑡𝑜 |𝐼𝑝𝑛|} are computed from the unlabeled set 𝑈 with respect to each labeled subset 𝐿𝑝 (∀𝑝 = 1 𝑡𝑜 𝑃) in the 𝑛th iteration. The computation is based on the average value of the fuzzy-rough lower and upper approximations of the unlabeled sample for belonging to a certain class, as provided below.

Suppose the average value of the fuzzy-rough lower and upper approximations of the unlabeled sample 𝑢⃗𝑖 for belonging to a class 𝐶𝑗 is 𝑎𝑣𝑔𝑖𝑗 = ((𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) + (𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖))∕2, and the highest such average value, obtained for a class 𝐶ℎ, is 𝑚𝑎𝑥ℎ = max𝑗(𝑎𝑣𝑔𝑖𝑗), ∀𝑖.


Fig. 1. Block diagram of the proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN).

The degree of similarity of an unlabeled sample 𝑢⃗𝑖 between a class 𝐶𝑗 and the highest-belonging class 𝐶ℎ is represented by the ratio of 𝑎𝑣𝑔𝑖𝑗 to 𝑚𝑎𝑥ℎ, and this ratio lies between 0 and 1. The higher the value of the ratio, the more similar the unlabeled sample is to the two classes 𝐶𝑗 and 𝐶ℎ, and hence the lower the confidence of that unlabeled sample belonging to any single class. Therefore, if the value of the ratio 𝑎𝑣𝑔𝑖𝑗∕𝑚𝑎𝑥ℎ (∀𝑗 ≠ ℎ) is greater than a 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 value (close to 1), then the corresponding sample 𝑢⃗𝑖 is selected as an informative sample and added to the set of informative samples 𝐼𝑝𝑛; otherwise, 𝑢⃗𝑖 is not added to 𝐼𝑝𝑛 in the 𝑛th iteration.

3.1.4. Find the most informative unlabeled samples
Once the informative sample sets 𝐼𝑝𝑛 (∀𝑝 = 1 𝑡𝑜 𝑃) are selected (as in Section 3.1.3) in the 𝑛th iteration with respect to each labeled subset 𝐿𝑝 (∀𝑝 = 1 𝑡𝑜 𝑃), the most informative samples are determined by taking the intersection (consensus) of all the informative sample sets 𝐼𝑝𝑛 as follows:

𝐼𝑛 = 𝐼1𝑛 ∩ 𝐼2𝑛 ∩ … ∩ 𝐼𝑃𝑛. (10)

3.1.5. Convergence of ensemble-based active learning step
The ensemble-based active learning step is continued until convergence. The process is said to converge when no new most informative sample remains to be appended to the labeled set.
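A compact sketch (ours, reusing the knn_approximations function from the sketch after Section 3.1.2) of the selection of informative samples (Section 3.1.3) and their consensus across the 𝑃 bootstrap subsets (Section 3.1.4, Eq. (10)):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_subsets(L, y, P):
    """Draw P bootstrap samples (with replacement) of the labeled set."""
    idx = [rng.integers(0, len(L), size=len(L)) for _ in range(P)]
    return [(L[i], y[i]) for i in idx]

def informative_indices(U, Lp, yp, n_classes, threshold=0.9, k=5, m=2.0):
    """Section 3.1.3: keep samples with avg_ij / max_h > threshold for all j != h."""
    picked = set()
    for i, u in enumerate(U):
        lower, upper = knn_approximations(u, Lp, yp, n_classes, k, m)
        avg = (lower + upper) / 2.0                  # avg_ij for all classes j
        h = int(np.argmax(avg))                      # most likely class C_h
        others = np.delete(avg, h)
        if np.all(others / (avg[h] + 1e-12) > threshold):
            picked.add(i)                            # confusing, hence informative
    return picked

def most_informative(U, subsets, n_classes, **kw):
    """Section 3.1.4, Eq. (10): consensus I^n = I_1^n ∩ ... ∩ I_P^n."""
    sets = [informative_indices(U, Lp, yp, n_classes, **kw) for Lp, yp in subsets]
    return set.intersection(*sets)
```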


3.2. Ensemble-based testing step

Once the ensemble-based active learning step has converged, the set of test samples 𝑇 = {𝑡⃗𝑖 ∣ 𝑖 = 1 𝑡𝑜 |𝑇|} (where |𝑇| is the cardinality of 𝑇) is tested to assign the class labels based on the enlarged set of labeled samples 𝐿 obtained from the first step, as follows.

The ensemble-based testing step begins with the creation of new labeled subsets 𝐿1, 𝐿2, …, 𝐿𝑃 using the bootstrap method applied on the enlarged training set 𝐿 obtained from Step 1 (the ensemble-based active learning step). Then the fuzzy-rough nearest neighbor (FRNN) classifiers are applied as base classifiers, trained respectively with the new labeled subsets 𝐿1, 𝐿2, …, 𝐿𝑃, as follows:

3.2.1. Selection of 𝑘-nearest neighbor labeled samples
Select the 𝑘𝑁𝑁 labeled samples nearest to each test sample 𝑡⃗𝑖. The Euclidean distance (Duda et al., 2000) is used to compute the distance from the labeled sample to the test sample, as defined in Eq. (11):

𝑑𝑖𝑠𝑡_{𝑡⃗𝑖,𝑙⃗𝑗} = ‖𝑡⃗𝑖 − 𝑙⃗𝑗‖, ∀𝑡⃗𝑖, ∀𝑗, (11)

where 𝑡⃗𝑖 is the test sample and 𝑙⃗𝑗 ∈ 𝐿𝑝 is the 𝑗th labeled sample.

3.2.2. Calculation of fuzzy-rough lower and upper approximations
The values of the fuzzy-rough lower (𝑅 ↓ 𝐶𝑗)(𝑡⃗𝑖) and upper (𝑅 ↑ 𝐶𝑗)(𝑡⃗𝑖) approximations of test sample 𝑡⃗𝑖 for belonging to each class 𝐶𝑗 (∀𝑗 = 1 𝑡𝑜 𝑁) are calculated respectively as follows:

(𝑅 ↓ 𝐶𝑗)(𝑡⃗𝑖) = inf_{𝑙⃗𝑥∈𝑘𝑁𝑁} ℐ(𝑅(𝑡⃗𝑖, 𝑙⃗𝑥), 𝐶𝑗(𝑙⃗𝑥)) = min_{𝑙⃗𝑥∈𝑘𝑁𝑁} (min(1, 1 − 𝑅(𝑡⃗𝑖, 𝑙⃗𝑥) + 𝐶𝑗(𝑙⃗𝑥))), ∀𝑡⃗𝑖, ∀𝐶𝑗, (12)

(𝑅 ↑ 𝐶𝑗)(𝑡⃗𝑖) = sup_{𝑙⃗𝑥∈𝑘𝑁𝑁} 𝒯(𝑅(𝑡⃗𝑖, 𝑙⃗𝑥), 𝐶𝑗(𝑙⃗𝑥)) = max_{𝑙⃗𝑥∈𝑘𝑁𝑁} (max(0, 𝑅(𝑡⃗𝑖, 𝑙⃗𝑥) + 𝐶𝑗(𝑙⃗𝑥) − 1)), ∀𝑡⃗𝑖, ∀𝐶𝑗, (13)

where ℐ is the Lukasiewicz implicator, 𝒯 is the Lukasiewicz 𝑡-norm, and 𝑅(𝑡⃗𝑖, 𝑙⃗𝑥) is computed as (Jensen and Cornelis, 2011):

𝑅(𝑡⃗𝑖, 𝑙⃗𝑥) = (‖𝑡⃗𝑖 − 𝑙⃗𝑥‖)^{−2∕(𝑚−1)} ∕ Σ_{𝑙⃗𝑗∈𝑘𝑁𝑁} (‖𝑡⃗𝑖 − 𝑙⃗𝑗‖)^{−2∕(𝑚−1)}, (14)

where ‖𝑡⃗𝑖 − 𝑙⃗𝑥‖ is the distance of the test sample 𝑡⃗𝑖 from the 𝑥th labeled pattern 𝑙⃗𝑥 ∈ 𝑘𝑁𝑁 (the 𝑘-nearest neighbor labeled samples of test sample 𝑡⃗𝑖) and 𝑚 (1 < 𝑚 < ∞) is the fuzzifier. 𝐶𝑗(𝑙⃗𝑥) is computed using Eq. (9).

3.2.3. Assignment of class information to the test samples
The test sample 𝑡⃗𝑖 (∀𝑖 = 1 𝑡𝑜 |𝑇|) is assigned to the particular class for which the average value of the fuzzy-rough lower and upper approximations is highest. The assigned class 𝑐𝑙𝑎𝑠𝑠_𝑙𝑎𝑏𝑒𝑙𝑗𝑝(𝑡⃗𝑖) of a test sample 𝑡⃗𝑖 is determined for the 𝑝th base classifier 𝐹𝑅𝑁𝑁𝑝(𝑡⃗𝑖) as follows:

𝑐𝑙𝑎𝑠𝑠_𝑙𝑎𝑏𝑒𝑙𝑗𝑝(𝑡⃗𝑖) = arg max𝑗 (((𝑅 ↓ 𝐶𝑗)(𝑡⃗𝑖) + (𝑅 ↑ 𝐶𝑗)(𝑡⃗𝑖))∕2)𝑝, ∀𝑡⃗𝑖, ∀𝑝. (15)

3.2.4. Majority voting for final class assignment
Once the predictions are made by all the individual FRNN classifiers for the test sample 𝑡⃗𝑖 (∀𝑖 = 1 𝑡𝑜 |𝑇|), the final class is assigned to that test sample by the majority voting policy applied on the individual FRNN classifiers' decisions. In majority voting, each base classifier votes for one class label for the test sample 𝑡⃗𝑖, and the final class label assigned to the test sample 𝑡⃗𝑖 is the class that receives the maximum number of votes. If the test sample 𝑡⃗𝑖 receives the same (maximum) number of votes for two or more classes, then the test sample 𝑡⃗𝑖 is assigned the final class label for which the average value of the fuzzy-rough lower and upper approximations is highest over all the base classifiers. Assignment of the final class label to the test sample 𝑡⃗𝑖 (∀𝑖 = 1 𝑡𝑜 |𝑇|) is done as follows:

𝐹𝑖𝑛𝑎𝑙𝐶𝑙𝑎𝑠𝑠𝐿𝑎𝑏𝑒𝑙(𝑡⃗𝑖) = 𝐶𝑗, if Σ_{𝑝∈𝑃} 𝑐𝑙𝑎𝑠𝑠_𝑙𝑎𝑏𝑒𝑙𝑗𝑝(𝑡⃗𝑖) > Σ_{𝑝∈𝑃} 𝑐𝑙𝑎𝑠𝑠_𝑙𝑎𝑏𝑒𝑙ℎ𝑝(𝑡⃗𝑖), ∀ℎ = 1 𝑡𝑜 𝑁 (ℎ ≠ 𝑗); otherwise, 𝐹𝑖𝑛𝑎𝑙𝐶𝑙𝑎𝑠𝑠𝐿𝑎𝑏𝑒𝑙(𝑡⃗𝑖) = arg max_{∀𝑗,∀𝑝} (((𝑅 ↓ 𝐶𝑗)(𝑡⃗𝑖) + (𝑅 ↑ 𝐶𝑗)(𝑡⃗𝑖))∕2)𝑝. (16)

where 𝑐𝑙𝑎𝑠𝑠_𝑙𝑎𝑏𝑒𝑙𝑗𝑝(𝑡⃗𝑖) indicates that the 𝑝th FRNN base classifier predicted test sample 𝑡⃗𝑖 to belong to class 𝐶𝑗, 𝑃 is the number of classifiers and 𝑁 is the number of classes present.

The entire procedure of the proposed ensemble-based active learning using fuzzy-rough nearest neighbor (EnALFRNN) classifier is summarized in Algorithm 1.

Algorithm 1: Ensemble-based Active Learning using Fuzzy-Rough Nearest Neighbor (EnALFRNN) Classifier
Input: Set of labeled samples 𝐿 = {⟨𝑙⃗𝑗, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑗⟩ | 𝑗 = 1 𝑡𝑜 |𝐿| and 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑗 ∈ 𝐶}, set of unlabeled samples 𝑈 = {𝑢⃗𝑗 | 𝑗 = 1 𝑡𝑜 |𝑈|} and set of test samples 𝑇 = {𝑡⃗𝑖 | 𝑖 = 1 𝑡𝑜 |𝑇|}.
Parameters: 𝑘 and 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 (user defined).
Temporary variables: 𝐼𝑛 (set of most informative samples at the 𝑛th iteration), 𝑎𝑣𝑔𝑖𝑗, 𝑚𝑎𝑥ℎ, 𝑓𝑙𝑎𝑔 (binary) and 𝑏𝑣𝑎𝑟 (binary).
Output: Predicted class information of each test sample 𝑡⃗𝑖 ∈ 𝑇.
Method:
Ensemble-based active learning phase
1: Initialize 𝑓𝑙𝑎𝑔 = 1, 𝑛 = 1 and 𝐼𝑛 = ∅;
2: while (𝑓𝑙𝑎𝑔 == 1)
3:   Divide the labeled set 𝐿 into subsets 𝐿1, 𝐿2, …, 𝐿𝑃 using bootstrap
4:   for 𝑝 = 1 𝑡𝑜 𝑃
5:     𝐼𝑝𝑛 = FindInformativeSamples(𝐿𝑝, 𝑈)
6:     if (𝑝 == 1) then 𝐼𝑛 = 𝐼𝑝𝑛 else 𝐼𝑛 = 𝐼𝑛 ∩ 𝐼𝑝𝑛; (so that 𝐼𝑛 = 𝐼1𝑛 ∩ … ∩ 𝐼𝑃𝑛, as in Eq. (10))
7:   end for
8:   if (𝐼𝑛 ≠ ∅) then
9:     Most informative set 𝐼𝑛 is labeled by the subject expert
10:    𝐿 = 𝐿 + 𝐼𝑛; 𝑈 = 𝑈 − 𝐼𝑛;
11:    𝑛 = 𝑛 + 1; 𝐼𝑛 = ∅;
12:  else
13:    𝑓𝑙𝑎𝑔 = 0;
14:  end if
15: end while
FindInformativeSamples(𝐿𝑝, 𝑈)
16: for each unlabeled pattern 𝑢⃗𝑖 ∈ 𝑈
17:   for each class 𝐶𝑗
18:     Calculate the value of (𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) using Eq. (6)
19:     Calculate the value of (𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖) using Eq. (7)
20:     Average value 𝑎𝑣𝑔𝑖𝑗 = ((𝑅 ↓ 𝐶𝑗)(𝑢⃗𝑖) + (𝑅 ↑ 𝐶𝑗)(𝑢⃗𝑖))∕2
21:   end for
22:   Determine the highest of 𝑎𝑣𝑔𝑖𝑗 ∀𝑗 as 𝑚𝑎𝑥ℎ = max𝑗(𝑎𝑣𝑔𝑖𝑗)
23:   for each class 𝐶𝑗 (𝑗 ≠ ℎ)
24:     if (𝑎𝑣𝑔𝑖𝑗∕𝑚𝑎𝑥ℎ > 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑) then
25:       𝑏𝑣𝑎𝑟 = 1;
26:     else
27:       𝑏𝑣𝑎𝑟 = 0;
28:       break;
29:     end if
30:   end for
31:   if (𝑏𝑣𝑎𝑟 == 1) then
32:     Add unlabeled sample 𝑢⃗𝑖 into the set of informative samples 𝐼𝑝𝑛
33:   end if
34: end for
35: return set of informative samples 𝐼𝑝𝑛


Ensemble-based testing phase
1: Divide the extended labeled set 𝐿 into subsets 𝐿1, 𝐿2, …, 𝐿𝑃 using bootstrap
2: for 𝑝 = 1 𝑡𝑜 𝑃
3:   ⟨𝑇, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑝⟩ = FRNNClassifier(𝐿𝑝, 𝑇)
4: end for
5: ⟨𝑇, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓⟩ = MajorityVoting(⟨𝑇, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓1⟩, …, ⟨𝑇, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑃⟩) using Eq. (16)
FRNNClassifier(𝐿𝑝, 𝑇)
6: for 𝑖 = 1 𝑡𝑜 |𝑇|
7:   Select the 𝑘-nearest neighbor labeled samples (𝑘𝑁𝑁) of each test sample 𝑡⃗𝑖 ∈ 𝑇 (as described in Section 3.2.1).
8:   Compute the values of (𝑅 ↓ 𝐶𝑗)(𝑡⃗𝑖) and (𝑅 ↑ 𝐶𝑗)(𝑡⃗𝑖) of each test sample 𝑡⃗𝑖 ∈ 𝑇 for belonging to each class 𝐶𝑗 (∀𝑗) using Eqs. (12) and (13), respectively.
9:   Assign the class label information ⟨𝑡⃗𝑖, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑝⟩ to each test sample 𝑡⃗𝑖 ∈ 𝑇 using Eq. (15).
10: end for
11: return the test set with class label information ⟨𝑇, 𝐶𝑙𝑎𝑠𝑠𝐼𝑛𝑓𝑝⟩.
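For illustration (this sketch is ours, not the authors' implementation), the ensemble decision of Eq. (16) for one test sample can be written as follows, with ties in the majority vote broken by the largest average of the fuzzy-rough lower and upper approximations over all classifiers and classes:

```python
import numpy as np

def final_class_label(votes, lowers, uppers):
    """Eq. (16) for one test sample.

    votes: (P,) predicted class index per base classifier;
    lowers, uppers: (P, N) per-classifier approximations for that sample."""
    n_classes = lowers.shape[1]
    counts = np.bincount(votes, minlength=n_classes)
    winners = np.flatnonzero(counts == counts.max())
    if len(winners) == 1:
        return int(winners[0])                      # clear majority
    avg = (lowers + uppers) / 2.0                   # (P, N) average approximations
    _, j_best = np.unravel_index(np.argmax(avg), avg.shape)
    return int(j_best)

# Example: two classifiers disagree (1-1 tie); the highest average (0.7) wins.
votes = np.array([0, 1])
lowers = np.array([[0.2, 0.1], [0.1, 0.6]])
uppers = np.array([[0.4, 0.2], [0.3, 0.8]])
print(final_class_label(votes, lowers, uppers))     # -> 1
```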
for Science and Research, 0000). The datasets comprise of samples de-
scribed by gene expression values with class information. We have used
small number of labeled samples for the training purpose and the re-
3.2.5. Challenges addressed by the proposed method maining of the samples are used as the test data. Detailed explanations
The challenges described in the introduction section are addressed of the used datasets are given below.
by the proposed method in the following way. (i) To deal the inher- Colon cancer dataset (Alon et al., 1999) consists of 62 samples in
ent vagueness, indiscernibility, ambiguity and overlappingness usually which 40 samples are from cancerous patients and 22 samples are from
present in the subtype classes of microarray gene expression cancer normal patients. Each sample is described by 2000 genes.
dataset the fuzzy-rough set concept is used to compute the fuzzy-rough Prostate cancer dataset (Singh et al., 2002) is having 102 samples
lower and upper approximations in Section 3.1.2, and subsequently distributed in two classes. The number of observations for prostate
used in finding the informative unlabeled samples (in Section 3.1.3) cancer tissues and normal patients tissues are 52 and 50 respectively.
followed by finding the most informative unlabeled samples (in Sec- The expression profile contains 6033 genes.
tion 3.1.4) in the Ensemble based active learning phase (Section 3.1) . SRBCT dataset (Khan et al., 2001) contains 63 observations in
Similarly, the fuzzy-rough concept is also applied in the ensemble based which 23 observations are from Ewing’s sarcoma (ES) type, 20 obser-
vations are rhabdomyosarcoma (RS), 12 observations are from neurob-
testing phase (Section 3.2) to compute the fuzzy-rough lower and upper
lastoma (NB) type and 8 observations are of Burkitt’s lymphoma (BL)
and lower approximations (in Section 3.2.2) which subsequently are
class. There are 2308 genes present in each observation.
used to assign class information to the test samples (in Sections 3.2.3
Ovarian cancer dataset (Technology Agency for Science and Re-
and 3.2.4). (ii) The low predictive accuracy produced by the individual
search, 0000) comprises of 203 samples out of which 91 samples are
classifier is improved by the proposed method by adopting the concept
normal and 162 samples are ovarian cancers. Each sample consists of
of ensemble approach in the active learning phase (in Section 3.1) to
expression values of 15 154 genes.
find the most informative unlabeled samples as it takes the consensus
Diffuse large b-cell lymphoma (DLBCL)-Harvard dataset (Tech-
(in Section 3.1.4) of all the informative samples produced by the base nology Agency for Science and Research, 0000) is having 77 samples
classifiers. Likewise, the ensemble strategy is also adopted in the testing scattered into two classes. The number of samples for DLBCL and
phase (Section 3.2) to find the consensus decision by the majority follicular lymphoma are 58 and 19 respectively. There are 7129 genes
voting policy (in Section 3.2.4) applied on individual FRNN classifier’s present in each sample.
decisions/predictions. (iii) The scarcity of the labeled samples are Central nervous system dataset (Technology Agency for Science
handled by the proposed method using the active learning strategy and Research, 0000) contains 60 samples of embryonal tumor and each
adopted in the first phase (Section 3.1) to choose the most informative sample is characterized by 7129 genes. Dataset is distributed in two
samples in order to label them by the experts which in turn are classes of which 21 are survivors and 39 are failures. Survivor patients
added to the training set to improve the learning process. It is worthy are alive after treatment while the failures are those who succumbed
to mention here that instead of choosing the samples randomly for to their disease.
labeling, the informative samples chosen (for experts labeling) by the Lung cancer (Harvard-1) dataset (Technology Agency for Science
active learning technique yields better prediction accuracy compared and Research, 0000) consists of 203 samples in which 139 samples are
to random sampling as reported in the literatures (Halder and Kumar, of lung adenocarcinomas, 20 samples are from pulmonary carcinoids,
2019; Kumar and Halder, 2019b; Vogiatzis and Tsapatsoulis, 2008). 21 samples belongs to squamous cell lung carcinomas, 6 samples are
part of small-cell lung carcinomas and 17 are normal lung samples.
Each sample contains expression values of 12 600 genes.
4. Experimental evaluation
Leukemia dataset (Golub et al., 1999) is having of 72 observa-
tions. Among them, 47 observations are from lymphoblastic leukemia
In this section details of the datasets used in the present investiga- and 25 observations are from myeloid leukemia. Each observation is
tion along with the other compared methods are reported followed by represented by 3571 genes.
performance evaluation measures. Finally, experimental setup is also A brief overview of the datasets used for the experiment is summa-
summarized. rized in Table 1.


4.2. The methods compared

The performance of the proposed method is compared against six state-of-the-art techniques, of which three are ensemble learning based techniques, namely Bagging IBK, Bagging SMO and the ensemble based fuzzy-rough nearest neighbor (EnFRNN); two are active learning based classifiers, namely active learning using the fuzzy 𝑘-nearest neighbor classifier (ALFKNN) and active learning using the rough fuzzy classifier (ALRFC); and one is a supervised learning method, namely the fuzzy-rough nearest neighbor (FRNN) classifier.

Bagging IBK is a bagging ensemble (Breiman, 1996) method using instance based learning with IBK (with parameter 𝑘) (Aha et al., 1991) as the base classifier. IBK is also known as the 𝑘-nearest neighbor (𝑘-NN) (Keller et al., 1985) classifier, and it is one of the lazy-learning methods available in WEKA (http://www.cs.waikato.ac.nz/ml/weka/). 𝑘-NN is one of the simplest methods for classification: the class label of a test sample is assigned based on the 𝑘 nearest labeled samples of that test sample, where 𝑘 is a positive number.

In Bagging SMO (Platt, 1998), the bagging model is used as the ensemble technique, and sequential minimal optimization (SMO) with a polynomial kernel function is used in the support vector machine (SVM) (Platt, 1998) as the base classifier. The SMO model solves a small analytic quadratic programming (QP) problem in each step, whereas the standard SVM algorithm requires the solution of a very large numerical QP optimization problem.

The EnFRNN (Kumar and Halder, 2019b) method combines the predictions of all the FRNN (Jensen and Cornelis, 2011) base classifiers with the help of majority voting, which improves the classification accuracy. Here the fuzzy set handles the vagueness and ambiguity, and the rough set handles the uncertainty, incompleteness and indiscernibility present in the gene expression data.

The ALFKNN (Halder et al., 2015) method uses an active learning strategy to select the most confusing samples from the unlabeled microarray gene expression datasets for labeling by the subject expert. In this method, the selection of the most confusing samples and the classification of the test samples are performed with the help of the fuzzy 𝑘-nearest neighbor (𝑘-NN) (Keller et al., 1985) method.

In the ALRFC (Halder and Kumar, 2019) method, a rough fuzzy approach is used to select the most confusing samples from the pool of unlabeled samples to be labeled by the experts; those samples, together with their class information, are added to the labeled training dataset, which increases the classification accuracy. In this method, classification of the test samples is done with the help of the fuzzy membership values of the test samples for belonging to particular classes.

FRNN (Jensen and Cornelis, 2011) is a supervised learning method. In FRNN, the nearest neighbor approach is used to compute the values of the lower and upper approximations of the test samples. A test sample is assigned to the particular class for which the average value of the lower and upper approximations is highest.

4.3. Performance validity measures

To assess the performance of the proposed method, various validity measures are used, namely (𝑖) percentage accuracy, (𝑖𝑖) precision, (𝑖𝑖𝑖) recall, (𝑖𝑣) macro averaged 𝐹1 measure, (𝑣) micro averaged 𝐹1 measure (Halder et al., 2013), and (𝑣𝑖) kappa (Cohen, 1960).

Percentage accuracy: It is the percentage ratio of correctly classified test samples to the total number of test samples. It is defined as:

𝑝𝑒𝑟𝑐𝑒𝑛𝑡𝑎𝑔𝑒 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑑 ∕ 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠) × 100. (17)

Precision: Precision is the ratio of the number of test samples correctly classified into the 𝑖th class to the total number of test samples classified into the 𝑖th class. The precision 𝑝𝑖 of the 𝑖th class and the overall precision are defined respectively as:

𝑝𝑖 = 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑑 𝑖𝑛𝑡𝑜 𝑖𝑡ℎ 𝑐𝑙𝑎𝑠𝑠 ∕ 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑑 𝑖𝑛𝑡𝑜 𝑖𝑡ℎ 𝑐𝑙𝑎𝑠𝑠, (18)

𝑂𝑣𝑒𝑟𝑎𝑙𝑙 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = (1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑝𝑖, (19)

where 𝑁 is the number of classes.

Recall: Recall is the ratio of the number of test samples correctly classified into the 𝑖th class to the total number of test samples that are truly present in the 𝑖th class. The recall 𝑟𝑖 of the 𝑖th class and the overall recall are defined respectively as:

𝑟𝑖 = 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑑 𝑖𝑛𝑡𝑜 𝑖𝑡ℎ 𝑐𝑙𝑎𝑠𝑠 ∕ 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡𝑒𝑠𝑡 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑡ℎ𝑎𝑡 𝑎𝑟𝑒 𝑡𝑟𝑢𝑙𝑦 𝑝𝑟𝑒𝑠𝑒𝑛𝑡 𝑖𝑛 𝑖𝑡ℎ 𝑐𝑙𝑎𝑠𝑠, (20)

𝑂𝑣𝑒𝑟𝑎𝑙𝑙 𝑟𝑒𝑐𝑎𝑙𝑙 = (1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑟𝑖, (21)

where 𝑁 is the number of classes.

Macro averaged 𝐹1 measure: The macro averaged 𝐹1 measure is derived from precision and recall. (𝐹1)𝑖, the harmonic mean between the precision and recall of the 𝑖th class, is defined as:

(𝐹1)𝑖 = (2 × 𝑝𝑖 × 𝑟𝑖) ∕ (𝑝𝑖 + 𝑟𝑖). (22)

The macro averaged 𝐹1 measure is computed by first computing the 𝐹1 scores for each class and then averaging these per-class scores to compute the global mean:

𝑀𝑎𝑐𝑟𝑜 𝑎𝑣𝑒𝑟𝑎𝑔𝑒𝑑 𝐹1 = (1∕𝑁) Σ_{𝑖=1}^{𝑁} (𝐹1)𝑖, (23)

where 𝑁 is the number of classes. The macro averaged 𝐹1 gives equal weight to each class. Its value lies between 0 and 1; the closer the value is to 1, the better the classification.

Micro averaged 𝐹1 measure: The micro averaged 𝐹1 measure is computed by first creating a global contingency table whose cell values are the sums of the corresponding cells in the per-class contingency tables, and then using this global contingency table to compute the micro averaged performance scores. The micro averaged 𝐹1 gives equal weight to each sample. It is defined as:

𝑀𝑖𝑐𝑟𝑜 𝑎𝑣𝑒𝑟𝑎𝑔𝑒𝑑 𝐹1 = (2 × (1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑝𝑖 × (1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑟𝑖) ∕ ((1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑝𝑖 + (1∕𝑁) Σ_{𝑖=1}^{𝑁} 𝑟𝑖), (24)

where 𝑁 is the number of classes. The value of the micro averaged 𝐹1 lies between 0 and 1; the closer the value is to 1, the better the classification.

Cohen's kappa: Cohen (1960) developed the kappa coefficient to measure the classification accuracy. The comparison between the pre-defined class label information and the newly generated class label information gives a confusion matrix 𝑀 of dimension (𝑁 × 𝑁) (where 𝑁 is the number of classes). Cohen's kappa coefficient is defined as:

𝑘𝑎𝑝𝑝𝑎 = (𝜂 Σ_{𝑖=1}^{𝑁} 𝑀𝑖𝑖 − Σ_{𝑖=1}^{𝑁} 𝑀𝑖+ 𝑀+𝑖) ∕ (𝜂² − Σ_{𝑖=1}^{𝑁} 𝑀𝑖+ 𝑀+𝑖), (25)

where 𝜂 is the number of test samples, 𝑀+𝑖 is the sum of the elements in the 𝑖th column of 𝑀 and 𝑀𝑖+ is the sum of the elements in the 𝑖th row of 𝑀. The value of the kappa coefficient lies in the range [−1, +1]; the closer the value is to +1, the better the classification.
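As an illustration (ours, assuming the confusion matrix 𝑀 stores true classes in rows and predicted classes in columns, and that every class receives at least one prediction), all the measures of this section can be computed from 𝑀:

```python
import numpy as np

def validity_measures(M):
    """Compute the measures of Section 4.3 from an N x N confusion matrix M,
    where M[i, j] counts test samples of true class i classified into class j."""
    M = np.asarray(M, dtype=float)
    eta = M.sum()                                        # number of test samples
    accuracy = 100.0 * np.trace(M) / eta                 # Eq. (17)
    p = np.diag(M) / M.sum(axis=0)                       # Eq. (18): per-class precision
    r = np.diag(M) / M.sum(axis=1)                       # Eq. (20): per-class recall
    overall_p, overall_r = p.mean(), r.mean()            # Eqs. (19) and (21)
    macro_f1 = np.mean(2 * p * r / (p + r))              # Eqs. (22)-(23)
    micro_f1 = 2 * overall_p * overall_r / (overall_p + overall_r)   # Eq. (24)
    chance = np.sum(M.sum(axis=1) * M.sum(axis=0))
    kappa = (eta * np.trace(M) - chance) / (eta ** 2 - chance)       # Eq. (25)
    return accuracy, overall_p, overall_r, macro_f1, micro_f1, kappa

# Example on a small 2-class confusion matrix.
print(validity_measures([[40, 2], [3, 17]]))
```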


Table 2
Summary of the experimental setup (in terms of number of training samples, number of samples added in the pool during
active learning phase and ensemble-based active learning phase and number of test data) in the investigations performed by
the different methods viz., FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO, EnFRNN and proposed EnALFRNN on eight
microarray gene expression datasets.
Datasets | Methods | Number of training samples | Number of samples in pool | Number of test data
FRNN 4 – 58
ALFKNN 4 16 42
ALRFC 4 18 40
Colon cancer Bagging IBK 4 – 58
Bagging SMO 4 – 58
EnFRNN 4 – 58
Proposed EnALFRNN 4 6 52
FRNN 4 – 98
ALFKNN 4 10 88
ALRFC 4 11 87
Prostate cancer Bagging IBK 4 – 98
Bagging SMO 4 – 98
EnFRNN 4 – 98
Proposed EnALFRNN 4 7 91
FRNN 8 – 55
ALFKNN 8 12 43
ALRFC 8 12 43
SRBCT Bagging IBK 8 – 55
Bagging SMO 8 – 55
EnFRNN 8 – 55
Proposed EnALFRNN 8 11 44
FRNN 4 – 249
ALFKNN 4 18 231
ALRFC 4 20 229
Ovarian cancer Bagging IBK 4 – 249
Bagging SMO 4 – 249
EnFRNN 4 – 249
Proposed EnALFRNN 4 8 241
FRNN 4 – 73
ALFKNN 4 15 58
ALRFC 4 12 61
DLBCL Bagging IBK 4 – 73
Bagging SMO 4 – 73
EnFRNN 4 – 73
Proposed EnALFRNN 4 15 58
FRNN 4 – 56
ALFKNN 4 12 44
Central ALRFC 4 16 40
nervous Bagging IBK 4 – 56
system Bagging SMO 4 – 56
EnFRNN 4 – 56
Proposed EnALFRNN 4 18 38
FRNN 10 – 193
ALFKNN 10 15 178
ALRFC 10 18 175
Lung cancer Bagging IBK 10 – 193
Bagging SMO 10 – 193
EnFRNN 10 – 193
Proposed EnALFRNN 10 18 175
FRNN 4 – 68
ALFKNN 4 10 58
ALRFC 4 11 57
Leukemia Bagging IBK 4 – 68
Bagging SMO 4 – 68
EnFRNN 4 – 68
Proposed EnALFRNN 4 14 54

4.4. Experimental setup

In this work, we report the average results of 10 simulation runs on eight microarray gene expression datasets achieved by the seven methods. The FRNN, ALFKNN, ALRFC, EnFRNN and proposed EnALFRNN methods are implemented in MATLAB and executed on a Windows based computer with 4 GB main memory and a 2.40 GHz processor, whereas the Bagging IBK and Bagging SMO methods are executed on the same machine using the WEKA (Weka 3.8) package (publicly available at http://www.cs.waikato.ac.nz/ml/weka/). The parameter values of the proposed method are empirically adjusted to obtain the optimum results. Similarly, the parameter values of the other methods are also empirically determined for optimum performance.

Table 2 summarizes the experimental setup, describing the number of training samples used, the number of samples added to the pool during the active learning phase (for ALFKNN and ALRFC), the number of samples added to the pool during the ensemble-based active learning phase (for the proposed EnALFRNN method), and the number of test data used in the investigations for the different microarray gene expression datasets.


Table 3
Summary of the average experimental results (in terms of accuracy, precision, recall, Macro 𝐹1 , Micro 𝐹1 and kappa) of 10 runs achieved
by various methods namely, FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO, EnFRNN and the proposed EnALFRNN performed on eight
microarray gene expression datasets viz., Colon cancer, Prostate cancer, SRBCT, Ovarian cancer, DLBCL, Central nervous system, Lung cancer
and Leukemia.
Datasets | Methods | Accuracy (%) | Overall precision | Overall recall | Macro 𝐹1 | Micro 𝐹1 | Kappa
FRNN 90.86 ± 4.74 0.9078 0.9128 0.9006 0.9098 0.8040
ALFKNN 88.34 ± 5.66 0.8836 0.8828 0.8729 0.8827 0.7496
ALRFC 90.95 ± 1.36 0.9083 0.9133 0.9029 0.9105 0.8070
Colon Bagging IBK 66.54 ± 5.45 0.6790 0.5922 0.5697 0.6291 0.2006
cancer Bagging SMO 55.58 ± 2.30 0.6273 0.6276 0.5551 0.6274 0.2018
EnFRNN 96.85 ± 2.48 0.9667 0.9661 0.9643 0.9662 0.9288
EnALFRNN 99.81 ± 0.60 0.9972 0.9986 0.9979 0.9979 0.9957
FRNN 86.12 ± 7.96 0.8613 0.8738 0.8594 0.8675 0.7224
ALFKNN 73.77 ± 7.60 0.7338 0.7851 0.7176 0.7577 0.4662
ALRFC 71.62 ± 6.33 0.7183 0.7240 0.7146 0.7211 0.4347
Prostate Bagging IBK 72.86 ± 4.72 0.7505 0.7310 0.7230 0.7405 0.4590
cancer Bagging SMO 84.40 ± 4.10 0.8544 0.8453 0.8429 0.8497 0.6887
EnFRNN 90.64 ± 3.84 0.9070 0.9150 0.9058 0.9110 0.8130
EnALFRNN 99.98 ± 0.05 0.9978 0.9978 0.9978 0.9978 0.9955
FRNN 83.09 ± 5.56 0.8586 0.8197 0.8129 0.8386 0.7677
ALFKNN 89.62 ± 10.06 0.9227 0.8930 0.8899 0.9075 0.8589
ALRFC 98.62 ± 2.14 0.9900 0.9880 0.9886 0.9890 0.9806
SRBCT Bagging IBK 73.18 ± 4.64 0.7457 0.7732 0.7304 0.7589 0.6287
Bagging SMO 86.36 ± 6.94 0.8554 0.8873 0.8561 0.8692 0.8061
EnFRNN 89.15 ± 5.16 0.9156 0.8552 0.8648 0.8841 0.8479
EnALFRNN 99.55 ± 0.96 0.9969 0.9944 0.9954 0.9957 0.9935
FRNN 90.76 ± 7.04 0.9149 0.9145 0.9027 0.9144 0.8101
ALFKNN 94.05 ± 4.30 0.9317 0.9407 0.9355 0.9361 0.8712
ALRFC 79.68 ± 6.70 0.7789 0.7851 0.7786 0.7819 0.5591
Ovarian Bagging IBK 72.54 ± 4.67 0.7488 0.6842 0.6783 0.7126 0.3812
cancer Bagging SMO 91.79 ± 0.22 0.9156 0.9045 0.9094 0.9100 0.8189
EnFRNN 95.27 ± 2.52 0.9556 0.9486 0.9489 0.9519 0.8983
EnALFRNN 98.96 ± 0.53 0.9894 0.9882 0.9887 0.9888 0.9774
FRNN 68.80 ± 9.78 0.6994 0.6729 0.6339 0.6855 0.3200
ALFKNN 79.01 ± 6.97 0.8264 0.7765 0.7588 0.7994 0.5420
ALRFC 87.26 ± 6.26 0.7992 0.8478 0.7857 0.8170 0.5868
DLBCL Bagging IBK 76.07 ± 5.14 0.6730 0.6975 0.6741 0.6843 0.3581
Bagging SMO 82.62 ± 5.47 0.7664 0.7024 0.7017 0.7285 0.4338
EnFRNN 74.35 ± 7.67 0.7615 0.7017 0.6936 0.7299 0.4156
EnALFRNN 94.46 ± 7.31 0.9581 0.9366 0.9401 0.9469 0.8825
FRNN 51.24 ± 7.45 0.5513 0.5569 0.4983 0.5540 0.0897
ALFKNN 52.90 ± 5.10 0.5490 0.5476 0.5103 0.5483 0.0828
ALRFC 64.94 ± 5.56 0.6002 0.6218 0.5981 0.6102 0.2079
Central Bagging IBK 59.50 ± 4.22 0.5815 0.5812 0.5708 0.5813 0.1557
nervous Bagging SMO 57.50 ± 2.63 0.5000 0.5004 0.4848 0.5000 0.0039
system EnFRNN 53.65 ± 7.44 0.5030 0.5034 0.4864 0.5030 0.0476
EnALFRNN 69.47 ± 9.55 0.6150 0.6262 0.6074 0.6203 0.2325
FRNN 68.94 ± 7.46 0.8300 0.6364 0.6613 0.7195 0.5145
ALFKNN 87.61 ± 2.78 0.8687 0.7714 0.7957 0.8158 0.7634
ALRFC 92.27 ± 2.88 0.8971 0.8838 0.8846 0.8956 0.8424
Lung Bagging IBK 82.48 ± 4.96 0.7668 0.5904 0.6330 0.6634 0.5836
cancer Bagging SMO 82.95 ± 4.34 0.5586 0.5136 0.5216 0.5332 0.5722
EnFRNN 77.66 ± 6.02 0.8646 0.6240 0.6572 0.7244 0.5556
EnALFRNN 91.50 ± 1.74 0.8947 0.8469 0.8495 0.8691 0.8316
FRNN 81.76 ± 11.95 0.8408 0.8356 0.8106 0.8381 0.6425
ALFKNN 93.62 ± 2.48 0.9338 0.9291 0.9287 0.9312 0.8577
ALRFC 98.88 ± 0.97 0.9844 0.9906 0.9872 0.9875 0.9744
Leukemia Bagging IBK 90.00 ± 3.88 0.9001 0.8824 0.8863 0.8908 0.7739
Bagging SMO 91.38 ± 4.67 0.9400 0.8785 0.8960 0.9076 0.7956
EnFRNN 88.28 ± 7.04 0.8933 0.8739 0.8741 0.8835 0.7533
EnALFRNN 97.72 ± 1.24 0.9748 0.9741 0.9739 0.9744 0.9468

All the methods (viz., FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO, EnFRNN and the proposed EnALFRNN) are initially started with the same number of training samples for an impartial comparison. However, depending upon the number of classes and the total number of samples present in the datasets, the number of training samples differs across datasets. The initial training samples are chosen so as to ensure that at least two samples come from each class. Once the active learning phase and the ensemble-based active learning phase have converged, the number of informative samples added to the pool as training patterns varies across methods.

5. Results and discussions

Table 3 summarizes the average experimental results of 10 simulations on eight gene expression datasets in terms of percentage accuracy, precision, recall, macro 𝐹1, micro 𝐹1 and kappa achieved by the proposed and compared methods. Bold font values represent the best results obtained, and the standard deviations of the accuracies over 10 simulations are shown using the ± sign next to each percentage accuracy in Table 3.


Fig. 2. Boxplots of percentage accuracies obtained using different methods viz., FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO, EnFRNN and proposed EnAFRNN performed
on: (a) Colon cancer, (b) Prostate cancer, (c) SRBCT, (d) Ovarian cancer, (e) DLBCL, (f) Central nervous system, (g) Lung cancer and (h) Leukemia microarray gene expression
datasets.

It can be seen from the summarized experimental results (shown in Table 3) that for the Colon cancer dataset the proposed EnALFRNN method performed exceedingly well compared to the other methods in terms of all the validity measures (namely accuracy, precision, recall, macro 𝐹1, micro 𝐹1 and kappa). The improvements in accuracy achieved by the proposed method for the Colon cancer dataset are 8.95%, 11.47%, 8.86%, 33.27%, 44.23% and 2.96% with respect to FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO and EnFRNN respectively.


Table 4
Summary of paired 𝑡-test results performed on accuracies obtained by the proposed EnAFRNN versus other methods (FRNN, ALFKNN, ALRFC,
Bagging IBK, Bagging SMO and EnFRNN) in terms of 𝑝-score for various microarray gene expression datasets.
Datasets | EnAFRNN vs. FRNN | EnAFRNN vs. ALFKNN | EnAFRNN vs. ALRFC | EnAFRNN vs. Bagging IBK | EnAFRNN vs. Bagging SMO | EnAFRNN vs. EnFRNN
Colon cancer 𝟏.𝟎𝟔 × 𝟏𝟎−𝟒 ↑ 𝟏.𝟒𝟐 × 𝟏𝟎−𝟒 ↑ 𝟑.𝟏𝟕 × 𝟏𝟎−𝟗 ↑ 𝟗.𝟓𝟕 × 𝟏𝟎−𝟗 ↑ 𝟔.𝟑𝟕 × 𝟏𝟎−𝟏𝟑 ↑ .𝟎𝟎𝟔𝟒 ↑
Prostate cancer 𝟑.𝟖𝟖 × 𝟏𝟎−𝟒 ↑ 𝟏.𝟕𝟐 × 𝟏𝟎−𝟔 ↑ 𝟏.𝟖𝟗 × 𝟏𝟎−𝟕 ↑ 𝟐.𝟎𝟒 × 𝟏𝟎−𝟖 ↑ 𝟕.𝟔𝟓 × 𝟏𝟎−𝟕 ↑ 𝟑.𝟐𝟖 × 𝟏𝟎−𝟓 ↑
SRBCT 𝟒.𝟓𝟓 × 𝟏𝟎−𝟔 ↑ 𝟎.𝟎𝟎𝟗𝟗 ↑ 0.2940 𝟑.𝟐𝟏 × 𝟏𝟎−𝟖 ↑ 𝟑.𝟎𝟎 × 𝟏𝟎−𝟒 ↑ 𝟐.𝟖𝟗 × 𝟏𝟎−𝟒 ↑
Ovarian cancer 𝟎.𝟎𝟎𝟓𝟒 ↑ 𝟎.𝟎𝟎𝟖𝟖 ↑ 𝟔.𝟕𝟏 × 𝟏𝟎−𝟔 ↑ 𝟐.𝟎𝟔 × 𝟏𝟎−𝟖 ↑ 𝟕.𝟐𝟐 × 𝟏𝟎−𝟔 ↑ 𝟎.𝟎𝟎𝟏𝟑 ↑
DLBCL 𝟏.𝟒𝟎 × 𝟏𝟎−𝟒 ↑ 𝟎.𝟎𝟎𝟐𝟖 ↑ 𝟎.𝟎𝟏𝟗𝟐 ↑ 𝟕.𝟔𝟖 × 𝟏𝟎−𝟓 ↑ 𝟐.𝟔𝟖 × 𝟏𝟎−𝟒 ↑ 𝟒.𝟐𝟔 × 𝟏𝟎−𝟒 ↑
CNS 𝟎.𝟎𝟎𝟏𝟕 ↑ 𝟕.𝟑𝟗 × 𝟏𝟎−𝟒 ↑ 0.1297 𝟎.𝟎𝟏𝟗𝟑 ↑ 𝟎.𝟎𝟎𝟒𝟒 ↑ 𝟎.𝟎𝟎𝟐𝟔 ↑
Lung cancer 𝟏.𝟕𝟏 × 𝟏𝟎−𝟔 ↑ 𝟎.𝟎𝟎𝟕𝟔 ↑ 0.5141 𝟑.𝟎𝟓 × 𝟏𝟎−𝟒 ↑ 𝟏.𝟑𝟎 × 𝟏𝟎−𝟒 ↑ 𝟒.𝟗𝟖 × 𝟏𝟎−𝟔 ↑
Leukemia 𝟎.𝟎𝟎𝟑 ↑ 𝟎.𝟎𝟎𝟏𝟑 ↑ 𝟎.𝟎𝟏𝟕𝟏 ↓ 𝟒.𝟒𝟓 × 𝟏𝟎−𝟒 ↑ 𝟎.𝟎𝟎𝟐𝟖 ↑ 𝟎.𝟎𝟎𝟐𝟖 ↑

Similarly, for the Prostate cancer dataset, the proposed method dominates all the other methods, with improvements in accuracy of 13.86%, 26.21%, 28.36%, 27.12%, 15.58% and 9.34% compared respectively to the FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO and EnFRNN methods. In the same dataset, improvements are also observed in terms of precision, recall, macro 𝐹1, micro 𝐹1 and kappa obtained by the proposed EnALFRNN method. Likewise, the simulation results also suggest that the proposed EnALFRNN method outperformed the FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO and EnFRNN methods in terms of all the validity measures for the SRBCT, Ovarian cancer, DLBCL and Central nervous system datasets. However, the ALRFC method dominates all the other methods for the Lung cancer and Leukemia datasets.

In summary, out of 8 datasets, the proposed method achieved the best results for 6 datasets (viz., Colon cancer, Prostate cancer, SRBCT, Ovarian cancer, DLBCL and Central nervous system). Only in two datasets did the ALRFC method perform marginally better than EnALFRNN. Therefore, the experimental outcomes suggest that the proposed EnALFRNN method is more effective for cancer sample classification from gene expression datasets than the other counterpart techniques.

5.1. Boxplots results

Boxplots (Tukey, 1977) of the percentage accuracies of 10 simulations achieved by the various methods on the 8 gene expression datasets are shown in Fig. 2. It is observed from Fig. 2 that for most of the datasets (viz., Colon cancer, Prostate cancer, SRBCT, Ovarian cancer and Leukemia) the boxplots are relatively more compact (Fig. 2(a), (b), (c), (d) and (h)) for the proposed method in comparison to the other methods, which indicates that the percentage accuracies achieved by the proposed method over 10 simulations are quite similar, with smaller standard deviations than those obtained by the other compared methods. Also, the median values of the accuracies produced by the proposed method are higher than those achieved by the counterpart methods, particularly for the Colon cancer, Prostate cancer, SRBCT and Ovarian cancer datasets, suggesting that the accuracies achieved by the proposed EnALFRNN method are better than those of the other compared methods for most of the datasets.

5.2. Statistical significance test

The percentage accuracies achieved by the proposed EnAFRNN method versus the other methods (FRNN, ALFKNN, ALRFC, Bagging IBK, Bagging SMO and EnFRNN) are statistically validated using the paired 𝑡-test (Kreyszig, 1970) at the 5% level of significance. In Table 4, statistically significant results, for which the 𝑝-score values are less than or equal to 0.05 (at the 5% significance level), are marked in bold font; an up-arrow (↑) indicates a significant improvement of the proposed EnAFRNN method over the compared method, whereas a down-arrow (↓) indicates a significant improvement of the other method compared to the proposed EnAFRNN method. In brief, out of the total of 48 paired 𝑡-tests performed for the proposed EnAFRNN versus the other methods, the proposed method was found to produce statistically significant improvements in accuracy in comparison to the other methods in 44 cases (shown by ↑), whereas in 3 cases there was no statistically significant difference in performance between EnAFRNN and the other methods. In only 1 case did the ALRFC method perform significantly better than the proposed method (shown by ↓) in terms of the obtained accuracy.

6. Conclusions

Gene expression datasets comprise a limited number of samples compared to the number of genes. However, basic classification techniques require a 'sufficient' number of samples to achieve the desired classification accuracy, and usually a single classifier is not adequate to take an accurate decision, which yields low prediction accuracy. To address these problems, an ensemble-based active learning technique is proposed: the ensemble learning technique combines the decisions of multiple base classifiers, and the active learning method selects only a few most informative unlabeled samples (with the help of ensemble learning) to get their labels from the subject experts, which in turn are augmented with the limited training samples to improve the classification accuracy. In this work, an ensemble-based active learning using fuzzy-rough nearest neighbor (EnAFRNN) classifier is proposed for cancer sample classification from gene expression datasets. Here, the ensemble technique is adopted to combine multiple learners to enhance the overall classification accuracy, and the active learning technique handles the problem of the small number of training samples, whereas fuzzy set theory handles the imprecise and vague data and the rough set concept deals with the uncertainty, indiscernibility and incompleteness of the different cancer classes often present in microarray gene expression datasets.

The performance of the proposed method is assessed using eight publicly available benchmark gene expression datasets in terms of six different validity measures. The proposed method is compared with three other ensemble learning techniques, viz. Bagging IBK, Bagging SMO and EnFRNN; one supervised method, FRNN; and two active learning algorithms, viz. ALFKNN and ALRFC. Experimental results in terms of accuracy, precision, recall, macro 𝐹1, micro 𝐹1 and kappa obtained by the proposed method turned out to be significantly better than those achieved by the compared state-of-the-art methods. The paired 𝑡-test results justify the statistical significance of the better results in favor of the proposed method for cancer classification in comparison to the other methods. The main focus of the article is to classify cancer from gene expression datasets, and as the feature values of these datasets are numerical in nature, this article in its present form can handle numerical feature values only. However, the EnAFRNN method can be suitably modified in future to cope also with categorical feature values. The proposed EnALFRNN method can also be applied in future to other kinds of gene expression datasets such as microRNA.


References

Aha, D.W., Kibler, D., Albert, M.K., 1991. Instance-based learning algorithms. Mach. Learn. 6, 37–66.
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In: Proceedings of the National Academy of Sciences, Vol. 96. pp. 6745–6750.
Beluch, W.H., Genewein, T., Nürnberger, A., Köhler, J.M., 2018. The power of ensembles for active learning in image classification. In: The IEEE Conference on Computer Vision and Pattern Recognition. pp. 9368–9377.
Bishop, C., 2006. Pattern Recognition and Machine Learning, first ed. Springer-Verlag, New York.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24 (2), 123–140.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46.
Dettling, M., 2004. BagBoosting for tumor classification with gene expression data. Bioinformatics 20 (18), 3583–3593.
Dettling, M., Bühlmann, P., 2002. Supervised clustering of genes. Genome Biol. 3 (12), 1–15.
Dettling, M., Bühlmann, P., 2003. Boosting for tumor classification with gene expression data. Bioinformatics 19 (9), 1061–1069.
Doan, D., Wang, Y., Pan, Y., 2011. Utilization of gene ontology in semi-supervised clustering. In: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE Computer Society Press, pp. 1–7.
Du, D., Li, K., Li, X., Fei, M., 2014. A novel forward gene selection algorithm for microarray data. Neurocomputing 133, 446–458.
Duda, R., Hart, P., Stork, D., 2000. Pattern Classification, second ed. Wiley, New York, USA.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531–537.
Halder, A., Dey, S., Kumar, A., 2015. Active learning using fuzzy k-NN for cancer classification from microarray gene expression data. In: Bora, P., Prasanna, S., Sarma, K., Saikia, N. (Eds.), Advances in Communication and Computing, Vol. 347. Springer India, pp. 103–113.
Halder, A., Ghosh, S., Ghosh, A., 2013. Aggregation pheromone metaphor for semi-supervised classification. Pattern Recognit. 46 (8), 2239–2248.
Halder, A., Kumar, A., 2019. Active learning using rough fuzzy classifier for cancer prediction from microarray gene expression data. J. Biomed. Inform. 92, 103136.
Halder, A., Misra, S., 2014. Semi-supervised fuzzy k-NN for cancer classification from microarray gene expression data. In: Proceedings of the 1st International Conference on Automation, Control, Energy and Systems (ACES 2014). IEEE Computer Society Press, pp. 1–5.
Jensen, R., Cornelis, C., 2011. Fuzzy-rough nearest neighbour classification and prediction. Theoret. Comput. Sci. 412 (42), 5871–5884.
Jiang, D., Tang, C., Zhang, A., 2004. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 16 (11), 1370–1386.
Keller, J., Gray, M., Givens, J., 1985. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 15 (4), 580–585.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., Meltzer, P., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med. 7 (6), 673–679.
Kreyszig, E., 1970. Introductory Mathematical Statistics, first ed. John Wiley.
Kumar, A., Halder, A., 2019a. Active learning using fuzzy-rough nearest neighbor classifier for cancer prediction from microarray gene expression data. Int. J. Pattern Recognit. Artif. Intell. 34 (1), 2057001 (28 pages).
Kumar, A., Halder, A., 2019b. Ensemble based fuzzy-rough nearest neighbor approach for classification of cancer from microarray data. Int. J. Res. Adv. Technol. 7 (5), 105–110.
Kuncheva, L., 2004. Combining Pattern Classifiers: Methods and Algorithms, second ed. Wiley-Interscience.
Liu, Y., 2004. Active learning with support vector machine applied to gene expression data for cancer classification. J. Chem. Inf. Comput. Sci. 44 (6), 1936–1941.
Lu, Y., Han, J., 2003. Cancer classification using gene expression data. Inf. Syst. (Special Issue: Data Management in Bioinformatics) 28 (4), 243–268.
Maji, P., Pal, S., 2007. RFCM: a hybrid clustering algorithm using rough and fuzzy sets. Fund. Inform. 80 (4), 475–496.
Maji, P., Pal, S.K., 2012. Rough-Fuzzy Pattern Recognition, first ed. John Wiley & Sons, New Jersey.
Maniruzzaman, M., Rahman, M., Ahammed, B., Abedin, M., Suri, H., Biswas, M., El-Baz, A., Bangeas, P., Tsoulfas, G., Suri, J., 2019. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Comput. Methods Programs Biomed. 176, 173–193.
Maroulis, D., Flaounas, I., Iakovidis, D., Karkanis, S., 2006. Microarray-MD: A system for exploratory analysis of microarray gene expression data. Comput. Methods Programs Biomed. 83 (2), 157–167.
Maulik, U., Chakraborty, D., 2014. Fuzzy preference based feature selection and semisupervised SVM for cancer classification. IEEE Trans. NanoBiosci. 13 (2), 1146–1156.
Osareh, A., Shadgar, B., 2013. An efficient ensemble learning method for gene microarray classification. BioMed Res. Int. 2013 (1), 1–10.
Pawlak, Z., 1982. Rough sets. Int. J. Comput. Inf. Sci. 11 (5), 341–356.
Pawlak, Z., 1991. Rough Sets. In: Theory and Decision Library, Vol. 9. Springer, Netherlands.
Platt, J.C., 1998. Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf, B., Burges, C.J.C., Smola, A.J. (Eds.), Advances in Kernel Methods - Support Vector Learning. The MIT Press, USA, pp. 185–208.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6 (3), 21–45.
Priscilla, R., Swamynathan, S., 2013. A semi-supervised hierarchical approach: two-dimensional clustering of microarray gene expression data. Front. Comput. Sci. 7 (2), 204–213.
Radzikowska, A.M., Kerre, E.E., 2002. A comparative study of fuzzy rough sets. Fuzzy Sets Syst. 126 (2), 137–155.
Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J., 2007. Classification of microarray data using gene networks. BMC Bioinformatics 8 (1), 35.
Schnitzer, S., Schmidt, S., Rensing, C., Harriehausen-Muhlbauer, B., 2014. Combining active and ensemble learning for efficient classification of web documents. Polibits 45, 39–46.
Schumacher, J., Sakič, D., Grumpe, A., Fink, G.A., Wöhler, C., 2012. Active learning of ensemble classifiers for gesture recognition. In: Pattern Recognition. Springer Berlin Heidelberg, pp. 498–507.
Settles, B., 2010. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin, Madison.
Shi, M., Zhang, B., 2011. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics 27 (21), 3017–3023.
Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2), 203–209.
Stekel, D., 2003. Microarray Bioinformatics, first ed. Cambridge University Press, Cambridge, UK.
Sturn, A., Quackenbush, J., Trajanoski, Z., 2002. Genesis: cluster analysis of microarray data. Bioinformatics 18 (1), 207–208.
Tan, A.C., Gilbert, D., 2003. Ensemble machine learning on gene expression data for cancer classification. Appl. Bioinform. 2 (3 Suppl.), S75–S83.
Tan, P., Tan, S., Lim, C., Khor, S., 2011. A modified two-stage SVM-RFE model for cancer classification using microarray data. In: Lu, B., Zhang, L., Kwok, J. (Eds.), Neural Information Processing. Lecture Notes in Computer Science, Vol. 7062. Springer, Berlin, Heidelberg, pp. 668–675.
Technology Agency for Science and Research. Kent Ridge bio-medical dataset repository. http://datam.i2r.astar.edu.sg/datasets/krbd/index.html.
Tukey, J.W., 1977. Exploratory Data Analysis. Behavioral Science: Quantitative Methods. Addison-Wesley, Reading, Mass.
Vogiatzis, D., Tsapatsoulis, N., 2008. Active learning for microarray data. Internat. J. Approx. Reason. 47 (1), 85–96.
Wang, Y., Pan, Y., 2014. Semi-supervised consensus clustering for gene expression data analysis. BioData Min. 7 (1), 7.
Xiao, F., 2019a. A distance measure for intuitionistic fuzzy sets and its application to pattern classification problems. IEEE Trans. Syst. Man Cybern.: Syst. 1–13.
Xiao, F., 2019b. EFMCDM: Evidential fuzzy multicriteria decision making based on belief entropy. IEEE Trans. Fuzzy Syst. 1–15.
Xiao, F., Zhang, Z., Abawajy, J., 2019. Workflow scheduling in distributed systems under fuzzy environment. J. Intell. Fuzzy Systems 37 (4), 5323–5333.
Xiao, Y., Wu, J., Lin, Z., Zhao, X., 2018. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. 153, 1–9.
Yang, P., Yang, Y., Zhou, B., Zomaya, A., 2010. A review of ensemble methods in bioinformatics. Curr. Bioinform. 5 (4), 296–308.
Zadeh, L., 1965. Fuzzy sets. Inf. Control 8 (3), 338–353.