
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2019.2914629, IEEE
Transactions on Fuzzy Systems
TFS-2018-0762.R2 1

Epistasis Analysis Using an Improved Fuzzy C-means-based Entropy Approach

Cheng-Hong Yang, Senior Member, IEEE, Li-Yeh Chuang, and Yu-Da Lin, Member, IEEE

Abstract: Epistasis detection is vital to determining disease susceptibility in the human genome. With rapid advances in technology, multifactor dimensionality reduction (MDR) has become an effective algorithm for epistasis detection. Classification of high-risk (H) and low-risk (L) groups in MDR operations is a key topic but has not been thoroughly investigated. In this paper, we propose an improved fuzzy c-means-based entropy (FCME) approach to address the limitations of binary classification. For this approach, the degree of membership in MDR, referred to as FCMEMDR, was used. The FCME approach and MDR measure were integrated to enable more precise differentiation between similar frequencies of multifactor genotypes in cases of possible epistasis. We used the MDR measures of correct classification rate and likelihood ratio. Numerous simulated datasets were applied, and the experimental results revealed two measures of FCMEMDR with higher detection rates than those of other MDR-based algorithms. Our analysis of binary and fuzzy classifications in MDR operations may offer insights into the problem of uncertainty in H/L classification. Two measures of FCMEMDR detected significant instances of epistasis associated with coronary artery disease in the Wellcome Trust Case Control Consortium dataset. FCMEMDR is freely available at https://gitlab.com/yudalinemail/fcmemdr.

Index Terms: Fuzzy c-means, classification system, multifactor dimensionality reduction, epistasis

Manuscript received XXX. This work was supported by the Ministry of Science and Technology, R.O.C. (under Grant no. 106-2221-E-992-327-MY2 and 107-2811-E-992-500-.
Cheng-Hong Yang is with the Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, and is also with the Ph.D. Program in Biomedical Engineering, Kaohsiung Medical University, Kaohsiung, Taiwan. E-mail: chyang@kuas.edu.tw
Li-Yeh Chuang is with the Department of Chemical Engineering & Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, Corresponding author. E-mail: chuang@isu.edu.tw
Yu-Da Lin is with the Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, Corresponding author. E-mail: e0955767257@yahoo.com.tw

1063-6706 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Genome-wide association studies (GWAS) have been widely applied to identify associations between various complex traits [1]. However, most GWAS have focused on single factors associated with diseases, such as single nucleotide polymorphisms (SNPs) or other DNA variations. Because of this concentration on single factors in GWAS, interactions between major factors that could be significantly associated with diseases may have been overlooked [2]. Gene–gene and gene–environment interactions (i.e., epistasis) are involved in missing heritability [3]. Epistasis represents a major factor in genetic susceptibility to many diseases [4, 5]. Thus, identification of gene variants or instances of epistasis may be critical in the biological analysis of many multifactorial diseases [6]. Statistical analysis of gene variant interactions and epistasis remains a challenge in biological computation [5]. Medical scientists can assess gene–gene and gene–environment interactions to obtain useful information regarding complex diseases; however, many complex diseases result from nonlinear interactions among numerous genetic and environmental variables. Development of an efficient computational approach and powerful tools to facilitate identification of epistasis is critical for the enhancement of genetic association studies [2].

Statistical genetic correlation analysis is generally based on nonparametric tests. In Bayesian epistasis association mapping (BEAM), the Markov chain Monte Carlo sampling strategy is used in a Bayesian marker partition model to maximize the model's posterior probability [7]. In SNPRuler, a rule utility measure based on chi-square statistics is used to assess the quality of epistasis, and the branch and bound algorithm is used to determine the maximum rule utility as the final result [8]. Boolean operation-based screening and testing (BOOST) was proposed for the analysis of all pairwise interactions in genome-wide SNP data [9]. AntEpiSeeker involves a two-stage ant colony optimization algorithm for detecting epistasis in a case–control design [10]. Multifactor dimensionality reduction (MDR) was proposed by Ritchie et al. in 2001 [11]. MDR is a powerful statistical tool for detecting and characterizing nonadditive interactions among discrete variables that influence a binary outcome. In contrast with conventional statistical methods such as logistic regression, MDR can be used to process nonparametric (i.e., no estimated parameters) and genetic model-free data (i.e., no assumed genetic model) in case–control studies.

The basis of the MDR method is a constructive induction or feature engineering algorithm that reduces the dimensionality of multilocus information by pooling multilocus classes into high-risk (H) and low-risk (L) groups to form a one-dimensional variable. This process of constructing a new

one-dimensional variable changes the spatial representation of


the data. The purpose of this process is to determine a means of
data representation that facilitates the detection of nonlinear or
nonadditive interactions among the original variables. In MDR,
k-fold cross-validation (CV) values are used to avoid
overfitting the training data. Recently, MDR has been
successfully applied to detect gene–gene interactions and
epistasis in genetic studies of common human diseases, such as
cardiovascular disease [12], breast cancer [13], and facial
emotion perception abnormality [14]. Numerous extension
methods based on MDR have been proposed [15], including
multiple-criteria decision analysis-based MDR [16], Balanced
MDR [17], and empirical fuzzy MDR (EFMDR) [18]. These
methods have been used to analyze the interactions between
disease-related genes in patients undergoing hemodialysis. Although
research on improving classification in MDR has increased, work
in this field remains limited.
The fuzzy c-means (FCM) approach was proposed by
Bezdek et al. [19]. FCM is an unsupervised technique that
assigns patterns to any cluster with a certain degree of fuzzy
membership. Currently, FCM is one of the most popular
clustering approaches in ongoing research in various areas.
FCM-based approaches have been adopted for many
applications, including feature analysis, target recognition, and
image segmentation [20]. For the epistasis detection problem,
fuzzy set theory has been applied to enable the use of MDR for
determining the membership of high-risk and low-risk groups,
increasing the detection success rates of gene–gene interactions.
Leem and Park introduced two fuzzy-based approaches
(FGMDR [21] and EFMDR [18]) to improve MDR. FGMDR
uses fuzzy-set-based generalized linear models to detect
epistasis while allowing for covariate adjustment. However,
when using FGMDR, it is not easy to choose appropriate tuning
parameter values. Therefore, EFMDR introduces an empirical
fuzzy approach that does not require the specification of tuning
parameter values. Furthermore, a faster version of EFMDR was
recently introduced [22].
In this study, we propose an improved fuzzy c-means-based entropy (FCME) MDR to estimate membership degrees from a dataset for epistasis detection. We used two MDR measures: correct classification rate (FCMEMDR-CCR) and likelihood ratio (FCMEMDR-LR). Test datasets associated with coronary artery disease were obtained from simulation studies and the Wellcome Trust Case Control Consortium (WTCCC). According to the results, FCMEMDR-CCR and FCMEMDR-LR provided higher detection rates than did other methods.

Fig. 1. Diagram of FCMEMDR-CCR and FCMEMDR-LR. The 5-fold CV is used to illustrate the steps of FCMEMDR-CCR and FCMEMDR-LR. The numbers in blocks and on the titles indicate the steps of FCMEMDR-CCR and FCMEMDR-LR. All steps are detailed in the section on the FCMEMDR process.

II. METHODS

A. Epistasis detection

For epistasis detection [8], let SNP S be a set involving three genotypes; S = {1, 2, 3}. Let s be a pair (i, v), where i is the index and v is the value of S. Let the multifactor class be the link between the genotype and the case and control groups. The case group comprises patients who had developed a particular disease, and the control group consists of people who did not develop that disease. Thus, epistasis containing conjunctive rules can be defined as $(r, \zeta): s_1 \cap s_2 \cap \dots \cap s_n \rightarrow \zeta$. This implies that the association between a conjunction of n literals $(s_1, s_2, \dots, s_n)$ and the label ζ (ζ = 0 for controls, and ζ = 1 for cases) is given by r. Therefore, m-order epistasis can be defined as the m-locus combinations. An m-locus combination contains $3^m$ multifactor classes that are denoted as $X = (n_1, n_2, \dots, n_I)$, where $I = 3^m$ and $n \in S_1 \cup S_2 \cup \dots \cup S_m$. In this study, we detected epistasis by achieving an optimal measurement for identifying m-locus combinations.
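The definitions above can be made concrete with a short Python sketch. It is an illustration only, assuming genotypes are coded 1, 2, 3 as in S and labels are 0 (control) or 1 (case); the function name `count_multifactor_classes` and the toy data are hypothetical and are not taken from the FCMEMDR implementation.

```python
from itertools import combinations, product
from collections import Counter


def count_multifactor_classes(genotypes, labels, loci):
    """Count cases and controls in each of the 3^m multifactor classes.

    genotypes: list of samples, each a sequence of genotype codes in {1, 2, 3}
    labels:    list of 0 (control) / 1 (case), one per sample
    loci:      tuple of m SNP indices forming one m-locus combination
    Returns {class: (n_cases, n_controls)} with every class present.
    """
    cases, controls = Counter(), Counter()
    for sample, label in zip(genotypes, labels):
        key = tuple(sample[i] for i in loci)
        (cases if label == 1 else controls)[key] += 1
    classes = product((1, 2, 3), repeat=len(loci))
    return {c: (cases[c], controls[c]) for c in classes}


# Toy data (hypothetical): 4 samples genotyped at 3 SNPs.
genotypes = [(1, 2, 3), (1, 2, 1), (3, 3, 2), (1, 2, 2)]
labels = [1, 0, 1, 1]

pairs = list(combinations(range(3), 2))  # all 2-locus combinations
table = count_multifactor_classes(genotypes, labels, (0, 1))
```

For m = 2 the returned dictionary has 3^2 = 9 entries, one per multifactor class, including classes that no sample falls into.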


B. MDR process

MDR was introduced to detect epistasis on the basis of the sample distributions of cases and controls under all predictive rules within m-locus combinations [11]. In MDR, the data reduction approach is employed to categorize a high dimension into high-risk and low-risk groups (a 2 × 2 contingency table). To avoid overfitting the training data, k-fold CV is used in MDR to generate k-fold CV models, and the k-fold CV testing data is used to evaluate the corresponding training model. Motsinger et al. provided an explanation for the usage of cross-validation to avoid overfitting [23]. MDR then uses CV consistency (CVC) to select the optimal model among the k-fold CV models. The MDR pseudocode is presented in Algorithm 1, and the MDR steps [11] are as follows: 1) conduct a k-fold CV operation; 2) evaluate all feasible m-locus combinations; 2.1) compute the ratio between cases and controls within each multifactor class of an m-locus combination; 2.2) categorize the multifactor classes into high-risk and low-risk groups; 2.3) evaluate the value of m-locus combinations; and 3) conduct the CVC operation.

Algorithm 1. Pseudocode of MDR.
    // stratified random k-fold
    randomly sort dataset
    divide data into k subsets
    for d = 1 to k  // cross-validation
        classify d-th subset as the testing data and the others as the training data
        M = {M1, M2, …, Mi}  // M is a set of available m-locus combinations
        for each M
            classify samples of d-th training data into multifactor genotype combinations under M
            for i = 1 to total number of multifactor classes
                determine the high/low risk group of the i-th multifactor class
                if high risk group
                    TP = TP + ni1*
                    FP = FP + ni0*
                else
                    FN = FN + ni1*
                    TN = TN + ni0*
                end if
            end i
            estimate M by four frequencies in a 2 × 2 contingency table
        end M
        choose the M with the best value as the d-th best model
    end d
    select best model M* according to cross-validation consistency
    estimate average value of M* using the k testing data

C. EFMDR process

EFMDR was introduced by Leem and Park in 2017 [18]. In EFMDR, an empirical fuzzy approach is applied to generate the membership degrees of multifactor classes in steps 2.1 and 2.2 of the aforementioned MDR process. Thus, each multifactor class of an m-locus combination can belong to the H and L groups simultaneously according to the membership degrees. The other steps are identical to those of MDR.

D. FCMEMDR process

In this study, we proposed FCMEMDR methods based on CCR (FCMEMDR-CCR) and LR (FCMEMDR-LR) measures. An FCME approach was proposed for transforming binary classification into fuzzy classification, thereby reducing the multifactor dimensionality to a 2 × 2 contingency table consisting of H(μH)/L(μL) fuzzy groups and cases/controls. The FCMEMDR pseudocode is presented in Algorithm 2, and the diagram is displayed in Fig. 1.

Algorithm 2. Pseudocode of FCMEMDR.
    // stratified random k-fold
    randomly sort dataset
    divide data into k subsets
    for d = 1 to k  // cross-validation
        classify d-th subset as the testing data and the others as the training data
        M = {M1, M2, …, Mi}  // M is a set of available m-locus combinations
        for each M
            classify samples of d-th training data into multifactor genotype combinations under M
            for i = 1 to total number of multifactor classes
                calculate wiH* and wiL*
                TPf = TPf + (ni1* × wiH*)
                FPf = FPf + (ni0* × wiH*)
                FNf = FNf + (ni1* × wiL*)
                TNf = TNf + (ni0* × wiL*)
            end i
            estimate M by four frequencies in a 2 × 2 contingency table
        end M
        choose the M with the best value as the d-th best model
    end d
    select best model M* according to cross-validation consistency
    estimate average value of M* using the k testing data

The FCMEMDR training model involves five steps:

Step 1. Stratified random k-fold.
1-1: The dataset is randomly sorted by shuffling all case and control samples.
1-2: The dataset is divided into k subsets. The proportion of cases and controls is computed, and the k empty CV subsets are generated. The case and control sets are assigned to a CV subset according to the proportion of cases and controls.

Step 2. Generation of training and testing data.
2-1: A CV subset is selected as the testing data. The other CV subsets comprise the training data. The CV subset comprising testing data is used for the testing module, and the remaining CV subsets are used for the training module.
2-2: Available combinations are generated. A set of all m-locus combinations is generated based on all factors.

Step 3. Evaluation of all possible combinations.
3-1: All m-locus combinations are evaluated, and $3^m$ multifactor classes are constructed according to the results for all combinations of multifactor classes in an m-locus combination, where m ≥ 2. All samples of the training data are assigned into multifactor classes. Cases (black bar; Fig. 1, symbol 3-1) and controls (white bar) are then counted.
3-2: Membership degree of multifactor classes is measured using the FCME approach. A maximum likelihood estimator for the probability of the cases and controls within the ith multifactor class is formulated as (1) under the assumption of binomial distribution.


$$ n_{i1}^{*} = \frac{n_{i1}}{n_{++}}, \qquad n_{i0}^{*} = \frac{n_{i0}}{n_{++}} \tag{1} $$

where n++ is the total number of cases and controls among all multifactor classes, and ni1 and ni0 are the cases and controls that satisfy the ith multifactor class, respectively. The FCM approach allows one datum to belong to two groups. The membership degree function of the FCM approach is employed to evaluate the H (wH) group and the L (wL) group. The degree of membership of ni* in group cj can be formulated as follows [19]:

$$ w_{ij} = \frac{1}{\sum_{k \in \{h, l\}} \left( \dfrac{d(n_i^{*}, c_j)}{d(n_i^{*}, c_k)} \right)^{\frac{2}{m-1}}} \tag{2} $$

where j = {h, l}. We define the H group as the matrix ch = {1, 0} and the L group as the matrix cl = {0, 1}. A cross-entropy method is applied to evaluate distance between the ith multifactor class and the outcomes (cases and controls).

$$ d(n^{*}, c) = -\sum c \times \log(n^{*}) \tag{3} $$

In FCMEMDR, each sample can have simultaneous, partial membership in the H (wH) and L (wL) groups. Membership degree in the H (wH) group and the L (wL) groups can be obtained using Equations (1–3) for the ith multifactor class. The membership degrees for the H (wH) group and the L (wL) group can be formulated as (4) and (5). The fuzzy rule for the H (wH) group and the L (wL) group can then be formulated as (6) and (7).

$$
\begin{aligned}
w_{iH} &= \frac{1}{\left( \dfrac{-(1 \times \log(n_{i1}^{*}) + 0 \times \log(n_{i0}^{*}))}{-(1 \times \log(n_{i1}^{*}) + 0 \times \log(n_{i0}^{*}))} \right)^{\frac{2}{m-1}} + \left( \dfrac{-(1 \times \log(n_{i1}^{*}) + 0 \times \log(n_{i0}^{*}))}{-(0 \times \log(n_{i1}^{*}) + 1 \times \log(n_{i0}^{*}))} \right)^{\frac{2}{m-1}}} \\
&= \frac{1}{1 + \left( \dfrac{-\log(n_{i1}^{*})}{-\log(n_{i0}^{*})} \right)^{\frac{2}{m-1}}}
 = \left( 1 + \left( \frac{\log(n_{i1}^{*})}{\log(n_{i0}^{*})} \right)^{\frac{2}{m-1}} \right)^{-1} \\
&= \left( 1 + \left( \log_{n_{i0}^{*}} n_{i1}^{*} \right)^{\frac{2}{m-1}} \right)^{-1}
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
w_{iL} &= \frac{1}{\left( \dfrac{-(0 \times \log(n_{i1}^{*}) + 1 \times \log(n_{i0}^{*}))}{-(1 \times \log(n_{i1}^{*}) + 0 \times \log(n_{i0}^{*}))} \right)^{\frac{2}{m-1}} + \left( \dfrac{-(0 \times \log(n_{i1}^{*}) + 1 \times \log(n_{i0}^{*}))}{-(0 \times \log(n_{i1}^{*}) + 1 \times \log(n_{i0}^{*}))} \right)^{\frac{2}{m-1}}} \\
&= \left( 1 + \left( \log_{n_{i1}^{*}} n_{i0}^{*} \right)^{\frac{2}{m-1}} \right)^{-1}
\end{aligned}
\tag{5}
$$

$$
w_{iH}^{*} =
\begin{cases}
0, & \text{if } n_{i1}^{*} = 0 \\
\left( 1 + \left( \log_{n_{i0}^{*}} n_{i1}^{*} \right)^{\frac{2}{m-1}} \right)^{-1}, & \text{other} \\
1, & \text{if } n_{i0}^{*} = 0
\end{cases}
\tag{6}
$$

$$
w_{iL}^{*} =
\begin{cases}
0, & \text{if } n_{i0}^{*} = 0 \\
\left( 1 + \left( \log_{n_{i1}^{*}} n_{i0}^{*} \right)^{\frac{2}{m-1}} \right)^{-1}, & \text{other} \\
1, & \text{if } n_{i1}^{*} = 0
\end{cases}
\tag{7}
$$

3-3: The four cells, true positive (TP), false positive (FP), false negative (FN), and true negative (TN), are computed in a 2 × 2 contingency table. Membership degrees for the four fuzzy cells (TPf, FPf, FNf, and TNf) are determined based on the FCME approach (H (wH) and L (wL)) and frequencies of the ith multifactor class (ni1* and ni0*). In all multifactor classes, TPf sums the H (wH) of frequencies n1* and FNf sums the L (wL) of frequencies n1*. In a similar manner, the H (wH) of frequencies n0* and the L (wL) of frequencies n0* are added to FPf and TNf, respectively. Membership degree is thus used in FCMEMDR to reduce the dimensionality of $3^m$ multifactor classes into 2 × 2 dimensionality. The four cells are formulated as follows:

$$
\begin{aligned}
TP_f &= \sum_i n_{i1}^{*} \times w_{iH}^{*} \\
FP_f &= \sum_i n_{i0}^{*} \times w_{iH}^{*} \\
FN_f &= \sum_i n_{i1}^{*} \times w_{iL}^{*} \\
TN_f &= \sum_i n_{i0}^{*} \times w_{iL}^{*}
\end{aligned}
\tag{8}
$$

3-4: Estimation of models. The m-locus combination is measured based on the 2 × 2 contingency table.


Fig. 2. Comparison of six epistatic detection methods used on 60 epistatic models without marginal effects. The figure compares implementations of MDR-CCR (C), MDR-LR (L), EFMDR-CCR (EC), EFMDR-LR (EL), FCMEMDR-CCR (FC), and FCMEMDR-LR (FL) in 60 epistasis models without marginal effects. Each block represents models comprising an MAF and an $h^2$, in which the upper and lower regions are detection success rate and number of CVC = 5, respectively. Darker red (H) denotes superior implementation, and darker blue (L) denotes inferior implementation in the given region. White indicates moderate implementation in the given region. The numbers refer to the epistasis models. For each model, the detection success rate is calculated based on the proportion in 100 datasets in which specific instances of epistasis are identified. Each dataset includes 1000 SNPs and 400 samples (200 cases and 200 controls).

1) CCRfuzzy for FCMEMDR-CCR: CCRfuzzy is used to evaluate the proportion of correctly classified individuals based on an m-locus combination. CCRfuzzy handles unbalanced datasets by balancing the TPf and TNf proportions [24]. CCRfuzzy is formulated as (9).

$$ CCR_{fuzzy} = 0.5 \left( \frac{TP_f}{TP_f + FN_f} + \frac{TN_f}{FP_f + TN_f} \right) \tag{9} $$

2) LRfuzzy for FCMEMDR-LR: LRfuzzy compares the maximum likelihood of an unrestricted model with that of a restricted model. In the 2 × 2 contingency table, the unrestricted model consists of the observed frequencies in the data, and the restricted model consists of the expected frequencies under the null hypothesis of no association [25]. LRfuzzy is formulated as (10).

$$
\begin{aligned}
LR_{fuzzy} &= 2 \sum \text{Observed} \times \log\left[ \frac{\text{Observed}}{\text{Expected}} \right] \\
&= 2 \left[ TP_f \times \log\left( \frac{TP_f}{A^{*}} \right) + FP_f \times \log\left( \frac{FP_f}{B^{*}} \right) + FN_f \times \log\left( \frac{FN_f}{C^{*}} \right) + TN_f \times \log\left( \frac{TN_f}{D^{*}} \right) \right]
\end{aligned}
\tag{10}
$$

$$
\text{s.t.}\quad
\begin{cases}
A^{*} = \dfrac{(TP_f + FN_f)(TP_f + FP_f)}{TP_f + FP_f + FN_f + TN_f} \\[2ex]
B^{*} = \dfrac{(FP_f + TN_f)(TP_f + FP_f)}{TP_f + FP_f + FN_f + TN_f} \\[2ex]
C^{*} = \dfrac{(TP_f + FN_f)(FN_f + TN_f)}{TP_f + FP_f + FN_f + TN_f} \\[2ex]
D^{*} = \dfrac{(FP_f + TN_f)(FN_f + TN_f)}{TP_f + FP_f + FN_f + TN_f}
\end{cases}
$$

3-5: Repeat for each combination. Steps 3-1 to 3-4 are repeated until evaluation of all m-locus combinations has been completed.

Step 4. Stratified random k-fold.
4-1: Models are sorted from low to high. The sets of all m-locus combinations are sorted according to their values of measures (i.e., CCRfuzzy and LRfuzzy).
4-2: The optimal model is recorded into CVC. The m-locus combination with maximum measures is identified as the optimal model in the CV dataset. This m-locus combination is then recorded into CVC.

Step 5. Saving of the optimal results and statistics from the best model.
5-1: Steps are repeated for each possible CV interval. Steps 2 to 4 are repeated until completion of all CV.
5-2: The best model is selected according to CVC. The m-locus combination with the highest CVC is regarded as the best model in this experiment.
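The membership rule of Eqs. (6) and (7) and the fuzzy cells of Eq. (8) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes that the m in the exponent 2/(m − 1) is the usual FCM fuzzifier (named `fuzzifier` here, with a default of 2.0 that is not stated in the text) and that an empty multifactor class contributes nothing; the function names and the example counts are hypothetical.

```python
import math


def fcme_memberships(n_i1, n_i0, n_total, fuzzifier=2.0):
    """Return (w_H, w_L) for one multifactor class, following Eqs. (1), (6), (7)."""
    p1 = n_i1 / n_total  # Eq. (1): n_i1*
    p0 = n_i0 / n_total  # Eq. (1): n_i0*
    if p1 == 0 and p0 == 0:
        return 0.0, 0.0  # assumption: an empty class has no membership
    if p1 == 0:
        return 0.0, 1.0  # no cases: fully low-risk
    if p0 == 0:
        return 1.0, 0.0  # no controls: fully high-risk
    expo = 2.0 / (fuzzifier - 1.0)
    d_h = -math.log(p1)  # Eq. (3): cross-entropy distance to c_h = {1, 0}
    d_l = -math.log(p0)  # Eq. (3): cross-entropy distance to c_l = {0, 1}
    w_h = 1.0 / (1.0 + (d_h / d_l) ** expo)
    w_l = 1.0 / (1.0 + (d_l / d_h) ** expo)
    return w_h, w_l


def fuzzy_cells(class_counts, fuzzifier=2.0):
    """Accumulate (TPf, FPf, FNf, TNf) of Eq. (8) over all multifactor classes.

    class_counts: iterable of (n_cases, n_controls), one pair per class.
    """
    class_counts = list(class_counts)
    n_total = sum(a + b for a, b in class_counts)
    tp = fp = fn = tn = 0.0
    for n_i1, n_i0 in class_counts:
        w_h, w_l = fcme_memberships(n_i1, n_i0, n_total, fuzzifier)
        p1, p0 = n_i1 / n_total, n_i0 / n_total
        tp += p1 * w_h
        fp += p0 * w_h
        fn += p1 * w_l
        tn += p0 * w_l
    return tp, fp, fn, tn


# Hypothetical counts for three multifactor classes: (cases, controls).
cells = fuzzy_cells([(30, 10), (10, 30), (10, 10)])
```

Because w_H and w_L sum to 1 for every non-empty class, the four fuzzy cells sum to the total relative frequency, so a class with equal case and control counts is split evenly between the H and L groups instead of being forced into one of them.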

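The two measures in Eqs. (9) and (10) can be computed directly from the four fuzzy cells. The sketch below is illustrative only: the function names are hypothetical, the natural logarithm is assumed, and cells equal to zero are skipped in the likelihood ratio (the usual convention that 0 × log 0 = 0, which the text does not state).

```python
import math


def ccr_fuzzy(tp, fp, fn, tn):
    """Eq. (9): balanced correct classification rate from the fuzzy cells."""
    return 0.5 * (tp / (tp + fn) + tn / (fp + tn))


def lr_fuzzy(tp, fp, fn, tn):
    """Eq. (10): likelihood ratio of observed versus expected fuzzy cells."""
    total = tp + fp + fn + tn
    observed = (tp, fp, fn, tn)
    expected = (
        (tp + fn) * (tp + fp) / total,  # A*
        (fp + tn) * (tp + fp) / total,  # B*
        (tp + fn) * (fn + tn) / total,  # C*
        (fp + tn) * (fn + tn) / total,  # D*
    )
    # Assumption: zero cells are skipped (0 * log 0 treated as 0).
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Hypothetical fuzzy cells (TPf, FPf, FNf, TNf).
ccr = ccr_fuzzy(0.3, 0.2, 0.2, 0.3)
lr = lr_fuzzy(0.3, 0.2, 0.2, 0.3)
```

A table whose cells equal their expected values (for example, 0.25 in every cell) gives a likelihood ratio of zero, which corresponds to the null hypothesis of no association.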
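Steps 1, 4, and 5 of the FCMEMDR process (stratified k-fold splitting and model selection by CVC) can be sketched as follows. This is a simplified illustration under stated assumptions: labels are 0 (control) or 1 (case), samples are dealt round-robin to the folds after shuffling, and the per-fold best models are supplied by the caller. The function names and the example winners are hypothetical.

```python
import random
from collections import Counter


def stratified_k_fold(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve the case/control ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for value in (1, 0):  # cases first, then controls
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)  # step 1-1: random sort
        for pos, i in enumerate(idx):  # step 1-2: deal into the k subsets
            folds[pos % k].append(i)
    return folds


def select_by_cvc(best_model_per_fold):
    """Return (model, CVC): the model chosen as best in the most folds."""
    model, cvc = Counter(best_model_per_fold).most_common(1)[0]
    return model, cvc


labels = [1] * 10 + [0] * 10
folds = stratified_k_fold(labels, k=5)

# Hypothetical best 2-locus model found in each of the 5 training folds.
winners = [(3, 7), (3, 7), (1, 4), (3, 7), (3, 7)]
best, cvc = select_by_cvc(winners)
```

In a full run, each fold in turn would serve as the testing data while the remaining folds are used to score every m-locus combination; only the bookkeeping around that scoring is shown here.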

III. RESULTS

A. Comparison of MDR-CCR, MDR-LR, EFMDR-CCR, EFMDR-LR, FCMEMDR-CCR, and FCMEMDR-LR across epistatic models without marginal effects and with marginal effects.

• Epistatic model without marginal effects

A total of 60 epistatic models without marginal effects were used to simulate the datasets; the multilocus penetrances of the 60 epistatic models were obtained from [8]. GAMETES was used to generate datasets under the settings of heritability ($h^2$) and minor allele frequency (MAF) values. The $h^2$ value controls phenotypic variation in disease models. In the 60 epistatic models in this study, $h^2$ ranged from 0.025 to 0.2, and the MAFs were 0.2 and 0.4. Each dataset included a specific target (epistatic interaction) with random architectures [26] and other SNPs. The datasets were generated with MAFs selected uniformly within [0.05, 0.5]. The detection success rate was calculated by counting the number of specific targets identified in 100 datasets by using the algorithm.

The detection success rates of MDR-CCR (C), MDR-LR (L), EFMDR-CCR (EC), EFMDR-LR (EL), FCMEMDR-CCR (FC), and FCMEMDR-LR (FL) for the 60 epistatic models are presented in Fig. 2. For epistatic models 1 to 30 ($h^2$  0.2), all methods exhibited relatively strong abilities to accurately detect the specific targets in each dataset. For epistatic models 31 to 60 ($h^2$  0.1), the results revealed that the detection abilities of FCMEMDR-CCR and FCMEMDR-LR were the same, and the detection success rates of these two methods were higher than those of MDR-CCR, MDR-LR, EFMDR-CCR, and EFMDR-LR. A Wilcoxon signed-rank test was used to evaluate the performances of FCMEMDR-CCR and FCMEMDR-LR as well as those of other algorithms in application to the 60 disease models. As denoted in Table I, the p values of FCMEMDR-CCR and FCMEMDR-LR were < .05, indicating the superior performance of these algorithms over other algorithms. In terms of detection success rates at CVC = 5 (Fig. 2), the blue blocks indicate the performance of CVC  3. In epistatic models 41 to 45 and 51 to 55, CVC = 1 for all methods. The FCME approach was more stable than the other methods in epistatic models 31 to 60, indicating that this approach enhanced the ability of MDR to detect disease loci without marginal effects. In terms of computational time, FCMEMDR-CCR and FCMEMDR-LR required an average of 19.4 s to run a complete process in the 60 epistatic models, where each dataset included 1000 SNPs with 400 samples.

• Epistatic model with marginal effects

For the simulation test with marginal effects, multilocus penetrances of the eight epistatic models were obtained from Namkung et al. (Models 1–6) [27] and Bush et al. (Models 7 and 8) [25]. In each epistatic model, we used GAMETES software [26] to simulate 100 datasets under the setting of the epistatic model, with MAFs set uniformly within [0.05, 0.5]. Thus, in these 100 datasets, the detection success rates were determined by counting the targets (i.e., epistatic interactions) that had been successfully detected using each algorithm.

Fig. 3. Comparison of six epistasis detection methods used on eight epistatic models with marginal effects. Each model consists of an MAF and an $h^2$. The upper and lower blocks are, respectively, detection success rate and the number of CVC = 5, and the left and right blocks, respectively, represent 400 samples (200 cases and 200 controls) and 2000 samples (1000 cases and 1000 controls). The darker red (H) denotes superior implementation and the darker blue (L) denotes poor implementation of the model. White indicates moderate implementation for a region. The numbers refer to the epistatic models.

The detection success rates of MDR-CCR (C), MDR-LR (L), EFMDR-CCR (EC), EFMDR-LR (EL), FCMEMDR-CCR (FC), and FCMEMDR-LR (FL) for the eight models are presented in Fig. 3. Overall, the detection success rates of MDR-CCR and MDR-LR were enhanced when the FCME approach was used. FCMEMDR-CCR and FCMEMDR-LR outperformed the other algorithms in the eight models with marginal effects. For detection success rates at CVC = 5, the FCME approach outperformed the original method, indicating that the FCME approach can be used to improve the robustness of models with marginal effects. The Wilcoxon signed-rank test was used to compare the performances of FCMEMDR and other algorithms in the eight disease models. FCMEMDR-CCR and FCMEMDR-LR exhibited superior detection success rates compared with those of the other algorithms (+, Table I) in application to the eight epistasis models with marginal effects (400 samples and 2000 samples). p < .05 indicated significant superiority of FCMEMDR-CCR and FCMEMDR-LR over other methods. Our results suggest that the FCME approach enhanced MDR due to consideration of the uncertainty of H/L classification in disease loci with marginal effects. Regarding computational time, FCMEMDR-CCR and FCMEMDR-LR required an average of 36.2 s to run a complete process for each dataset in the eight epistatic models, where each dataset included 1000 SNPs with 400 samples. For trials with 2000 samples, FCMEMDR-CCR and FCMEMDR-LR required an average of 162 s to run a complete process for each dataset.
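The detection success rate used throughout this section can be expressed in a few lines of Python. The sketch assumes that each method reports one selected m-locus combination per simulated dataset and that a detection counts as successful when that combination matches the simulated target regardless of SNP order; the function name and the example values are hypothetical.

```python
def detection_success_rate(selected_models, target):
    """Fraction of datasets in which the selected model equals the simulated target."""
    target = frozenset(target)
    hits = sum(1 for model in selected_models if frozenset(model) == target)
    return hits / len(selected_models)


# Hypothetical output of one method on 5 simulated datasets; the target SNP pair is (10, 42).
selected = [(10, 42), (42, 10), (3, 42), (10, 42), (7, 8)]
rate = detection_success_rate(selected, target=(10, 42))
```

With 100 simulated datasets per model, as in the experiments above, the same function returns the proportion of datasets in which the target interaction was identified.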


TABLE I. COMPARISON OF FCMEMDR WITH MDR-BASED, EFMDR-BASED, ANTEPISEEKER, BEAM, BOOST, SNPRULER, DECMDR, PBMDR, AND IMDR USING WILCOXON SIGNED-RANK TEST

A. FCMEMDR-CCR compared with nine algorithms; B. FCMEMDR-LR compared with nine algorithms

i. 60 epistasis models without marginal effects
Algorithm       A. FCMEMDR-CCR      B. FCMEMDR-LR
MDR-CCR         p = 1.094E-5*       p = 1.218E-5*
EFMDR-CCR       p = 1.730E-5*       p = 1.212E-5*
AntEpiSeeker    p = 1.094E-5*       p = 1.424E-11*
BEAM            p = 4.065E-5*       p = 2.296E-11*
BOOST           p = 0.255           p = 0.255
SNPRuler        p = 2.402E-4*       p = 3.894E-9*
DECMDR          p = 0.002*          p = 0.002*
PBMDR           p = 1.030E-11*      p = 1.030E-11*
IMDR            p = 2.978E-5*       p = 2.978E-5*

ii. 8 epistasis models with marginal effects (400 samples)
Algorithm       A. FCMEMDR-CCR      B. FCMEMDR-LR
MDR-CCR         p = 0.018*          p = 0.018*
EFMDR-CCR       p = 0.018*          p = 0.018*
AntEpiSeeker    p = 0.028*          p = 0.028*
BEAM            p = 0.018*          p = 0.018*
BOOST           p = 0.180           p = 0.180
SNPRuler        p = 0.028*          p = 0.028*
DECMDR          p = 0.237           p = 0.237
PBMDR           p = 0.018*          p = 0.018*
IMDR            p = 0.018*          p = 0.018*

iii. 8 epistasis models with marginal effects (2,000 samples)
Algorithm       A. FCMEMDR-CCR      B. FCMEMDR-LR
MDR-CCR         p = 0.012*          p = 0.018*
EFMDR-CCR       p = 0.180           p = 0.180
AntEpiSeeker    p = 0.012*          p = 0.008*
BEAM            p = 0.018*          p = 0.012*
BOOST           p = 0.180           p = 0.180
SNPRuler        p = 0.012*          p = 0.012*
DECMDR          p = 0.180           p = 0.180
PBMDR           p = 0.018*          p = 0.018*
IMDR            p = 0.180           p = 0.180

−: Degree to which FCMEMDR-CCR/FCMEMDR-LR is inferior to the algorithm, +: degree to which FCMEMDR-CCR/FCMEMDR-LR is superior to the algorithm, =: degree to which FCMEMDR-CCR/FCMEMDR-LR is equal to the algorithm, p: p-value. * indicates p < .05.

B. Comparison of AntEpiSeeker, BEAM, BOOST, SNPRuler, DECMDR, PBMDR, IMDR, and FCMEMDR across epistatic models without marginal effects and with marginal effects.

We compared AntEpiSeeker [10], BOOST [9], BEAM [7], SNPRuler [8], DECMDR [28], PBMDR [29], and IMDR [30] with FCMEMDR-CCR and FCMEMDR-LR across epistatic models with and without marginal effects (Table I). The aforementioned datasets were also used in the epistatic models


with marginal effects and without marginal effects. The black bars represent the degree to which FCMEMDR-CCR and FCMEMDR-LR were superior to the algorithm under comparison (+), the degree to which FCMEMDR-CCR and FCMEMDR-LR were inferior to the algorithm under comparison (−), and the degree to which FCMEMDR-CCR and FCMEMDR-LR were equal to the compared algorithm. The absence of bars indicates a degree of zero.

• Epistatic model without marginal effects

Sixty epistatic models without marginal effects were used to evaluate the performance of AntEpiSeeker, BEAM, SNPRuler, BOOST, DECMDR, PBMDR, IMDR, FCMEMDR-CCR, and FCMEMDR-LR. Both FCMEMDR-CCR and FCMEMDR-LR exhibited superior performance to that of AntEpiSeeker, BEAM, SNPRuler, DECMDR, PBMDR, and IMDR. However, the performance of FCMEMDR-CCR and FCMEMDR-LR was inferior to that of AntEpiSeeker (model 52), SNPRuler (model 55), BOOST (models 45, 53–55, 60), DECMDR (model 45), and IMDR (model 36). The Wilcoxon signed-rank test indicated the significant superiority of FCMEMDR-CCR and FCMEMDR-LR to AntEpiSeeker, BEAM, SNPRuler, DECMDR, PBMDR, and IMDR; however, their superiority to BOOST in the epistatic models without marginal effects was not statistically significant.

• Epistatic model with marginal effects

Eight epistatic models with marginal effects were used to evaluate the performance of AntEpiSeeker, BEAM, SNPRuler,

FCMEMDR-CCR and FCMEMDR-LR outperformed AntEpiSeeker, BEAM, SNPRuler, PBMDR, and IMDR in the eight epistatic models with marginal effects. For large samples (2000 samples), FCMEMDR-CCR and FCMEMDR-LR exhibited superiority to AntEpiSeeker, BEAM, SNPRuler, and PBMDR in all epistatic models with marginal effects. Although FCMEMDR-CCR and FCMEMDR-LR performed equally to BOOST for most models, they were superior to BOOST for models 1 and 4. FCMEMDR-CCR and FCMEMDR-LR were superior to DECMDR and IMDR for models 7 and 8. The Wilcoxon signed-rank test indicated the significant superiority of FCMEMDR-CCR and FCMEMDR-LR to AntEpiSeeker, BEAM, SNPRuler, and PBMDR, but only a nonsignificant trend of superiority was evident in comparisons of FCMEMDR-CCR and FCMEMDR-LR with BOOST, DECMDR, and IMDR. Our results suggest that the FCME approach may enhance MDR because of its additional consideration for the uncertainty of H/L classification when detecting disease loci with marginal effects.

C. Real data experiment

Real data experiments were developed to validate the ability of FCMEMDR to correctly determine disease-associated interactions. To accomplish this, we implemented two cases in our real data experiments. A real coronary artery disease (CAD) dataset was selected from the WTCCC, and a real chronic dialysis dataset was obtained from the Kaohsiung Chang Gung
Memorial Hospital.
BOOST, DECMDR, PBMDR, IMDR, FCMEMDR-CCR, and
FCMEMDR-LR. For small samples (400 samples),

Fig. 4. Genotype counts for SNP pairs in CAD and membership degrees of the H and L groups in chromosome 3 in the WTCCC data. (A)
Genotype counts for SNP pairs in CAD and (B) the membership degrees of the H (μH) and L (μL) groups relative to the most common double
homozygous genotype in chromosome 3 in the WTCCC data. In each cell of (A), the values attached to the left-side bars are the numbers of
cases, and the values attached to the right-side bars are the numbers of controls. In each cell of (B), the values attached to the left-side bars are the
membership degrees of the H group, and the values attached to the right-side bars are membership degrees of the L group. Background colors
represent the degrees of the membership function. Darker red indicates H groups, and darker green indicates L groups. White indicates similar
membership degrees between the H and L groups.


Fig. 5. Genotype counts for SNP pairs in patients undergoing chronic dialysis, and membership degrees of the H and L groups in mitochondrial
DNA. (A) Genotype counts for SNP pairs in patients undergoing chronic dialysis and (B) the membership degrees of the H (μH) and L (μL) groups
relative to the most common homozygous genotype in mitochondrial DNA.
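Figs. 4 and 5 report, for each genotype cell, one membership degree for the H group and one for the L group instead of a single binary label. The sketch below shows the general idea using the standard fuzzy c-means membership formula with two fixed prototypes ("all cases" and "all controls"). The distance used here (the gap between the cell's case proportion and each prototype), the fuzzifier value of 2, and all function names are assumptions chosen for illustration; the FCME measure in this paper relies on a cross-entropy distance that this sketch does not reproduce. The odds of 5.5 and 2.5 are the ones used as an example in the Discussion.

```python
def binary_mdr_label(cases, controls, threshold=1.0):
    """Classic MDR rule: H if the case/control ratio exceeds the threshold."""
    if controls == 0:
        return "H" if cases > 0 else "L"
    return "H" if cases / controls > threshold else "L"


def fcm_memberships(cases, controls, fuzzifier=2.0):
    """Return (mu_H, mu_L) for one genotype cell.

    Standard two-cluster FCM membership:
        mu_H = 1 / (1 + (d_H / d_L) ** (2 / (fuzzifier - 1)))
    d_H and d_L are illustrative distances (an assumption, not the paper's
    cross-entropy distance) to the all-cases and all-controls prototypes.
    """
    total = cases + controls
    if total == 0:
        return 0.5, 0.5          # empty cell: no evidence either way
    case_proportion = cases / total
    d_high = 1.0 - case_proportion   # distance to the "all cases" prototype
    d_low = case_proportion          # distance to the "all controls" prototype
    if d_high == 0.0:
        return 1.0, 0.0
    if d_low == 0.0:
        return 0.0, 1.0
    exponent = 2.0 / (fuzzifier - 1.0)
    mu_high = 1.0 / (1.0 + (d_high / d_low) ** exponent)
    return mu_high, 1.0 - mu_high


# Two hypothetical cells with odds 5.5 and 2.5 in a balanced dataset.
for cases, controls in [(55, 10), (25, 10)]:
    label = binary_mdr_label(cases, controls)
    mu_h, mu_l = fcm_memberships(cases, controls)
    print(f"odds {cases / controls:.1f}: binary = {label}, "
          f"mu_H = {mu_h:.3f}, mu_L = {mu_l:.3f}")
```

Both cells receive the binary label H, whereas their membership degrees differ, which is the distinction that a membership function on the interval [0, 1] is meant to preserve.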

➢ WTCCC dataset
In this study, we used a large dataset to test the performance of FCMEMDR-CCR and FCMEMDR-LR. A CAD dataset was selected from the WTCCC for this purpose. The WTCCC was assembled through a collaborative effort, beginning in 2005, among 50 U.K. research groups [31]. The dataset includes 23 data subsets (chromosomes 1 to 22 and chromosome X) with a total of 500 569 SNPs. Each SNP was obtained from 1988 CAD patients and 1500 normal people in the United Kingdom who identified themselves as white Europeans. All SNPs were genotyped using the Affymetrix GeneChip 500K Mapping Array Set.

The epistases detected by FCMEMDR-CCR and FCMEMDR-LR are listed in Table II. The values of CCRfuzzy were between 0 and 1, and the highest value indicated optimal estimation. A high LRfuzzy value indicated a strong contrast between cases and controls. CCRfuzzy values were 0.585–0.957 with an average value of 0.741 (standard deviation [SD] = 0.095). LRfuzzy values were 23.811–712.602 with an average value of 215.149 (SD = 166.606). The three epistases exhibited high CCRfuzzy (>0.8), LRfuzzy (>300), and CVC (=5) values and high significance (p < .0001), indicating that these epistases may have interacted in CAD. Chi-squared and p values were evaluated using the raw datasets to determine the significance levels of the epistases. All epistases detected using FCMEMDR-CCR and FCMEMDR-LR in the 23 chromosomes had p values of <.05. Accuracy higher than 0.5 indicates that the frequency of chance can be significantly reduced [32]. Therefore, FCMEMDR-CCR and FCMEMDR-LR detected significant epistasis. A pair of SNPs, rs16926425 and rs7299571, in chromosome 12 exhibited the highest CCRfuzzy (0.957) and LRfuzzy (712.602) values (CVC = 5, p < .0001). The epistases detected using FCMEMDR-CCR and FCMEMDR-LR differed in chromosomes 10 and 17, as indicated by the results of CVC = 5 and p < .0001. For chromosome 1, FCMEMDR-CCR achieved better detection of epistasis than did FCMEMDR-LR, as indicated by the CVC. Regarding the optimal epistasis detected by both FCMEMDR-CCR and FCMEMDR-LR with SNPs located in explicitly identified genes, the pair rs41367044 (located in the GTF2E1 gene) and rs10866051 (located in the LOC105376942 gene) in chromosome 3 exhibited the highest CCRfuzzy (0.837) and LRfuzzy (387.414) values. For interpretation, a graphical representation of the interaction associated with the two-locus SNP combination is presented in Fig. 4. The distinctions between cases and controls for the epistasis between rs41367044 and rs10866051 were identified in the genotype pairs of each cell (Fig. 4A). Six green-colored cells, (CC, CC), (CT, CC), (TT, CC), (CC, CT), (CT, CT), and (TT, CT), were considered to belong to the low-risk group (Fig. 4B). The interaction model corresponded to M3, a jointly recessive-dominant model among the two-locus disease models, in which two copies of the disease allele at the first locus and at least one disease allele at the second locus are considered affected [33]. Two dark red cells, (CT, TT) and (TT, TT), were considered to belong to the H group with strong certainty. Only one cell, (CC, TT), exhibited a low probability of belonging to the L group. GTF2E1 is the basic transcription factor for the GTF gene family, which is strongly associated with human immunodeficiency virus type 1 (HIV-1) replication [34]. For the other strong interactions, a direct effect between genes and CAD could not be shown because they include SNPs annotated as UNKNOWN. Overall, the results revealed that FCMEMDR-CCR and FCMEMDR-LR can be used to efficiently identify epistasis in a large dataset. The FCMEMDR process required an average of approximately 9.37 h per chromosome.

➢ Chronic dialysis dataset
The data of patients undergoing chronic dialysis were collected after obtaining approval from the Committee on Human Research at Kaohsiung Chang Gung Memorial Hospital, and the study was conducted in accordance with the Helsinki Declaration in Taiwan [35]. The data comprise a total of 77 SNPs in the mitochondrial displacement loop (D-loop) gene, and the data on these SNPs were obtained from 193 Taiwanese


TABLE II. SUMMARY OF FCMEMDR-CCR AND FCMEMDR-LR RESULTS FOR CORONARY ARTERY DISEASE FROM WTCCC DATA

Location | SNP Groups (Gene), FCMEMDR-CCR | CCR | CVC | SNP Groups (Gene), FCMEMDR-LR | LR | CVC
Chr. 1 | rs41399650 (UNKNOWN), rs17163057 (UNKNOWN) | 0.784 | 5 | rs2999538 (LOC105371442), rs41399650 (UNKNOWN) | 265.099 | 4
Chr. 2 | rs41509345 (NCKAP5), rs41453947 (UNKNOWN) | 0.783 | 5 | | 242.958 | 5
Chr. 3 | rs41367044 (GTF2E1), rs10866051 (LOC105376942) | 0.837 | 5 | | 387.414 | 3
Chr. 4 | rs41426946 (PPA2), rs41529544 (UNKNOWN) | 0.797 | 5 | | 270.066 | 5
Chr. 5 | rs41493746 (UNKNOWN), rs41421845 (LINC02107) | 0.674 | 5 | | 91.429 | 5
Chr. 6 | rs3006172 (WDR27), rs41489047 (ADGRB3) | 0.779 | 5 | | 240.729 | 5
Chr. 7 | rs41437948 (POU6F2), rs41468749 (GALNT17) | 0.639 | 5 | | 55.271 | 5
Chr. 8 | rs35120859 (UNKNOWN), rs17480050 (CSGALNACT1) | 0.722 | 4 | | 139.631 | 4
Chr. 9 | rs41354745 (KANK1), rs2891142 (SLC24A2) | 0.736 | 5 | | 161.290 | 5
Chr. 10 | rs41437948 (FAM107B), rs41468749 (TCERG1L) | 0.779 | 5 | rs41437948 (POU6F2), rs41468749 (GALNT17) | 267.022 | 5
Chr. 11 | rs41535846 (UNKNOWN), rs41518446 (MAML2) | 0.627 | 2 | | 47.620 | 2
Chr. 12 | rs16926425 (SOX5), rs7299571 (UNKNOWN) | 0.957 | 5 | | 712.602 | 5
Chr. 13 | rs7328649 (FAM155A), rs9540728 (PCDH9) | 0.805 | 5 | | 275.813 | 5
Chr. 14 | rs41324950 (LOC105370603), rs41453247 (UNKNOWN) | 0.783 | 5 | | 234.386 | 5
Chr. 15 | rs41418744 (UNKNOWN), rs41418548 (SHC4) | 0.664 | 3 | | 80.943 | 3
Chr. 16 | rs235633 (UNKNOWN), rs41483646 (UNKNOWN) | 0.752 | 5 | | 186.539 | 5
Chr. 17 | rs4969207 (DNAH17), rs3785579 (CACNG1) | 0.868 | 2 | rs3785579 (CACNG1), rs1870998 (UNKNOWN) | 494.358 | 2
Chr. 18 | rs4799934 (CELF4), rs3794931 (ZNF516) | 0.732 | 5 | | 191.401 | 5
Chr. 19 | rs375299 (UNKNOWN), rs41370444 (UNKNOWN) | 0.601 | 5 | | 31.169 | 5
Chr. 20 | rs2748666 (UNKNOWN), rs41405046 (UNKNOWN) | 0.872 | 5 | | 440.051 | 5
Chr. 21 | rs41378546 (CLDN14), rs41451052 (UNKNOWN) | 0.585 | 5 | | 23.811 | 5
Chr. 22 | rs41437948 (POU6F2), rs41468749 (GALNT17) | 0.634 | 4 | | 52.509 | 4
Chr. X | rs1419930 (UNKNOWN), rs41500547 (DMD) | 0.637 | 5 | | 56.305 | 5

Chr: Chromosome; CCR and LR values are based on membership degree of the FCME approach.
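The summary statistics quoted in the text for Table II (ranges and averages of the CCR and LR columns) can be recomputed directly from the tabulated values. The short check below is only a worked-arithmetic sketch: the lists are transcribed from Table II in chromosome order (1 to 22, then X), and the variable names are hypothetical.

```python
from statistics import mean

# CCR column of Table II, chromosomes 1-22 followed by X.
ccr_values = [
    0.784, 0.783, 0.837, 0.797, 0.674, 0.779, 0.639, 0.722,
    0.736, 0.779, 0.627, 0.957, 0.805, 0.783, 0.664, 0.752,
    0.868, 0.732, 0.601, 0.872, 0.585, 0.634, 0.637,
]

# LR column of Table II, same order.
lr_values = [
    265.099, 242.958, 387.414, 270.066, 91.429, 240.729, 55.271, 139.631,
    161.290, 267.022, 47.620, 712.602, 275.813, 234.386, 80.943, 186.539,
    494.358, 191.401, 31.169, 440.051, 23.811, 52.509, 56.305,
]

for name, values in [("CCR", ccr_values), ("LR", lr_values)]:
    print(f"{name}: n = {len(values)}, range = {min(values)}-{max(values)}, "
          f"mean = {mean(values):.3f}")
```

The recomputed ranges and means agree with the figures reported in the text (0.585–0.957 with mean 0.741 for CCR, and 23.811–712.602 with mean 215.149 for LR).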

patients undergoing chronic dialysis and 704 normal Taiwanese people [35]. When FCMEMDR-CCR and FCMEMDR-LR were applied for the optimal detection of epistasis between the SNPs located in explicitly identified genes, the highest CCRfuzzy (0.617), LRfuzzy (2.209), and CVC (4) values were found for epistasis between C150T and G185A (both located in the D-loop gene) in mitochondrial DNA, with a high level of significance (p < .0001). A graphical representation of the interaction in the two-locus SNP combination is displayed in Fig. 5, which has the same figure legend as Fig. 4. The distinctions between cases and controls for the epistasis between C150T and G185A were identified in the genotype pairs of each cell (Fig. 5A). The green cell (TT, GG) was considered to belong to the low-risk group (Fig. 5B), whereas the red cell (CC, AA) was considered to belong to the high-risk group. The results revealed that FCMEMDR-CCR and FCMEMDR-LR successfully identified the epistasis. In terms of computational time, the FCMEMDR process required approximately 0.21 s.

IV. DISCUSSION
MDR is a robust, nonparametric method that detects nonlinear interactions among multiple discrete genetic factors. MDR can be used to facilitate conversion of high-dimensional space into low-dimensional space, in turn enabling conversion of $3^{m}$ prediction rules into 2 × 2 contingency tables for evaluating


Fig. 6. Comparison of EFMDR and FCMEMDR.

epistasis [11]. Regarding the limitations of MDR, the uncertainty of binary classification may result in the loss of critical information [36]. Binary classification uses the probabilities of case and control and multiple genotypes to identify H and L groups. In this study, a balanced dataset was assumed (i.e., the classification threshold was 1) and a two-SNP combination comprising nine multifactor genotypes was used. Under binary classification, a multifactor genotype with odds of 5.5 and a multifactor genotype with odds of 2.5 are both categorized into the H group. Therefore, the considerable difference between the two multifactor genotypes cannot be distinguished using MDR. Many studies have focused on this limitation [18, 36]. Research and development of resources for improving MDR-based classification have increased; however, work in this field remains limited.

In this study, our results agree with those of Leem and Park [18], in which the epistasis detection ability of the fuzzy-based classification approach was superior to that of binary-based classification in MDR. The application of fuzzy-based classification enabled MDR to assign different membership functions to two multifactor genotypes, thereby enabling the binary membership on the set {0, 1} to be extended to the membership function on the interval [0, 1]. FCMEMDR addresses information uncertainty by using the FCME approach. In the FCME approach, FCM produces patterns that can belong to any of the cluster classes with a certain degree of fuzzy membership, thus increasing the degree of difference between similar distributions of m-locus combinations and ultimately resulting in more accurate detection of significant epistasis. For the interval [0, 1], a cross-entropy method was used to evaluate the distance between the ith multifactor class and the outcomes (cases and controls). We compared FCME with the empirical fuzzy (EF) approach [18] for evaluating an m-locus combination, thereby detailing the improvement of FCMEMDR. For illustrative purposes, a combination of two SNPs was considered: SNP A and SNP B (Fig. 6A). Differences between the measurements of membership degree for multifactor classes


are illustrated in Fig. 6B. The red line indicates the distance between absolute cases and the ith cell, and the green dotted line indicates the distance between absolute controls and the ith cell. A short distance suggests a high probability that the ith cell belongs to the associated class. In this study, results for the cells (AA, BB), (aa, BB), and (Aa, Bb) revealed that FCME was superior in distinguishing cells associated with cases and controls. Thus, the membership degrees of the H and L groups could be influenced to improve the degree of distinction (Fig. 6C and 6D). A more accurate estimation of the difference between the four cells in the 2 × 2 contingency table was achieved using FCME. Simulation experiments demonstrated that the detection success rate of FCMEMDR was higher than those of other fuzzy-based MDR methods.

This study revealed that accounting for ambiguity in MDR can improve detection capability. Our results confirmed that FCMEMDR can be used to detect epistasis in simulated and real datasets. FCMEMDR retained the advantages of the MDR method. First, FCMEMDR used FCM measurements to determine potential instances of epistasis that could be used to enhance distinctions between pairs of multifactor genotypes. Second, FCMEMDR was used to graphically represent membership of the epistasis associated with the disease group. Reduction of the dimensionality of the multisite data enabled clear determination of whether multiple loci associated with a disease are more common in affected or unaffected individuals. Third, FCMEMDR did not require selection of an optimal adjustment parameter value for the fuzzy set theory in practical data applications. In various simulation models, FCMEMDR exhibited higher power than fuzzy MDR-based methods.

FCMEMDR requires a total computation time of $k \times \binom{n}{m} \times s \times 3^{m}$ to determine the optimal m-locus combination, where k is the number of subsets, n is the number of SNPs, and s is the total number of samples. FCMEMDR exhibited satisfactory runtimes, and powerful computing methods such as parallel operations [37], GPU-based MDR [38], the greedy search strategy [39], and DE-based MDR [28] may be used to further improve the runtime of FCMEMDR.

V. CONCLUSIONS
In this work, we introduced a powerful FCMEMDR method for the detection of epistasis. The FCMEMDR method was formulated based on an FCME approach to address the uncertainty associated with MDR-based methods. Under the application of FCM, each cell derived from a multifactor genotype could assess its own membership, allowing MDR to detect the most biologically significant instances of epistasis. Performance evaluation based on simulations and real GWAS datasets confirmed that FCMEMDR satisfactorily detected epistasis.

VI. REFERENCES
[1] W. S. Bush and J. H. Moore, "Genome-wide association studies," PLoS Computational Biology, vol. 8, Art. no. e1002822, 2012.
[2] J. H. Moore, F. W. Asselbergs, and S. M. Williams, "Bioinformatics challenges for genome-wide association studies," Bioinformatics, vol. 26, pp. 445-455, 2010.
[3] E. E. Eichler, J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore, et al., "Missing heritability and strategies for finding the underlying causes of complex disease," Nature Reviews Genetics, vol. 11, pp. 446-450, 2010.
[4] H. J. Cordell, "Detecting gene–gene interactions that underlie human diseases," Nature Reviews Genetics, vol. 10, Art. no. 392, 2009.
[5] T. F. Mackay and J. H. Moore, "Why epistasis is important for tackling complex human disease genetics," Genome Medicine, vol. 6, Art. no. 42, 2014.
[6] T. F. C. Mackay, "Epistasis and quantitative traits: using model organisms to study gene-gene interactions," Nature Reviews Genetics, vol. 15, pp. 22-33, 2014.
[7] Y. Zhang and J. S. Liu, "Bayesian inference of epistatic interactions in case-control studies," Nature Genetics, vol. 39, pp. 1167-1173, 2007.
[8] X. Wan, C. Yang, Q. Yang, H. Xue, N. L. S. Tang, and W. C. Yu, "Predictive rule inference for epistatic interaction detection in genome-wide association studies," Bioinformatics, vol. 26, pp. 30-37, 2010.
[9] X. Wan, C. Yang, Q. Yang, H. Xue, X. Fan, N. L. Tang, et al., "BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies," The American Journal of Human Genetics, vol. 87, pp. 325-340, 2010.
[10] Y. Wang, X. Liu, K. Robbins, and R. Rekaya, "AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm," BMC Research Notes, vol. 3, Art. no. 117, 2010.
[11] M. D. Ritchie, L. W. Hahn, N. Roodi, L. R. Bailey, W. D. Dupont, F. F. Parl, et al., "Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer," American Journal of Human Genetics, vol. 69, pp. 138-147, 2001.
[12] J. Gui, J. H. Moore, S. M. Williams, P. Andrews, H. L. Hillege, P. van der Harst, et al., "A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits," PLoS One, vol. 8, Art. no. e66545, 2013.
[13] O. Y. Fu, H. W. Chang, Y. D. Lin, L. Y. Chuang, M. F. Hou, and C. H. Yang, "Breast cancer-associated high-order SNP-SNP interaction of CXCL12/CXCR4-related genes by an improved multifactor dimensionality reduction (MDR-ER)," Oncology Reports, vol. 36, pp. 1739-1747, 2016.
[14] L. Y. Chuang, H. Y. Lane, Y. D. Lin, M. T. Lin, C. H. Yang, and H. W. Chang, "Identification of SNP barcode biomarkers for genes associated with facial emotion perception using particle swarm optimization algorithm," Annals of General Psychiatry, vol. 13, Art. no. 15, 2014.
[15] D. Gola, J. M. M. John, K. van Steen, and I. R. Konig, "A roadmap to multifactor dimensionality reduction methods," Briefings in Bioinformatics, vol. 17, pp. 293-308, 2016.
[16] C. H. Yang, Y. D. Lin, and L. Y. Chuang, "Multiple-criteria decision analysis-based multifactor dimensionality reduction for detecting gene-gene interactions," IEEE Journal of Biomedical and Health Informatics, vol. 23, pp. 416-426, 2018.
[17] C. H. Yang, Y. D. Lin, and L. Y. Chuang, "Class balanced multifactor dimensionality reduction to detect gene–gene interactions," IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi:10.1109/TCBB.2018.2858776, 2018.
[18] S. Leem and T. Park, "An empirical fuzzy multifactor dimensionality reduction method for detecting gene-gene interactions," BMC Genomics, vol. 18, Art. no. 115, 2017.
[19] J. C. Bezdek, C. Coray, R. Gunderson, and J. Watson, "Detection and characterization of cluster substructure i. linear structure: Fuzzy c-lines," SIAM Journal on Applied Mathematics, vol. 40, pp. 339-357, 1981.
[20] J. Nayak, B. Naik, and H. Behera, "Fuzzy C-means (FCM) clustering algorithm: a decade review from 2000 to 2014," in Computational intelligence in data mining-volume 2, ed: Springer, 2015, pp. 133-149.
[21] H.-Y. Jung, S. Leem, and T. Park, "Fuzzy set-based generalized multifactor dimensionality reduction analysis of gene-gene interactions," BMC Medical Genomics, vol. 11, Art. no. 32, 2018.


[22] S. Leem and T. Park, "EFMDR-Fast: An application of empirical fuzzy multifactor dimensionality reduction for fast execution," Genomics & Informatics, vol. 16, Art. no. e37, 2018.
[23] A. A. Motsinger and M. D. Ritchie, "Multifactor dimensionality reduction: an analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies," Human Genomics, vol. 2, Art. no. 318, 2006.
[24] C. H. Yang, Y. D. Lin, L. Y. Chuang, J. B. Chen, and H. W. Chang, "MDR-ER: Balancing functions for adjusting the ratio in risk classes and classification errors for imbalanced cases and controls using multifactor-dimensionality reduction," PLoS One, vol. 8, Art. no. e79387, 2013.
[25] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, "Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction," BMC Bioinformatics, vol. 9, Art. no. 238, 2008.
[26] R. J. Urbanowicz, J. Kiralis, N. A. Sinnott-Armstrong, T. Heberling, J. M. Fisher, and J. H. Moore, "GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures," Biodata Mining, vol. 5, Art. no. 16, 2012.
[27] J. Namkung, K. Kim, S. Yi, W. Chung, M. S. Kwon, and T. Park, "New evaluation measures for multifactor dimensionality reduction classifiers in gene-gene interaction analysis," Bioinformatics, vol. 25, pp. 338-345, 2009.
[28] C. H. Yang, L. Y. Chuang, and Y. D. Lin, "CMDR based differential evolution identify the epistatic interaction in genome-wide association studies," Bioinformatics, vol. 33, pp. 2354-2362, 2017.
[29] C. H. Yang, H. S. Yang, and L. Y. Chuang, "PBMDR: A particle swarm optimization-based multifactor dimensionality reduction for the detection of multilocus interactions," Journal of Theoretical Biology, vol. 461, pp. 68-75, 2019.
[30] L. Y. Chuang, Y. D. Lin, and C. H. Yang, "Improved classification method for detecting potential interactions between genes," in Science and Information Conference, 2018, pp. 394-403.
[31] P. R. Burton, D. G. Clayton, L. R. Cardon, N. Craddock, P. Deloukas, A. Duncanson, et al., "Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls," Nature, vol. 447, pp. 661-678, 2007.
[32] C. S. Coffey, P. R. Hebert, M. D. Ritchie, H. M. Krumholz, J. M. Gaziano, P. M. Ridker, et al., "An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: The importance of model validation," BMC Bioinformatics, vol. 5, Art. no. 49, 2004.
[33] W. T. Li and J. Reich, "A complete enumeration and classification of two-locus disease models," Human Heredity, vol. 50, pp. 334-349, 2000.
[34] N. Dziuba, M. R. Ferguson, W. A. O'Brien, A. Sanchez, A. J. Prussia, N. J. McDonald, et al., "Identification of cellular proteins required for replication of human immunodeficiency virus type 1," AIDS Research and Human Retroviruses, vol. 28, pp. 1329-1339, 2012.
[35] J. B. Chen, Y. H. Yang, W. C. Lee, C. W. Liou, T. K. Lin, Y. H. Chung, et al., "Sequence-based polymorphisms in the mitochondrial D-Loop and potential SNP predictors for chronic dialysis," PLoS One, vol. 7, Art. no. e41125, 2012.
[36] H.-Y. Jung, S. Leem, S. Lee, and T. Park, "A novel fuzzy set based multifactor dimensionality reduction method for detecting gene–gene interaction," Computational Biology and Chemistry, vol. 65, pp. 193-202, 2016.
[37] W. S. Bush, S. M. Dudek, and M. D. Ritchie, "Parallel multifactor dimensionality reduction: a tool for the large-scale analysis of gene-gene interactions," Bioinformatics, vol. 22, pp. 2173-2174, 2006.
[38] C. S. Greene, N. A. Sinnott-Armstrong, D. S. Himmelstein, P. J. Park, J. H. Moore, and B. T. Harris, "Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS," Bioinformatics, vol. 26, pp. 694-695, 2010.
[39] C. H. Yang, Y. D. Lin, C. S. Yang, and L. Y. Chuang, "An efficiency analysis of high-order combinations of gene-gene interactions using multifactor-dimensionality reduction," BMC Genomics, vol. 16, Art. no. 489, 2015.

Cheng-Hong Yang (M’00-SM’03) received an M.S. and Ph.D. in computer engineering from North Dakota State University in 1988 and 1992, respectively. He is currently a chair professor in the Department of Electronic Engineering at National Kaohsiung University of Science and Technology, Taiwan. He has authored/coauthored over 380 refereed publications and a number of book chapters. He is an Editorial Board Member of multiple international journals. His main areas of research are fuzzy control, evolutionary computation, bioinformatics, and data analysis and the applications of these methods. Prof. Yang is a Senior Member of the IEEE, fellow of the Institution of Engineering and Technology, and fellow of the American Biographical Institute.

Li-Yeh Chuang is a professor in the Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering at I-Shou University, Kaohsiung, Taiwan. She received an M.S. from the Department of Chemistry at the University of North Carolina in 1989 and a Ph.D. from the Department of Biochemistry at North Dakota State University in 1994. She has authored/coauthored over 300 refereed publications. Her main areas of research are bioinformatics, biochemistry, and genetic engineering.

Yu-Da Lin (M’17) received his M.S. and Ph.D. degrees in the Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Taiwan, in 2011 and 2015, respectively. He is currently a Postdoctoral fellow of the Department of Electronic Engineering at National Kaohsiung University of Science and Technology, Taiwan. He also is a software engineer and an adjunct assistant professor. His main research interests include artificial intelligence, biomedical informatics, bioinformatics and computational biology. He has authored/coauthored over 80 refereed publications. He is a member of the IEEE Tainan Section, IEEE Young Professionals, and IEEE Computational Intelligence Society Membership.
