Bioinformatics Research and Applications: 10th International Symposium, ISBRA 2014, Zhangjiajie, China, June 28-30, 2014, Proceedings. 1st Edition. Mitra Basu.
Mitra Basu
Yi Pan
Jianxin Wang (Eds.)
LNBI 8492
Bioinformatics
Research and Applications
10th International Symposium, ISBRA 2014
Zhangjiajie, China, June 28–30, 2014
Proceedings
Lecture Notes in Bioinformatics 8492
Volume Editors
Mitra Basu
Johns Hopkins University
Computer Science Department
Baltimore, MD 21218, USA
and National Science Foundation, CCF
Arlington, VA 22230, USA
E-mail: mbasu@nsf.gov
Yi Pan
Georgia State University
Department of Computer Science
Atlanta, GA 30303, USA
E-mail: yipan@gsu.edu
Jianxin Wang
Central South University
School of Information Science and Engineering
Changsha, 410083, China
E-mail: jxwang@mail.csu.edu.cn
Steering Chairs
Alex Zelikovsky Georgia State University, USA
Dan Gusfield University of California, Davis, USA
Ion Mandoiu University of Connecticut, USA
Marie-France Sagot Inria, France
Yi Pan Georgia State University, USA
Ying Xu University of Georgia, USA
General Chairs
Albert Zomaya University of Sydney, Australia
Ming Li University of Waterloo, Canada
Program Chairs
Mitra Basu Johns Hopkins University, National Science
Foundation, USA
Yi Pan Georgia State University, USA
Jianxin Wang Central South University, China
Publication Chair
Min Li Central South University, China
Program Committee
Srinivas Aluru IIT Bombay/Iowa State University, India/USA
Mitra Basu National Science Foundation, USA
Robert Beiko Dalhousie University, Canada
Paola Bonizzoni Università di Milano-Bicocca, Italy
Zhipeng Cai Georgia State University, USA
Doina Caragea Kansas State University, USA
Tien-Hao Chang National Cheng Kung University, Taiwan
Ovidiu Daescu University of Texas at Dallas, USA
Bhaskar Dasgupta University of Illinois at Chicago, USA
Amitava Datta University of Western Australia, Australia
Oliver Eulenstein Iowa State University, USA
Guillaume Fertin LINA, UMR CNRS 6241, University of Nantes,
France
Lin Gao Xidian University, China
Katia Guimaraes UFPE, Brazil
Jiong Guo Universität des Saarlandes, Germany
Jieyue He Southeast University, China
Matthew He Nova Southeastern University, USA
Steffen Heber NCSU, USA
Wei Hu Houghton College, USA
Xiaohua Tony Hu Drexel University, USA
Jinling Huang East Carolina University, USA
Lars Kaderali University of Technology Dresden, Germany
Iyad Kanj DePaul University, USA
Ming-Yang Kao Northwestern University, USA
Yury Khudyakov Centers for Disease Control and Prevention, USA
Wooyoung Kim University of Washington Bothell, USA
Danny Krizanc Wesleyan University, USA
Guojun Li Shandong University, China
Jing Li Case Western Reserve University, USA
Min Li Central South University, China
Shuaicheng Li City University of Hong Kong, SAR China
Yanchun Liang Jilin University, China
Zhiyong Liu Institute of Computing Technology, Chinese Academy of Sciences, China
Ion Mandoiu University of Connecticut, USA
Fenglou Mao University of Georgia, USA
Osamu Maruyama Kyushu University, Japan
Giri Narasimhan Florida International University, USA
Yi Pan Georgia State University, USA
Additional Reviewers
Alonso Alemany, Daniel
Anghelache, Andreea
Beissbarth, Tim
Beißer, Daniela
Bingbo, Wang
Campo, David S.
Caravagna, Giulio
Cardona, Gabriel
Cho, Dongyeon
Chowdhury, Salim
Cliquet, Freddy
Curé, Olivier
Dao, Phuong
Dondi, Riccardo
Du, Xiangjun
Falca, Elena-Bianca
Guo, Xingli
Hayes, Matthew
Herrmann, Carl
Hoinka, Jan
Numerous theories and hypotheses have been proposed over the past 100 years regarding what drives a cancer to initiate, progress, and metastasize, including (1) the now popular view of cancer as a result of genomic mutations; (2) cancer being induced by viral or bacterial infection; and (3) cancer resulting from malfunctioning mitochondria. I will present our recent work on (i) key drivers of cancer initiation and (ii) drivers of post-metastatic cancer's explosive growth, based on comparative and integrative analyses of very large-scale, multi-type omic data collected on cancer tissues. On (i), our starting point is a speculation made by Nobel Laureate Otto Warburg in the 1960s: "Cancer ... has countless secondary causes. But ... there is only one prime cause, [which] is the replacement of the respiration of oxygen in normal body cells by a fermentation of sugar." While increasingly more cancer researchers tend to agree with Warburg, the link between the observed reprogramming of energy metabolism and cell proliferation is unknown. Through statistical analyses of omic data from different types of cancer, we have recently discovered that hyaluronic acid may be the missing link, and we have developed a detailed model linking energy-metabolism reprogramming to cell proliferation. On (ii), metastatic cancer is responsible for 90% of cancer-related mortalities and has been considered a terminal illness, mainly based on past experience with the largely unsuccessful treatment of metastatic cancers using drugs designed for primary cancer. We have recently discovered that, fundamentally unlike primary cancer, metastatic cancer is predominantly driven by a different force, namely oxidized cholesterols and their steroidogenic metabolites. A detailed model is proposed regarding (a) why metastatic cancer tends to have increased cholesterol influx and (b) how oxidized cholesterol products drive metastatic cancers. Both studies suggest fundamentally different ways to view and possibly treat cancer.
Table of Contents
Full Papers
Predicting Disease Risks Using Feature Selection Based on Random
Forest and Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Jing Yang, Dengju Yao, Xiaojuan Zhan, and Xiaorong Zhan
Abstracts
PNImodeler: Web Server for Inferring Protein Binding Nucleotides
from Sequence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Jinyong Im, Narankhuu Tuvshinjargal, Byungkyu Park,
Wook Lee, and Kyungsook Han
1 Introduction
Disease risk prediction is an important issue in biomedicine and bioinformatics. High-dimensional and redundant features in medical and biological data have created an urgent need for feature selection techniques [1]. In general, feature selection algorithms can be divided into filter methods and wrapper methods according to the adopted feature selection strategy [2]. Filter methods are independent of the machine learning algorithm; they can quickly remove noisy features and narrow the search range for the optimal feature subset, but they do not guarantee finding a small optimized feature subset. Conversely, wrapper methods use the selected feature subset directly to train classifiers during feature selection and evaluate the quality of feature subsets according to the performance of the classifier on a test set. Wrapper methods are computationally less efficient than filter methods, but they can yield smaller optimal feature subsets [3].
M. Basu, Y. Pan, and J. Wang (Eds.): ISBRA 2014, LNBI 8492, pp. 1–11, 2014.
© Springer International Publishing Switzerland 2014
Random forest (henceforth RF) [4] is a popular ensemble machine learning algorithm that provides a unique combination of prediction accuracy and model interpretability among popular machine learning methods [1]. RF uses bootstrap sampling [16] to draw samples randomly, with replacement, from the original data and trains a decision tree on each bootstrap sample. During node splitting in each tree, the splitting attribute is randomly selected from a subset of the features [5, 6, 7]. Finally, the class of a new sample is decided by voting among the multiple decision trees. RF has been widely used in classification, prediction, variable importance estimation, feature selection, and outlier detection [8, 9, 10, 11].
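The bootstrap-and-vote procedure just described can be sketched briefly. The sketch below uses scikit-learn (an assumption for illustration; the experiments later in this paper use the R randomForest package), with synthetic data standing in for a real dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset: 200 samples, 20 features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Each tree is fit on a bootstrap sample of the data; at every node
# split only a random subset of features (max_features) is considered;
# the class of a new sample is decided by voting over the trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)

# Per-feature variable importance scores, usable for feature selection.
importances = rf.feature_importances_
print(importances.argsort()[::-1][:5])  # indices of the top 5 features
```

The `feature_importances_` vector is exactly the kind of per-feature score that the selection methods discussed next rank and threshold.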
Especially in biomedicine and bioinformatics, random forest is favored because it can efficiently identify complex interactions among multiple predictors. Díaz-Uriarte et al. [12] investigated the use of random forest for the classification of microarray data and proposed a random-forest-based method for gene selection in classification problems. Their experimental results showed that random forest has performance comparable to other classification methods, including DLDA, KNN, and SVM, and that the proposed gene selection procedure yielded very small sets of genes while preserving predictive accuracy. However, this approach chooses the number of genes to retain arbitrarily, which is not appropriate if the objective is to obtain the smallest possible set of genes that still allows good predictive performance. Pang et al. [13] developed an iterative feature elimination method based on random survival forests to identify a set of prognostic genes; it is essentially an extension of Díaz-Uriarte's method to survival outcome prediction. This approach orders the genes by variable importance in descending order and removes the bottom 20 percent by default (the same default chosen by Díaz-Uriarte). Dessì et al. [14] proposed a pre-filtering feature selection method based on random forests for microarray data classification. They examined random forests from an experimental perspective and evaluated the effects of a filtering process that precedes the actual construction of the random forest. Within this approach, however, a critical issue is the choice of a threshold value marking the cut-off point in the list of ranked features. Anaissi et al. [15] introduced a balanced iterative random forest (BIRF) algorithm to select the genes most relevant to a disease from imbalanced high-throughput gene expression microarray data. Their experimental results showed that BIRF outperformed state-of-the-art methods such as Support Vector Machine-Recursive Feature Elimination (SVM-RFE), multi-class SVM-RFE (MSVM-RFE), Random Forest (RF), and Naive Bayes (NB) classifiers, especially on imbalanced datasets. However, BIRF has the limitation that the random forest cannot capture global correlations, because the dataset is split.
In all the methods mentioned above, random forest was used directly as the classifier to evaluate the quality of feature subsets during feature selection, but the applicability of random forest in this role, and its comparison with other classification algorithms, had not been systematically studied. This paper studies the performance of random forest used as the feature-subset evaluation function and compares it with the k-nearest neighbor (KNN) and support vector machine (SVM) classification algorithms. Experimental results on an acute lymphoblastic leukemia (ALL) dataset show that SVM is similar to RF but superior to KNN in classification performance when used as the feature-subset evaluation function. On this basis, we propose a new feature selection method based on random forest, called RF&SVMFS, a wrapper feature selector that combines random forest with a support vector machine. RF&SVMFS also combines sequential backward and sequential forward search. The base learning algorithm is random forest, which computes variable importance for each feature and determines which features are removed or retained at each step; the SVM is used to evaluate the quality of feature subsets. Feature selection starts with the entire set of features in the dataset. At every iteration, two feature subsets are generated. One subset removes both the least important features and the most important feature; it is used to train the random forest and to compute feature importance for the next round of selection. The other subset removes only the least important features while retaining the most important feature; it serves as the candidate optimal subset used to train the SVM classifier. Experimental results on 11 UCI datasets, a real clinical dataset, and a gene expression dataset show that the proposed algorithm can generate smaller feature subsets while improving classification accuracy.
2 Method
In this paper, we proposed a new feature selection method called RF&SVMFS based
on random forest and support vector machine, which combined sequence backward
searching approach and sequence forward searching approach. In the RF&SVMFS,
RF was run firstly to compute importance score for each feature. Then, all features
were sorted based on the importance scores. In order to ensure the stability and
reliability of the result, RF was run 5 times and the average of 5 times running result
was used as the basis of sorting features in every iteration. Next, the generalized
sequence backward searching strategy and sequence forward searching strategy was
used to generate feature subset. In detail, L most unimportant features (with minimal
importance score) and the most important feature were removed from original dataset,
and a new dataset was generated. Meanwhile, another dataset was generated by
removing only the L most unimportant features. The first dataset was used to train
random forest and to compute variable importance for next iteration. The second
dataset was used to train support vector machine and to evaluate the quality of the
feature subset. In order to ensure the stability of results, 10-fold cross-validation was
used while calculating the classification accuracy. The above process was repeated
4 J. Yang et al.
iteratively until the number of features in the feature set meeting the requirements
(only 5 features are left in the feature set in this research). Finally, feature set with
highest classification accuracy of SVM in all iterations was selected as the optimal
features set, and the variable importance scores are calculated for each feature at the
same time. The proposed algorithm is designed as follows:
Input:  the original dataset S
        L, the step size of the generalized sequential backward search
Output: highest classification accuracy MaxAccuracy
        optimal feature subset OptFeatureSet
        importance scores of features FeatureScore
Steps:
1. Initialization: MaxAccuracy <- 0
                   OptFeatureSet <- S
                   TmpFeatureSet <- S
                   GloOptFeatureSet <- NULL
2. while (the number of features in OptFeatureSet > 5)
   2.1 Run RF 5 times on TmpFeatureSet; compute the average
       variable importance score of each feature and store it
       as the vector RFAverageScore;
   2.2 Order the features in TmpFeatureSet according to
       RFAverageScore and save the result as SortedFeatureSet;
   2.3 According to SortedFeatureSet, remove the L features with
       the lowest variable importance scores from OptFeatureSet,
       yielding a new dataset OptFeatureSet;
   2.4 According to SortedFeatureSet, remove the L features with
       the lowest variable importance scores and the most
       important feature from TmpFeatureSet, yielding a new
       dataset TmpFeatureSet;
   2.5 Randomly divide OptFeatureSet into 10 equal parts; train
       an SVM classifier on nine parts and compute its accuracy
       on the remaining part; repeat this 10 times, compute the
       average classification accuracy, and save it as
       SVMAverageAccuracy;
   2.6 if (MaxAccuracy <= SVMAverageAccuracy)
       MaxAccuracy <- SVMAverageAccuracy
       GloOptFeatureSet <- features in OptFeatureSet
   end while
3. print(MaxAccuracy)
   print(GloOptFeatureSet)
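The pseudocode above can be sketched in executable form. This is an illustrative reconstruction, not the authors' implementation: the scikit-learn estimators, their hyperparameters, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def rf_svm_fs(X, y, L=1, min_features=5, n_rf_runs=5):
    # Two index sets: `opt` keeps the most important feature (it is
    # evaluated by the SVM), `tmp` drops it as well (it is used only
    # to retrain the RF for the next importance ranking).
    opt = list(range(X.shape[1]))
    tmp = list(opt)
    best_acc, best_set = 0.0, list(opt)
    while len(opt) > min_features and len(tmp) > L:
        # Average the RF importance scores over several runs for stability.
        scores = np.zeros(len(tmp))
        for seed in range(n_rf_runs):
            rf = RandomForestClassifier(n_estimators=50, random_state=seed)
            scores += rf.fit(X[:, tmp], y).feature_importances_
        order = np.argsort(scores)            # ascending importance
        low = {tmp[i] for i in order[:L]}     # L least important features
        top = tmp[order[-1]]                  # single most important feature
        opt = [f for f in opt if f not in low]
        tmp = [f for f in tmp if f not in low and f != top]
        # Evaluate the candidate subset with 10-fold cross-validated SVM.
        acc = cross_val_score(SVC(), X[:, opt], y, cv=10).mean()
        if acc >= best_acc:
            best_acc, best_set = acc, list(opt)
    return best_acc, best_set

X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=5, random_state=0)
acc, subset = rf_svm_fs(X, y, L=1)
print(round(acc, 3), len(subset))
```

Note how the two shrinking index sets mirror steps 2.3 and 2.4: `opt` is what the SVM scores, while `tmp` additionally excludes the current top feature so it cannot dominate the next RF ranking.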
To validate the effectiveness of the proposed feature selection algorithm, this paper uses 11 UCI datasets frequently used in the literature [17], a real diabetes clinical dataset (DiabetesDB), and an acute lymphoblastic leukemia (ALL) dataset [18]. Detailed information about these datasets is shown in Table 1. The dimensions of the UCI datasets range from 6 to 61, and the data types include discrete, continuous, and mixed discrete-continuous data. The diabetes clinical data were collected from a Level-three hospital in Heilongjiang Province, China, during 2006-2012, and include 955 records of patients with type II diabetes. Each original record has 72 features. On the advice of endocrine experts, obviously irrelevant and redundant features were removed, and the final dataset includes 46 classification variables and one target variable. The ALL dataset consists of expression values for 12625 genes from 128 individuals with acute lymphoblastic leukemia, of which 33 are T-cell and 95 are B-cell ALL samples. The data have been jointly normalized (using RMA), and it is this jointly normalized version that is available, presented as an exprSet object. In this paper we focus on the analysis of B-cell acute lymphoblastic leukemia, with gene mutation as our target class, so we selected 94 B-cell ALL samples, with 12625 genes and a target variable representing the type of acute lymphoblastic leukemia, as our dataset, denoted ALLb.
Experimental results on the UCI datasets are shown in Tables 2 and 3, where "Feature" denotes the number of features in the optimal feature subset, "Acc" denotes the classification accuracy on the test set, and "NA" indicates that no experimental result is reported in the referenced literature. Here, the L value in RF&SVMFS is set to 1. As Tables 2 and 3 show, on datasets 1, 3, and 5, RF&SVMFS selected an optimal feature subset smaller than or equal in size to those of CBFS [19] and AMGA [20], while significantly improving classification accuracy. On datasets 7, 8, 9, 10, and 11, the number of features selected by RF&SVMFS is similar to that of the other algorithms, but the classification accuracy is clearly higher than that of ACAHFS [21] and GA-cull [21]. Moreover, the higher the dimensionality of the dataset, the better RF&SVMFS performs. In short, the proposed algorithm outperformed the existing methods in the literature with respect to both the quality of the feature subset and classification accuracy.
original RF and SVM algorithm. In addition, the random-forest-based method can provide a variable importance score for each feature: the larger the importance score, the greater the impact of the feature on the target variable. This can help medical experts understand the results of data mining.
We also studied the risk factors of peripheral arterial disease. The top 10 risk factors are shown in Figure 2. As shown, age is the primary risk factor for peripheral arterial disease, and smoking history ranks second; these results are consistent with previous findings. ALT is the third risk factor. Recent studies have shown that ALT is a marker of liver damage, which is related to atherosclerosis, so our findings again agree with previous research. We also made a new discovery: INS30 and INS60 are important risk factors for peripheral arterial disease, with similar impacts. According to medical knowledge, insulin is a potent growth factor that can increase collagen synthesis and stimulate vascular smooth muscle cell proliferation; this is part of the process of atherosclerosis, so insulin levels reflect lower-limb atherosclerosis to some extent. Overall, the results of this study are highly consistent with previous studies, and the proposed RFVIMFS algorithm is reasonable.
In this section, we study the capability of different classifiers used as the feature-subset evaluation function and compare the performance of the proposed RF&SVMFS with some popular feature selection methods on the acute lymphoblastic leukemia dataset (ALLb) [18]. ALLb is a microarray gene expression dataset comprising 94 B-cell acute lymphoblastic leukemia samples with 12625 genes. The target variable has four categories: ALL/AF4, BCR/ABL, E2A/PBX1, and NEG. Because the dimensionality of this dataset is very large, feature selection was performed before the prediction model was trained. First, we used the interquartile range (IQR) to filter genes based on the distribution of gene expression levels: all genes whose variability is less than 1/5 of the overall IQR were eliminated, reducing the number of genes from 12625 to 3970. Next, we applied analysis of variance (ANOVA), filter feature selection based on random forest (FFSRF), filter feature selection based on combined feature clustering (FFSFC) [18], and our proposed method, respectively. Because KNN, randomForest, and SVM are each used here as the feature-evaluation function, we denote the proposed feature selection method WFSRF, to distinguish it from FFSRF. As a result, ANOVA selected 752 genes; FFSRF and FFSFC each selected the top 30 genes; and WFSRF selected the top 50 and top 20 genes. Finally, the KNN, randomForest, and SVM algorithms were run on these feature subsets, and the classification accuracy of each case is shown in Table 5. As Table 5 shows, the proposed WFSRF method is overall superior to ANOVA, FFSRF, and FFSFC with respect to classification accuracy. When KNN, randomForest, and SVM are used as the feature-subset evaluation function, randomForest and SVM are evenly matched, and both are superior to KNN. This supports the validity of our proposed method.
Feature selection    KNN     randomForest   SVM
ANOVA.752            0.8298  0.7979         0.8617
FFSRF.30             0.8830  0.8723         0.8617
FFSFC.30             0.8617  0.8298         0.8511
WFSRF.20             0.8421  0.8947         0.8947
WFSRF.50             0.8947  0.9475         0.9475
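The IQR pre-filter described above (dropping genes whose variability falls below 1/5 of the overall IQR) can be sketched as follows. The pooled interpretation of "overall IQR" and the synthetic expression matrix are assumptions for illustration; the paper performs this step in R:

```python
import numpy as np

def iqr_filter(expr, fraction=0.2):
    """Drop genes (columns) whose IQR is below `fraction` of the
    IQR of all expression values pooled together (an assumption)."""
    q75, q25 = np.percentile(expr, [75, 25], axis=0)
    gene_iqr = q75 - q25                         # per-gene variability
    overall_q75, overall_q25 = np.percentile(expr, [75, 25])
    keep = gene_iqr >= fraction * (overall_q75 - overall_q25)
    return expr[:, keep], keep

# Synthetic stand-in: 94 samples x 500 genes, with the first 100 genes
# nearly constant so the filter should remove them.
rng = np.random.default_rng(0)
expr = rng.normal(size=(94, 500))
expr[:, :100] *= 0.01
filtered, keep = iqr_filter(expr)
print(filtered.shape)
```

Applied to an ALLb-sized matrix, a threshold of this kind is what reduces the gene count before any wrapper search is attempted.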
When the dimension of the dataset is small, deleting one feature at a time can effectively eliminate redundant and irrelevant features, and a larger L value will cause some important features to be removed together with irrelevant ones. However, when the dimensions of the dataset are very large, as with ALLb, a larger L value makes it possible to remove redundant and irrelevant features quickly and to improve classification performance, as shown for the Sonar dataset in this paper. As a rule of thumb, when the dimensionality of the dataset is very large, L should be set to √N, where N is the number of features in the dataset. For the ALLb dataset in this paper, we adopted a combined scheme: when the dimension of the dataset is larger than 50, we set L to 50; when the dimension is smaller than 50, we set L to 5.
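The rule of thumb above can be written as a small helper. This is an illustrative sketch, not code from the paper:

```python
import math

def choose_L(n_features, use_sqrt=False, high_dim=50):
    """Step size L for the generalized sequential backward search."""
    if use_sqrt:
        # Rule of thumb for very high-dimensional data: L = sqrt(N).
        return round(math.sqrt(n_features))
    # Combined scheme used for ALLb in the text.
    return 50 if n_features > high_dim else 5

print(choose_L(12625, use_sqrt=True))  # round(sqrt(12625)) = 112
print(choose_L(12625))                 # 50
print(choose_L(40))                    # 5
```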
4 Conclusions
Due to the high-dimensional feature space and high feature redundancy of biomedicine and bioinformatics datasets, existing machine learning algorithms are often not competent for data mining tasks in these fields. The random forest algorithm has the capacity to analyze complex interactions among features and can provide variable importance scores, a convenient tool for feature selection. This paper proposed a new wrapper feature selection algorithm based on random forest variable importance measurement and support vector machine. The proposed method combines a generalized sequential backward search strategy with a sequential forward search strategy for feature selection. Experimental results show that the proposed feature selection algorithm finds the optimal feature subset and can effectively improve classification accuracy. At the same time, the algorithm provides variable importance scores for each feature in the optimal feature subset, enhancing the comprehensibility of the data mining results. In addition, we studied the capability of different classification algorithms used as the feature-subset evaluation function; experiments show that on the ALLb dataset, SVM is evenly matched with random forest and superior to KNN. Experimental validation and deeper research on more datasets is the next direction of this work.
References
1. Qi, Y.: Random Forest for Bioinformatics. In: Ensemble Machine Learning, pp. 307–323
(2012)
2. Inza, I., Larranaga, P., Blanco, R.: Filter versus wrapper gene selection approaches in
DNA microarray domains. Artificial Intelligence in Medicine 31(2), 91–103 (2008)
3. Tsymbal, A., Puuronen, S.: Ensemble feature selection with the simple Bayesian
classification. Information Fusion 4(2), 87–100 (2010)
4. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
5. Bishop, C.M.: Bootstrap. Pattern Recognition and Machine Learning. Springer, Singapore
(2006)
6. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
7. Breiman, L., Friedman, J.H., Olshen, R.A., et al.: Classification and Regression Trees. Chapman & Hall (1993)
8. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable
importance for random forests. BMC Bioinformatics 9, 307 (2008)
9. Verikas, A., Gelzinis, A., Bacauskiene, M.: Mining data with random forests: A survey
and results of new tests. Pattern Recognition 44, 330–349 (2011)
10. Liu, H., Li, J.: A comparative study on feature selection and classification methods using
gene expression profiles and proteomic patterns. Genome Informatics 13, 51–60 (2012)
11. Wang, A., Wan, G., Cheng, Z., et al.: Incremental Learning Extremely Random Forest
Classifier for Online Learning. Journal of Software 22(9), 2059–2074 (2011)
12. Díaz-Uriarte, R., de Andrés, S.A.: Gene selection and classification of microarray data
using random forest. BMC Bioinformatics 7, 3 (2006)
13. Pang, H., George, S.L., Hui, K., Tong, T.: Gene Selection Using Iterative Feature
Elimination Random Forests for Survival Outcomes. IEEE/ACM Transactions on
Computational Biology and Bioinformatics 9(5), 1422–1431 (2012)
14. Dessì, N., Milia, G., Pes, B.: Pre-filtering Features in Random Forests for Microarray Data
Classification. In: New Frontiers in Mining Complex Patterns (NFMCP 2012). vol. 60
(2012)
15. Anaissi, A., Kennedy, P.J., Goyal, M., Catchpoole, D.R.: A balanced iterative random
forest for gene selection from microarray data. BMC Bioinformatics 14, 261 (2013)
16. Yi, C., Li, J., Zhu, C.: A kind of feature selection based on classification accuracy of SVM.
Journal of Shandong University 45(7), 119–124 (2010)
17. UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/
18. Torgo, L.: Data Mining with R: Learning with Case Studies. Chapman & Hall/CRC (2010)
19. Jiang, S., Zheng, Q., Zhang, Q.: Clustering-Based Feature Selection. Acta Electronica
Sinica 36(12), 157–160 (2008)
20. Liu, Y., Wang, G., Zhu, X.: Feature selection based on adaptive multi-population genetic
algorithm. Journal of Jilin University 41(6), 1690–1693 (2011)
21. Zhang, J., He, Z., Wang, J.: Hybrid Feature Selection Algorithm Based on Adaptive Ant
Colony Algorithm. Journal of System Simulation 21(6), 1605–1614 (2009)
Phylogenetic Bias in the Likelihood Method
Caused by Missing Data Coupled with Among-Site Rate
Variation: An Analytical Approach
Xuhua Xia
1 Introduction
Many supermatrices have been compiled in recent years by concatenating sequences from many different genes [1-4]. Such concatenated genes typically have few shared sites among all included species. For example, while Regier et al. [3] claimed to have 41 kilobases of aligned DNA sequences, the actual number of sites that are completely unambiguous among all 80 species amounts to only 705. Some genes are completely missing in nearly half of the 80 species. While the potential problems with such "?"-laden supermatrices have been suspected before [5], the specific biases associated with such missing data have not been well studied, especially not in the likelihood framework, which has been the gold standard in phylogenetic reconstruction.
Previous studies [6-11] attempted to identify bias associated with missing data either by sequence simulation or by selectively eliminating sites in a real sequence alignment. While most publications suggest that phylogenetic reconstruction is not sensitive to missing data, or that the benefit of including taxa with missing data
M. Basu, Y. Pan, and J. Wang (Eds.): ISBRA 2014, LNBI 8492, pp. 12–23, 2014.
© Springer International Publishing Switzerland 2014
outweighs the cost of their exclusion [6, 8-11], a recent study [7] suggested a significant bias associated with missing data coupled with among-site rate variation. However, such simulation-based findings often cannot pinpoint where the bias arises and have consequently been challenged by others on both empirical [6, 9, 11] and theoretical [9] grounds, although these latter publications did not explicitly test the claimed bias [7] associated with among-site rate variation. Roure et al. [9] noted that if sequences contain similar phylogenetic information, then phylogenetic reconstruction is not sensitive to missing data; however, based on extensive data analysis, they also noted that heterogeneous data could lead to phylogenetic bias.
Here I demonstrate analytically the bias associated with missing data coupled with among-site rate variation. The pruning algorithm [12, 13, 14, pp. 253-255] is briefly outlined, in conjunction with the conventional handling of missing data in the likelihood method, so that the reader can verify the claimed bias introduced by missing data. I first illustrate the "bias" shown by Lemmon et al. [7] when branch lengths are not allowed to be zero, using both the JC69 [15] and F84 [16] models. Such a "bias" can easily be avoided by simply allowing branch lengths to be zero and should not be considered an estimation bias of the likelihood method. However, the bias due to missing data coupled with among-site rate variation [7] is real. This bias can either increase the tendency (and confidence) to group together OTUs (operational taxonomic units) that share the same stretches of missing sites, or act in the opposite direction. The results suggest that blindly concatenating sequence data to generate a supermatrix with many pieces of missing data will generate false confidence in phylogenetic resolution and should be avoided.
The likelihood approach features a convenient way of handling missing data, which is
best illustrated with the pruning algorithm. Suppose we have four OTUs with the
sequence data in Fig. 1a, with the last two sequences entirely missing
(represented by ‘?’). Obviously, we can estimate only the distance between S1 and S2,
not the evolutionary relationships involving OTUs S3 or S4. The likelihood for the
distance between S1 and S2, based on the JC69 model, is given by
L = [8!/(4!4!)] P_ii^4 P_ij^4    (1)
which, when maximized, leads to a distance of 0.8239592165.
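The exponents in Eq. (1) imply that 4 of the 8 aligned sites in Fig. 1a are identical between S1 and S2 and 4 differ. Under that assumption, the value 0.8239592165 can be checked against the closed-form JC69 estimator, and Eq. (1) can be evaluated directly to confirm where it peaks. A minimal sketch (function names are mine, not the paper's):

```python
import math

def jc69_distance(n_diff, n_total):
    """Closed-form JC69 maximum likelihood distance from the
    observed proportion of differing sites, p = n_diff / n_total."""
    p = n_diff / n_total
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def likelihood(d):
    """Eq. (1): multinomial coefficient 8!/(4!4!) times P_ii^4 * P_ij^4
    under JC69 at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_ii = 0.25 + 0.75 * e          # probability the same nucleotide is observed
    p_ij = 0.25 - 0.25 * e          # probability of one specific different nucleotide
    coeff = math.factorial(8) / (math.factorial(4) * math.factorial(4))
    return coeff * p_ii**4 * p_ij**4

d_hat = jc69_distance(4, 8)         # closed form: (3/4) ln 3
```

With p = 1/2 the closed form gives d = (3/4) ln 3 ≈ 0.8239592165, and `likelihood(d_hat)` exceeds the likelihood at nearby distances, confirming it as the maximizer of Eq. (1).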
Fig. 2 illustrates the computation of the likelihood by the pruning algorithm, given
the first site of the aligned nucleotide sequences (Fig. 1a) and topology T1 in Fig. 1. I
included the numerical illustration here to facilitate the verification of subsequent
claims that the maximum likelihood method does exhibit a true and identifiable bias
in phylogenetic reconstruction involving missing data coupled with among-site rate
variation.
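The conventional handling of missing data in the pruning algorithm can be sketched as follows: a tip with an observed nucleotide contributes a one-hot conditional likelihood vector, while a ‘?’ contributes a vector of ones, so an entirely missing sequence drops out of the computation. The sketch below is illustrative only, under JC69 with uniform base frequencies; the tree encoding and function names are mine, not the paper's:

```python
import math

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def jc69_p(d):
    """4x4 JC69 transition probability matrix for branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def leaf_vector(ch):
    """Tip conditional likelihoods; missing data ('?') scores 1 for
    every state, so it never penalizes any reconstruction."""
    if ch == "?":
        return [1.0] * 4
    v = [0.0] * 4
    v[BASES[ch]] = 1.0
    return v

def prune(node):
    """Post-order pruning. A node is ('leaf', char) or
    ('internal', [(child, branch_length), ...])."""
    kind, payload = node
    if kind == "leaf":
        return leaf_vector(payload)
    vec = [1.0] * 4
    for child, d in payload:
        p, cv = jc69_p(d), prune(child)
        for s in range(4):
            vec[s] *= sum(p[s][x] * cv[x] for x in range(4))
    return vec

def site_likelihood(tree):
    """Sum root conditional likelihoods over uniform base frequencies."""
    return sum(0.25 * v for v in prune(tree))

# One site of a 4-OTU alignment: S1 = A, S2 = A, S3 and S4 missing.
cherry12 = ("internal", [(("leaf", "A"), 0.1), (("leaf", "A"), 0.1)])
cherry34 = ("internal", [(("leaf", "?"), 0.1), (("leaf", "?"), 0.1)])
tree4 = ("internal", [(cherry12, 0.05), (cherry34, 0.05)])
```

With S3 and S4 entirely missing, `site_likelihood(tree4)` equals `site_likelihood(cherry12)`: the missing taxa contribute a factor of 1 at every node, which is why only the S1-S2 distance is estimable from the data in Fig. 1a.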
many instances the lack of teachers is greater in those
provinces which are most thickly populated and whose people
are most highly civilized. …
"While most of the small towns have one teacher of each sex,
in the larger towns and cities no adequate provision is made
for the increased teaching force necessary; so that places of
30,000 or 40,000 inhabitants are often no better off as
regards number of teachers than are other places in the same
province of but 1,500 or 2,000 souls. The hardship thus
involved for children desiring a primary education will be
better understood if one stops to consider the nature of the
Philippine 'pueblo,' which is really a township, often
containing within its limits a considerable number of distinct
and important villages or towns, from the most important of
which the township takes its name. The others, under distinct
names, are known as 'barrios,' or wards. It is often quite
impossible for small children to attend school at the
particular town which gives its name to the township on
account of their distance from it. …
2. Reading.
3. Writing.
8. Rules of deportment.
9. Vocal music.'
EDUCATION: Russia:
Student troubles in the universities.
EDUCATION: Tunis:
Schools under the French Protectorate.
----------EGYPT: Start--------
EGYPT:
Recent Archæological Explorations and their results.
Discovery of prehistoric remains.
Light on the first dynasties.
EGYPT: A. D. 1885-1896.
Abandonment of the Egyptian Sudan to the Dervishes.
Death of the Mahdi and reign of the Khalifa.
Beginning of a new Anglo-Egyptian movement for
the recovery of the Sudan.
The expedition to Dongola.
On the 21st of March, the Sirdar left Cairo for Assouan and
Wady Halfa, and various Egyptian battalions were hurried up
the river. Meantime, the forces already on the frontier had
moved forward and taken the advanced post of the Dervishes, at
Akasheh. From that point the Sirdar was ready to begin his
advance early in June, and did so with two columns, a River
Column and a Desert Column, the latter including a camel corps
and a squadron of infantry mounted on camels, besides cavalry,
horse artillery and Maxim guns. Ferket, on the east bank of
the Nile, 16 miles from Akasheh, was taken after hard fighting
on the 7th of June, many of the Dervishes refusing quarter and
resisting to the death. They lost, it was estimated, 1,000
killed and wounded, and 500 were taken prisoners. The Egyptian
loss was slight. The Dervishes fell back some fifty miles, and
the Sirdar halted at Suarda during three months, while the
railroad was pushed forward, steamers dragged up the cataracts
and stores concentrated, the army suffering greatly, meantime,
from an alarming epidemic of cholera and from exhausting
labors in a season of terrific heat. In the middle of
September the advance was resumed, and, on the 23d, Dongola
was reached. Seeing themselves outnumbered, the enemy there
retreated, and the town, or its ruins, was taken with only a
few shots from the steamers on the river. "As a consequence of
the fall of Dongola every Dervish fled for his life from the
province. The mounted men made off across the desert direct to
Omdurman, and the foot soldiers took the Nile route to Berber,
always being careful to keep out of range of the gunboats,
which were prevented by the Fourth Cataract from pursuing them
beyond Merawi."
C. Hoyle,
The Egyptian Campaigns, new and revised edition,
to December, 1899, chapters 70-71.
EGYPT: A. D. 1895.
New anti-slavery law.
EGYPT: A. D. 1897.
Italian evacuation of Kassala, in the eastern Sudan.
EGYPT: A. D. 1897-1898.
The final campaigns of the Anglo-Egyptian conquest
of the Eastern Sudan.
Desperate battles of the Atbara and of Omdurman.
A. S. White,
The Expansion of Egypt,
pages 383-384
(New York: New Amsterdam Book Company).
"The honour of the fight [at Omdurman] must still go with the
men who died. Our men were perfect, but the dervishes were
superb—beyond perfection. It was their largest, best, and
bravest army that ever fought against us for Mahdism, and it
died worthily of the huge empire that Mahdism won and kept so
long. Their riflemen, mangled by every kind of death and
torment that man can devise, clung round the black flag and
the green, emptying their poor, rotten, homemade cartridges
dauntlessly. Their spearmen charged death at every minute
hopelessly. Their horsemen led each attack, riding into the
bullets till nothing was left but three horses trotting up to
our line, heads down, saying, 'For goodness' sake, let us in
out of this.' Not one rush, or two, or ten—but rush on rush,
company on company, never stopping, though all their view that
was not unshaken enemy was the bodies of the men who had
rushed before them. A dusky line got up and stormed forward:
it bent, broke up, fell apart, and disappeared. Before the
smoke had cleared, another line was bending and storming
forward in the same track.
"But the people! We could hardly see the place for the people.
We could hardly hear our own voices for their shrieks of
welcome. We could hardly move for their importunate greetings.
They tumbled over each other like ants from every mud heap, from
behind every dung-hill, from under every mat. … They had been
trying to kill us three hours before. But they salaamed, none
the less, and volleyed, 'Peace be with you' in our track. All
the miscellaneous tribes of Arabs whom Abdullahi's fears or
suspicions had congregated in his capital, all the blacks his
captains had gathered together into franker
slavery—indiscriminate, half-naked, grinning the grin of the
sycophant, they held out their hands and asked for backsheesh.
Yet more wonderful were the women. The multitude of women whom
concupiscence had harried from every recess of Africa and
mewed up in Baggara harems came out to salute their new
masters. There were at least three of them to every man. Black
women from Equatoria and almost white women from Egypt,
plum-skinned Arabs and a strange yellow type with square, bony
faces and tightly-ringleted black hair, … the whole city was a
huge harem, a museum of African races, a monstrosity of
African lust."
G. W. Steevens,
With Kitchener to Khartum,
chapters 32-34
(copyright, Dodd, Mead & Company, quoted with permission).
"Anyone who has not served in the Sudan cannot conceive the
state of devastation and misery to which that unfortunate
country has been brought under Dervish rule. Miles and miles
of formerly richly cultivated country lies waste; villages are
deserted; the population has disappeared. Thousands of women
are without homes or families. Years must elapse before the
Sudan can recover from the results of its abandonment to
Dervish tyranny; but it is to be hoped and may be confidently
expected, that in course of time, under just and upright
government, the Sudan may be restored to prosperity; and the
great battle of September will be remembered as having
established peace, without which prosperity would have been
impossible; and from which thousands of misguided and wretched
people will reap the benefits of civilization."
E. S. Wortley,
With the Sirdar
(Scribner's Magazine, January, 1899).
EGYPT: A. D. 1898.
The country and its people after 15 years
of British occupation.
"The British occupation has now lasted for over fifteen years.
During the first five, comparatively little was accomplished,
owing to the uncertain and provisional character of our
tenure. The work done has been done in the main in the last
ten years, and was only commenced in earnest when the British
authorities began to realise that, whether we liked it or not,
we had got to stay; and the Egyptians themselves came to the
conclusion that we intended to stay. … Under our occupation
Egypt has been rendered solvent and prosperous; taxes have
been largely reduced; her population has increased by nearly
50 per cent.; the value and the productiveness of her soil has
been greatly improved; a regular and permanent system of
irrigation has been introduced into Lower Egypt, and is now in
the course of introduction into Upper Egypt; trade and
industry have made giant strides; the use of the Kurbash
[bastinado] has been forbidden; the Corvée has been
suppressed; regularity in the collection of taxes has been
made the rule, and not the exception; wholesale corruption has
been abolished; the Fellaheen can now keep the money they
earn, and are better off than they were before; the landowners
are all richer owing to the fresh supply of water, with the
consequent rapid increase in the saleable price of land;
justice is administered with an approach to impartiality;
barbarous punishments have been mitigated, if not abolished;
and the extraordinary conversion of Cairo into a fair
semblance of a civilised European capital has been repeated on
a smaller scale in all the chief centres of Egypt. To put the
matter briefly, if our occupation were to cease to-morrow, we
should leave Egypt and the Egyptians far better off than they
were when our occupation commenced.
E. Dicey,
Egypt, 1881 to 1897
(Fortnightly Review, May, 1898).
Spectator,
April 15, 1899.