Mitra Basu
Yi Pan
Jianxin Wang (Eds.)
LNBI 8492

Bioinformatics
Research and Applications
10th International Symposium, ISBRA 2014
Zhangjiajie, China, June 28–30, 2014
Proceedings

Lecture Notes in Bioinformatics 8492

Subseries of Lecture Notes in Computer Science

LNBI Series Editors


Sorin Istrail
Brown University, Providence, RI, USA
Pavel Pevzner
University of California, San Diego, CA, USA
Michael Waterman
University of Southern California, Los Angeles, CA, USA

LNBI Editorial Board


Alberto Apostolico
Georgia Institute of Technology, Atlanta, GA, USA
Søren Brunak
Technical University of Denmark Kongens Lyngby, Denmark
Mikhail S. Gelfand
IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer
Max Planck Institute for Informatics, Saarbrücken, Germany
Satoru Miyano
University of Tokyo, Japan
Eugene Myers
Max Planck Institute of Molecular Cell Biology and Genetics
Dresden, Germany
Marie-France Sagot
Université Lyon 1, Villeurbanne, France
David Sankoff
University of Ottawa, Canada
Ron Shamir
Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed
Walter and Eliza Hall Institute of Medical Research
Melbourne, VIC, Australia
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong
University of Texas at Dallas, Richardson, TX, USA
Volume Editors
Mitra Basu
Johns Hopkins University
Computer Science Department
Baltimore, MD 21218, USA
and National Science Foundation, CCF
Arlington, VA 22230, USA
E-mail: mbasu@nsf.gov
Yi Pan
Georgia State University
Department of Computer Science
Atlanta, GA 30303, USA
E-mail: yipan@gsu.edu
Jianxin Wang
Central South University
School of Information Science and Engineering
Changsha, 410083, China
E-mail: jxwang@mail.csu.edu.cn

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-319-08170-0 e-ISBN 978-3-319-08171-7
DOI 10.1007/978-3-319-08171-7
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014940855

LNCS Sublibrary: SL 8 – Bioinformatics


© Springer International Publishing Switzerland 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The International Symposium on Bioinformatics Research and Applications (ISBRA)
provides a forum for the exchange of ideas and results among researchers,
developers, and practitioners working on all aspects of bioinformatics and
computational biology and their applications. Submissions presenting original
research are solicited in all areas of bioinformatics and computational biology,
including the development of experimental or commercial systems.

The 10th edition of the International Symposium on Bioinformatics Research
and Applications (ISBRA 2014) was held during June 28–30, 2014, in Zhangjiajie,
China. ISBRA 2014 received 119 full paper submissions. Every paper went
through a very rigorous review process. Each paper was reviewed by two to
five Program Committee members. After careful consideration, 33 papers were
accepted as regular papers in Track 1 (28.57% acceptance rate) and 32 papers
were accepted as one-page abstracts in Track 2 (26.05% acceptance rate).
Additionally, the symposium included three invited keynote talks by distinguished
speakers: Prof. John E. Hopcroft from Cornell University, USA, Prof. Ming Li
from University of Waterloo, Canada, and Prof. Ying Xu from University of
Georgia, USA.

We would like to thank the Program Committee members and external reviewers
for volunteering their time to review and discuss the symposium papers.
We would like to extend special thanks to the steering and general chairs of the
symposium for their leadership, and to the finance, publication, publicity, and
local organization chairs for their hard work in making ISBRA 2014 a successful
event. Last but not least, we would like to thank all authors for presenting their
work at the symposium.

June 2014
Mitra Basu
Yi Pan
Jianxin Wang
Symposium Organization

Steering Chairs
Alex Zelikovsky Georgia State University, USA
Dan Gusfield University of California, Davis, USA
Ion Mandoiu University of Connecticut, USA
Marie-France Sagot Inria, France
Yi Pan Georgia State University, USA
Ying Xu University of Georgia, USA

General Chairs
Albert Zomaya University of Sydney, Australia
Ming Li University of Waterloo, Canada

Program Chairs
Mitra Basu Johns Hopkins University, National Science
Foundation, USA
Yi Pan Georgia State University, USA
Jianxin Wang Central South University, China

Publication Chair
Min Li Central South University, China

Local Organizing Chairs


Jianxin Wang Central South University, China
Qingping Zhou Jishou University, China

Local Organizing Committee


Min Li Central South University, China
Mingxing Zeng Jishou University, China
Yu Sheng Central South University, China
Guihua Duan Central South University, China
Li Wang Jishou University, China
Yanping Yang Jishou University, China

Program Committee
Srinivas Aluru IIT Bombay/Iowa State University, India/USA
Mitra Basu National Science Foundation, USA
Robert Beiko Dalhousie University, Canada
Paola Bonizzoni Università di Milano-Bicocca, Italy
Zhipeng Cai Georgia State University, USA
Doina Caragea Kansas State University, USA
Tien-Hao Chang National Cheng Kung University
Ovidiu Daescu University of Texas at Dallas, USA
Bhaskar Dasgupta University of Illinois at Chicago, USA
Amitava Datta University of Western Australia
Oliver Eulenstein Iowa State University, USA
Guillaume Fertin LINA, UMR CNRS 6241, University of Nantes,
France
Lin Gao Xidian University, China
Katia Guimaraes UFPE, Brazil
Jiong Guo Universität des Saarlandes, Germany
Jieyue He Southeast University, China
Matthew He Nova Southeastern University, USA
Steffen Heber NCSU, USA
Wei Hu Houghton College, USA
Xiaohua Tony Hu Drexel University, USA
Jinling Huang East Carolina University, USA
Lars Kaderali University of Technology Dresden, Germany
Iyad Kanj DePaul University, USA
Ming-Yang Kao Northwestern University, USA
Yury Khudyakov Centers for Disease Control and Prevention,
USA
Wooyoung Kim University of Washington Bothell, USA
Danny Krizanc Wesleyan University, USA
Guojun Li Shandong University, China
Jing Li Case Western Reserve University, USA
Min Li Central South University, China
Shuaicheng Li City University of Hong Kong, SAR China
Yanchun Liang Jilin University, China
Zhiyong Liu Institute of Computing Technology, Chinese
Academy of Science
Ion Mandoiu University of Connecticut, USA
Fenglou Mao University of Georgia, USA
Osamu Maruyama Kyushu University, Japan
Giri Narasimhan Florida International University, USA
Yi Pan Georgia State University, USA
Andrei Paun University of Bucharest, Romania
Nadia Pisanti Università di Pisa, Italy
Teresa Przytycka NIH, USA
Sven Rahmann University of Duisburg-Essen, Germany
David Sankoff University of Ottawa, Canada
Daniel Schwartz University of Connecticut, USA
Russell Schwartz Carnegie Mellon University, USA
Joao Setubal University of São Paulo, Brazil
Xinghua Shi University of North Carolina at Charlotte, USA
Ileana Streinu Smith College, Northampton, USA
Zhengchang Su University of North Carolina at Charlotte, USA
Raj Sunderraman Georgia State University, USA
Wing-Kin Sung National University of Singapore
Sing-Hoi Sze Texas A&M University, USA
Gabriel Valiente Technical University of Catalonia, Spain
Stéphane Vialette Université Paris-Est LIGM UMR CNRS 8049,
France
Jianxin Wang Central South University, China
Li-San Wang University of Pennsylvania, USA
Lusheng Wang City University of Hong Kong, SAR China
Peng Wang Chinese Academy of Sciences
Fangxiang Wu University of Saskatchewan, Canada
Yufeng Wu University of Connecticut, USA
Minzhu Xie Hunan Normal University, China
Dechang Xu Harbin Institute of Technology, China
Zhenyu Xuan University of Texas at Dallas, USA
Zuguo Yu Queensland University of Technology, Australia
Alex Zelikovsky GSU, USA
Fa Zhang Institute of Computing Technology, China
Fengfeng Zhou Chinese Academy of Sciences
Leming Zhou University of Pittsburgh, USA

Additional Reviewers
Alonso Alemany, Daniel
Anghelache, Andreea
Beissbarth, Tim
Beißer, Daniela
Bingbo, Wang
Campo, David S.
Caravagna, Giulio
Cardona, Gabriel
Cho, Dongyeon
Chowdhury, Salim
Cliquet, Freddy
Curé, Olivier
Dao, Phuong
Dondi, Riccardo
Du, Xiangjun
Falca, Elena-Bianca
Guo, Xingli
Hayes, Matthew
Herrmann, Carl
Hoinka, Jan
Hwang, Yih-Chii
Jiang, Ruhua
Jiang, Xingpeng
Kim, Yooah
Knapp, Bettina
Komusiewicz, Christian
Köster, Johannes
Lara, James
Lauber, Chris
Li, Fan
Li, Xiaojie
Llabrés, Mercè
Luo, Junwei
Menconi, Giulia
Mirzaei, Sajad
Pan, Yi
Pei, Jingwen
Peng, Wei
Peng, Xiaoqing
Peterlongo, Pierre
Pirola, Yuri
Rizzi, Raffaella
Rocha, Jairo
Roman, Theodore
Sun, Peng Gang
Valentini, Giorgio
Venturini, Rossano
Wan, Xiaohua
Wang, Weichao
Wang, Wenhui
Wang, Yan
Wohlers, Inken
Wójtowicz, Damian
Zhang, Fa
Zheng, Qi
Zhong, Jiancheng
Zhou, Chan
Elucidation of Key Drivers and Facilitators
of Cancer Initiation and Metastasis:
A Data-Mining Approach
(Invited Talk)

Ying Xu, Ph.D.

Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics,
University of Georgia, USA, and College of Computer Science and Technology,
Jilin University, China

Numerous theories and hypotheses have been proposed in the past 100 years
regarding what drives a cancer to initiate, progress, and metastasize, including
(1) the now popular view of cancer as a result of genomic mutations; (2) cancer
being induced by viral or bacterial infection; and (3) cancer resulting from
malfunctioning mitochondria. I will present our recent work on (i) key drivers
of cancer initiation and (ii) drivers of the explosive growth of post-metastatic
cancer, based on comparative and integrative analyses of very large-scale omic
data of multiple types collected on cancer tissues. On (i), our starting point is
a speculation made by Nobel Laureate Otto Warburg in the 1960s: “Cancer . . . has
countless secondary causes. But . . . there is only one prime cause, [which] is
the replacement of respiration of oxygen in normal body cells by a fermentation
of sugar.” While an increasing number of cancer researchers tend to agree with
Warburg, the link between the observed reprogramming of energy metabolism and
cell proliferation is unknown. We have recently discovered, through statistical
analyses of omic data from different types of cancer, that hyaluronic acid may be
the missing link, and have developed a detailed model linking energy metabolism
reprogramming and cell proliferation. On (ii), metastatic cancer is responsible
for 90% of cancer-related mortalities and has been considered a terminal illness,
mainly based on past experience of largely unsuccessful treatment of metastatic
cancers using drugs designed for primary cancer. We have recently discovered
that, fundamentally unlike primary cancer, metastatic cancer is predominantly
driven by a different force, i.e., oxidized cholesterols and their steroidogenic
metabolites. A detailed model is proposed regarding (a) why metastatic cancer
tends to have increased cholesterol influx and (b) how oxidized cholesterol
products drive metastatic cancers. Both studies suggest fundamentally different
ways to view and possibly treat cancer.
Table of Contents

Full Papers
Predicting Disease Risks Using Feature Selection Based on Random
Forest and Support Vector Machine
    Jing Yang, Dengju Yao, Xiaojuan Zhan, and Xiaorong Zhan

Phylogenetic Bias in the Likelihood Method Caused by Missing Data
Coupled with Among-Site Rate Variation: An Analytical Approach
    Xuhua Xia

An Eigendecomposition Method for Protein Structure Alignment
    Satish Chandra Panigrahi and Asish Mukhopadhyay

Functional Interplay between Hemagglutinin and Neuraminidase of
Pandemic 2009 H1N1 from the Perspective of Virus Evolution
    Wei Hu

Predicting Protein Submitochondrial Locations Using a K-Nearest
Neighbors Method Based on the Bit-Score Weighted Euclidean Distance
    Jing Hu and Xianghe Yan

Algorithms Implemented for Cancer Gene Searching and Classifications
    Murad M. Al-Rajab and Joan Lu

Dysregulated microRNA Profile in HeLa Cell Lines Induced by Lupeol
    Xiyuan Lu, Cuihong Dai, Aiju Hou, Jie Cui, Dayou Cheng, and
    Dechang Xu

A Simulation for Proportional Biological Operational Mu-Circuit
    Dechang Xu, Zhipeng Cai, Ke Liu, Xiangmiao Zeng,
    Yujing Ouyang, Cuihong Dai, Aiju Hou, Dayou Cheng, and
    Jianzhong Li

Computational Prediction of Human Saliva-Secreted Proteins
    Ying Sun, Chunguang Zhou, Jiaxin Wang, Zhongbo Cao,
    Wei Du, and Yan Wang

A Parallel Scheme for Three-Dimensional Reconstruction in Large-Field
Electron Tomography
    Jingrong Zhang, Xiaohua Wan, Fa Zhang, Fei Ren,
    Xuan Wang, and Zhiyong Liu

An Improved Correlation Method Based on Rotation Invariant Feature
for Automatic Particle Selection
    Yu Chen, Fei Ren, Xiaohua Wan, Xuan Wang, and Fa Zhang

An Effective Algorithm for Peptide de novo Sequencing from Mixture
MS/MS Spectra
    Yi Liu, Bin Ma, Kaizhong Zhang, and Gilles Lajoie

Identifying Spurious Interactions in the Protein-Protein Interaction
Networks Using Local Similarity Preserving Embedding
    Lin Zhu, Zhu-Hong You, and De-Shuang Huang

Multiple RNA Interaction with Sub-optimal Solutions
    Syed Ali Ahmed and Saad Mneimneh

Application of Consensus String Matching in the Diagnosis of Allelic
Heterogeneity (Extended Abstract)
    Fatema Tuz Zohora and M. Sohel Rahman

Continuous Time Bayesian Networks for Gene Network Reconstruction:
A Comparative Study on Time Course Data
    Enzo Acerbi and Fabio Stella

Drug Target Identification Based on Structural Output Controllability
of Complex Networks
    Lin Wu, Yichao Shen, Min Li, and Fang-Xiang Wu

NovoGMET: De Novo Peptide Sequencing Using Graphs with Multiple
Edge Types (GMET) for ETD/ECD Spectra
    Yan Yan, Anthony J. Kusalik, and Fang-Xiang Wu

Duplication Cost Diameters
    Pawel Górecki, Jaroslaw Paszek, and Oliver Eulenstein

Computational Identification of De-Centric Genetic Regulatory
Relationships from Functional Genomic Data
    Zongliang Yue, Ping Wan, Zhan Xie, and Jake Y. Chen

Classification of Mutations by Functional Impact Type: Gain of
Function, Loss of Function, and Switch of Function
    Mingming Liu, Layne T. Watson, and Liqing Zhang

Network Analysis of Human Disease Comorbidity Patterns Based on
Large-Scale Data Mining
    Yang Chen and Rong Xu

Identification of Essential Proteins by Using Complexes and Interaction
Network
    Min Li, Yu Lu, Zhibei Niu, Fang-Xiang Wu, and Yi Pan

GenoScan: Genomic Scanner for Putative miRNA Precursors
    Benjamin Ulfenborg, Karin Klinga-Levan, and Björn Olsson

Searching SNP Combinations Related to Evolutionary Information of
Human Populations on HapMap Data
    Xiaojun Ding, Haihua Gu, Zhen Zhang, Min Li, and Fangxiang Wu

2D Pharmacophore Query Generation
    David Hoksza and Petr Škoda

Structure-Based Analysis of Protein Binding Pockets Using Von
Neumann Entropy
    Negin Forouzesh, Mohammad Reza Kazemi, and Ali Mohades

A New Mathematical Model for Inbreeding Depression in Large
Populations
    Shuhao Sun, Fima Klebaner, and Tianhai Tian

dSpliceType: A Multivariate Model for Detecting Various Types of
Differential Splicing Events Using RNA-Seq
    Nan Deng and Dongxiao Zhu

Conformational Transitions and Principal Geodesic Analysis on the
Positive Semidefinite Matrix Manifold
    Xiao-Bo Li and Forbes J. Burkowski

Joint Analysis of Functional and Phylogenetic Composition for Human
Microbiome Data
    Xingpeng Jiang, Xiaohua Hu, and Weiwei Xu

schematikon: Detailed Sequence-Structure Relationships from Mining a
Non-redundant Protein Structure Database (Extended Abstract)
    Boris Steipe and Bhooma Thiruv

Abstracts
PNImodeler: Web Server for Inferring Protein Binding Nucleotides
from Sequence Data
    Jinyong Im, Narankhuu Tuvshinjargal, Byungkyu Park,
    Wook Lee, and Kyungsook Han

A MCI Decision Support System Based on Ontology
    Xiaowei Zhang, Yang Zhou, Bin Hu, Jing Chen, and Xu Ma

Context Similarity Based Feature Selection Methods for Protein
Interaction Article Classification
    Yifei Chen, Yuxing Sun, and Ping Hou

Genome-Wide Analysis of Transcription Factor Binding Sites and Their
Characteristic DNA Structures
    Zhiming Dai, Dongliang Guo, Xianhua Dai, and Yuanyan Xiong

A Comparative Study of Disease Genes and Drug Targets in the
Human Protein Interactome
    Jingchun Sun, Kevin Zhu, W. Jim Zheng, and Hua Xu

Efficient Identification of Endogenous Mammalian Biochemical
Structures
    Mai A. Hamdalla, Reda A. Ammar, and Sanguthevar Rajasekaran

LncRNA2Function: A Comprehensive Resource for Functional
Investigation of Human lncRNAs Based on RNA-seq Data
    Qinghua Jiang, Rui Ma, Jixuan Wang, Xiaoliang Wu, Shuilin Jin,
    Jiajie Peng, Renjie Tan, Tianjiao Zhang, Yu Li, and Yadong Wang

Network Propagation Reveals Novel Genetic Features Predicting Drug
Response of Cancer Cell Lines
    Jiguang Wang, Judith Kribelbauer, and Raul Rabadan

Splice Site Prediction Using Support Vector Machine with Markov
Model and Codon Information
    Dan Wei, Yin Peng, Yanjie Wei, and Qingshan Jiang

Similarity Analysis of DNA Sequences Based on Frequent Patterns and
Entropy
    Xiaojing Xie, Jihong Guan, and Shuigeng Zhou

Exploiting Topic Modeling to Boost Metagenomic Sequences Binning
    Ruichang Zhang, Zhanzhan Cheng, Jihong Guan, and Shuigeng Zhou

Network-Based Method for Identifying Overlapping Mutated Driver
Pathways in Cancer
    Hao Wu, Lin Gao, Feng Li, Fei Song, and Xiaofei Yang

Completing a Bacterial Genome with in silico and Wet Lab
Approaches
    Rutika Puranik, Jacob Werner, Guangri Quan, Rong Zhou, and
    Zhaohui Xu

Screening Ingredients from Herbs against Pregnane X Receptor in the
Study of Inductive Herb-Drug Interactions: Combining Pharmacophore
and Docking-Based Rank Aggregation
    Zhijie Cui, Hong Kang, Kailin Tang, Qi Liu, Zhiwei Cao, and
    Ruixin Zhu

Improving Multiple Sequence Alignment by Using Better Guide
Trees
    Qing Zhan, Yongtao Ye, Tak-Wah Lam, Siu-Ming Yiu,
    Hing-Fung Ting, and Yadong Wang

A Markov Clustering Based Link Clustering Method for Overlapping
Module Identification in Yeast Protein-Protein Interaction Networks
    Yan Wang, Guishen Wang, Di Meng, Lan Huang,
    Enrico Blanzieri, and Juan Cui

Protein Function Prediction: A Global Prediction Method with
Multiple Data Sources
    Jun Meng, Xin Zhang, and Yushi Luan

A microRNA-Gene Network in Ovarian Cancer from Genome-Wide
QTL Analysis
    Andrew Quitadamo, Frederick Lin, Lu Tian, and Xinghua Shi

K-Profiles Nonlinear Clustering
    Kai Wang and Tianwei Yu

Estrogen Induced RNA Polymerase II Stalling in Breast Cancer Cell
Line MCF7
    Zhi Han, Lu Tian, Jie Zhang, Tim Huang, Raghu Machiraju, and
    Kun Huang

A Knowledge-Driven Approach in Constructing a Large-Scale
Drug-Side Effect Relationship Knowledge Base for Computational Drug
Discovery
    Rong Xu and QuanQiu Wang

Systems Biology Approach to Understand Seed Composition
    Ling Li, Wenxu Zhou, Manhoi Hur, Joon-Yong Lee,
    Nick Ransom, Cumhur Yusuf Demirkale, Zhihong Song,
    Dan Nettleton, Mark Westgate, Vidya Iyer, Jackie Shanks,
    Eve Syrkin Wurtele, and Basil J. Nikolau

Prediction of the Cooperative cis-regulatory Elements for Broadly
Expressed Neuronal Genes in Caenorhabditis Elegans
    Chen Xu and Zhengchang Su

Improving the Mapping of the Smith-Waterman Sequence Database
Search Algorithm onto CUDA GPUs
    Chao-Chin Wu, Liang-Tsung Huang, Lien-Fu Lai, and Yun-Ju Li

Isomorphism and Similarity for 2-Generation Pedigrees
    Haitao Jiang, Guohui Lin, Weitian Tong, Daming Zhu, and
    Binhai Zhu

VFP: A Visual Tool for Predicting Gene-Fusion Base on Analyzing
Single-end RNA-Sequence
    Ye Yang and Juan Liu

A Novel Method for Identifying Essential Proteins from Active PPI
Networks
    Qianghua Xiao, Xiaoqing Peng, Fangxiang Wu, and Min Li

RAUR: Re-alignment of Unmapped Reads with Base Quality Score
    Xiaoqing Peng, Zhen Zhang, Qianghua Xiao, and Min Li

PIGS: Improved Estimates of Identity-by-Descent Probabilities by
Probabilistic IBD Graph Sampling
    Danny S. Park, Yael Baran, Farhad Hormozdiari, and Noah Zaitlen

Clustering PPI Data through Improved Synchronization-Based
Hierarchical Clustering Method
    Xiujuan Lei, Chao Ying, Fang-Xiang Wu, and Jin Xu

Order Decay in Transcription Regulation in Type 1 Diabetes
    Shouguo Gao, Shuang Jia, Martin J. Hessner, and Xujing Wang

Simulated Regression Algorithm for Transcriptome Quantification
    Adrian Caciula, Olga Glebova, Alexander Artyomenko,
    Serghei Mangul, James Lindsay, Ion I. Măndoiu, and Alex Zelikovsky

Author Index


Predicting Disease Risks Using Feature Selection
Based on Random Forest and Support Vector Machine

Jing Yang1, Dengju Yao1,2, Xiaojuan Zhan3, and Xiaorong Zhan4

1 College of Computer Science and Technology, Harbin Engineering University, Harbin, China
yangjing@hrbeu.edu.cn
2 School of Software, Harbin University of Science and Technology, Harbin, China
ydkvictory@163.com
3 College of Computer Science and Technology,
Heilongjiang Institute of Technology, Harbin, China
xiaojuanzhan@gmail.com
4 Department of Endocrinology, First Affiliated Hospital,
Harbin Medical University, Harbin, China
xiaorongzhan@sina.com

Abstract. Disease risk prediction is an important task in biomedicine and
bioinformatics. To address the problems of high-dimensional feature spaces and
high feature redundancy, and to improve the intelligibility of data mining
results, a new wrapper method for feature selection based on random forest
variable importance measures and a support vector machine (SVM) is proposed. The
proposed method combines sequential backward search and sequential forward
search. Feature selection starts with the entire set of features in the dataset.
At every iteration, two feature subsets are obtained. One subset removes both
the least important features and the most important feature; it is used to train
the random forest and to compute feature importance for the next round of
feature selection. The other subset removes only the least important features
while retaining the most important feature; it is used as the candidate optimal
feature subset to train the SVM classifier. Finally, the feature subset with the
highest SVM classification accuracy is regarded as the optimal feature subset.
Experimental results on 11 UCI datasets, a real clinical dataset, and a gene
expression dataset show that the proposed algorithm can generate a smaller
feature subset while improving classification accuracy.

Keywords: Disease risk prediction, Feature selection, High dimensional data,
Random forest, Support vector machine.

1 Introduction
Disease risk prediction is an important issue in biomedicine and bioinformatics.
High-dimensional and redundant features in medical and biological data have
created an urgent need for feature selection techniques [1]. In general, feature
selection algorithms can be divided into Filter methods and Wrapper methods by the adopted

M. Basu, Y. Pan, and J. Wang (Eds.): ISBRA 2014, LNBI 8492, pp. 1–11, 2014.
© Springer International Publishing Switzerland 2014

feature selection strategy [2]. Filter methods are independent of the machine
learning algorithm; they can quickly remove noisy features and narrow the search
range for the optimal feature subset, but they do not guarantee finding a small
optimized feature subset. Conversely, Wrapper methods use the selected feature
subset directly to train classifiers during feature selection and evaluate the
quality of each feature subset according to the performance of the classifier on
the test set. Wrapper methods are computationally less efficient than Filter
methods, but they can yield smaller optimal feature subsets than Filter methods [3].
Random forest (RF henceforth) [4] is a popular ensemble machine learning
algorithm, which provides a unique combination of prediction accuracy and model
interpretability among popular machine learning methods [1]. RF uses the
bootstrap [16] to draw samples randomly, with replacement, from the original
samples and trains one decision tree on each bootstrap sample. When a node of a
tree is split, the splitting attribute is chosen from a randomly selected
feature subset [5, 6, 7]. Finally, the class of a new sample is decided by a
vote among the decision trees. RF has been widely used for classification,
prediction, variable importance estimation, feature selection, and outlier
detection [8, 9, 10, 11]. In biomedicine and bioinformatics in particular,
random forest is favored because it can efficiently identify complex
interactions among multiple predictors. Díaz-Uriarte et al. [12] investigated
the use of random forest for classification of microarray data and proposed a
method for gene selection in classification problems based on random forest.
Their experimental results showed that random forest has performance comparable
to other classification methods, including DLDA, KNN, and SVM, and that the
proposed gene selection procedure yielded very small sets of genes while
preserving predictive accuracy. However, this approach decided the number of
genes to retain arbitrarily, and it is not the most appropriate if the objective
is to obtain the smallest possible sets of genes that still allow good
predictive performance. Pang et al. [13] developed an iterative feature
elimination method based on random survival forests to identify a set of
prognostic genes. It is in effect an extension of the method proposed by
Díaz-Uriarte to the prediction of survival outcomes. This approach ordered the
genes by variable importance in descending order and removed the bottom 20
percent of genes (default), where 20 percent is also the default chosen by
Díaz-Uriarte. Dessì et al. [14] proposed a pre-filtering feature selection
method based on random forests for microarray data classification. They examined
random forests from an experimental perspective and evaluated the effects of a
filtering process that preceded the actual construction of the random forest.
However, a first critical issue within this approach is the choice of the
threshold value that marks the cut-off point of the list of ranked features.
Anaissi et al. [15] introduced a balanced iterative random forest (BIRF)
algorithm to select the most relevant genes for a disease from imbalanced
high-throughput gene expression microarray data. The experimental results showed
that the BIRF approach outperformed state-of-the-art methods such as Support
Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE
(MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers, especially in
the case of imbalanced datasets. However, the BIRF algorithm has the limitation
that random forest cannot capture global correlation because the dataset is
split.
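
The bootstrap-and-vote mechanism described at the start of this paragraph can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation used in the paper: `bootstrap_sample` and `majority_vote` are hypothetical helper names, and the training of the individual decision trees is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical usage: one bootstrap sample per tree, then a vote over the
# per-tree predictions for a single new sample.
rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
label = majority_vote(["B", "T", "B"])
```

Because sampling is done with replacement, some original samples typically appear more than once in a bootstrap sample while others are left out.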

In all the methods mentioned above, random forest was used directly as the
classifier that evaluates the quality of feature subsets during feature
selection, but the suitability of random forest for this role and its comparison
with other classification algorithms have not been studied systematically. This
paper studied the performance of random forest used as the feature subset
evaluation function and compared it with the k-nearest neighbor (KNN) and
support vector machine (SVM) classification algorithms. The experimental results
on the acute lymphoblastic leukemia (ALL) dataset showed that SVM is similar to
RF but superior to KNN in classification performance when each is used as the
feature subset evaluation function. On this basis, we proposed a new feature
selection method based on random forest, called RF&SVMFS, which is a wrapper
feature selector that combines random forest with a support vector machine.
RF&SVMFS also combines sequential backward search with sequential forward
search. The base learning algorithm is random forest, which is used to compute
the variable importance of each feature and to determine which features are
removed or selected at each step. The SVM algorithm is used to evaluate the
quality of feature subsets. Feature selection starts with the entire set of
features in the dataset. At every iteration, two feature subsets are generated.
The first subset removes both the least important features and the single most
important feature; it is used to train the random forest and to compute feature
importance for the next round of selection. The second subset removes only the
least important features and retains the most important feature; it is used as
the candidate feature subset to train the SVM classifier. The experimental
results on 11 UCI datasets, a real clinical dataset, and a gene expression
dataset show that the proposed algorithm yields smaller feature subsets while
improving classification accuracy.

2 Method

In this paper, we proposed a new feature selection method called RF&SVMFS, based
on random forest and support vector machine, which combines sequential backward
search with sequential forward search. In RF&SVMFS, RF is run first to compute
an importance score for each feature, and all features are then sorted by their
importance scores. To ensure the stability and reliability of the result, RF is
run 5 times in every iteration and the average of the 5 runs is used as the
basis for sorting the features. Next, the generalized sequential backward search
strategy and the sequential forward search strategy are used to generate feature
subsets. In detail, the L least important features (those with the lowest
importance scores) and the single most important feature are removed from the
current dataset to generate a new dataset. Meanwhile, another dataset is
generated by removing only the L least important features. The first dataset is
used to train the random forest and to compute variable importance for the next
iteration. The second dataset is used to train the support vector machine and to
evaluate the quality of the feature subset. To ensure the stability of the
results, 10-fold cross-validation is used when calculating the classification
accuracy. The above process was repeated

iteratively until the number of features in the feature set met the stopping
requirement (only 5 features left in the feature set in this research). Finally,
the feature set with the highest SVM classification accuracy over all iterations
was selected as the optimal feature set, and the variable importance scores were
calculated for each feature at the same time. The proposed algorithm is designed
as follows:
Input:  the original dataset S
        L value in generalized sequential backward search
Output: highest classification accuracy MaxAccuracy
        optimal feature subset GloOptFeatureSet
        importance scores of features FeatureScore
Steps:
1. Initialization: MaxAccuracy <- 0
                   OptFeatureSet <- S
                   TmpFeatureSet <- S
                   GloOptFeatureSet <- NULL
2. while (the number of features in OptFeatureSet > 5)
   2.1 Run RF 5 times on TmpFeatureSet, compute the average
       variable importance score of each feature, and store
       the averages as vector RFAverageScore;
   2.2 Order the features in TmpFeatureSet according to
       RFAverageScore and save the result as
       SortedFeatureSet;
   2.3 According to SortedFeatureSet, remove the L features
       with the lowest variable importance scores from
       OptFeatureSet to get the new dataset OptFeatureSet;
   2.4 According to SortedFeatureSet, remove the L features
       with the lowest variable importance scores and the
       most important feature from TmpFeatureSet to get the
       new dataset TmpFeatureSet;
   2.5 Randomly divide OptFeatureSet into 10 equal parts;
       train the SVM classifier on nine parts and use the
       remaining part as the test set to compute the
       classifier accuracy; repeat this 10 times so that each
       part serves once as the test set, then compute the
       average classification accuracy and save it as
       SVMAverageAccuracy;
   2.6 if (MaxAccuracy <= SVMAverageAccuracy)
           MaxAccuracy <- SVMAverageAccuracy
           GloOptFeatureSet <- features in OptFeatureSet
   End while
3. print(MaxAccuracy)
   print(GloOptFeatureSet)
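
The search loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `rf_svm_fs` and the two callables `importance_fn` (standing in for a random forest run that returns one importance score per feature) and `accuracy_fn` (standing in for the 10-fold cross-validated SVM accuracy) are hypothetical. The extra guard on the size of `tmp` is also an assumption, added because TmpFeatureSet shrinks by L + 1 features per iteration and can be exhausted before OptFeatureSet reaches 5 features; the pseudocode does not say what happens in that case.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def rf_svm_fs(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    accuracy_fn: Callable[[List[str]], float],
    L: int = 1,
    min_features: int = 5,
    n_runs: int = 5,
) -> Tuple[float, Optional[List[str]]]:
    """Return (best accuracy, best feature subset) found by the search."""
    opt = list(features)  # OptFeatureSet: evaluated by the classifier
    tmp = list(features)  # TmpFeatureSet: ranked by the importance scorer
    best_acc, best_set = 0.0, None

    # Assumption: also stop when too few features remain to be ranked.
    while len(opt) > min_features and len(tmp) > 1:
        # Step 2.1: average the importance scores over n_runs runs.
        totals = {f: 0.0 for f in tmp}
        for _ in range(n_runs):
            scores = importance_fn(tmp)
            for f in tmp:
                totals[f] += scores[f]
        average = {f: totals[f] / n_runs for f in tmp}

        # Step 2.2: sort from most to least important.
        ranked = sorted(tmp, key=lambda f: average[f], reverse=True)

        # Steps 2.3 and 2.4: drop the L weakest features from both sets,
        # and additionally drop the strongest feature from tmp only.
        n_drop = min(L, len(ranked) - 1)
        weakest = set(ranked[len(ranked) - n_drop:])
        strongest = ranked[0]
        opt = [f for f in opt if f not in weakest]
        tmp = [f for f in tmp if f not in weakest and f != strongest]

        # Steps 2.5 and 2.6: evaluate the candidate subset, keep the best.
        acc = accuracy_fn(opt)
        if acc >= best_acc:
            best_acc, best_set = acc, list(opt)

    return best_acc, best_set
```

Passing the two learners in as callables keeps the sketch free of any particular library; in practice `importance_fn` would wrap a random forest and `accuracy_fn` a cross-validated SVM. As in step 2.6, ties in accuracy go to the later, smaller subset.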

3 Experiments and Discussion


3.1 Datasets

To validate the effectiveness of the proposed feature selection algorithm, we
selected 11 UCI datasets frequently used in the literature [17], a real diabetes
clinical dataset (DiabetesDB), and an acute lymphoblastic leukemia (ALL)
dataset [18]. Detailed information about these datasets is shown in Table 1. The
dimensions of the UCI datasets range from 6 to 61, and the data types include
discrete data, continuous data, and mixed discrete and continuous data. The
diabetes clinical data were collected from a level-three hospital in
Heilongjiang Province, China, between 2006 and 2012, and include 955 records of
patients with type II diabetes. Each original record has 72 features. On the
advice of endocrinology experts, a portion of obviously irrelevant and redundant
features was removed, and the final dataset includes 46 classification variables
and one objective variable. The ALL dataset consists of expression values for
12625 genes from 128 different individuals with acute lymphoblastic leukemia
(ALL), of which 33 are T-cell and 95 are B-cell acute lymphoblastic leukemia.
The data have been normalized (using rma), and it is the jointly normalized data
that are available here. The data are presented in the form of an exprSet
object. In this paper, we focus on the analysis of B-cell acute lymphoblastic
leukemia, and gene mutation is our target class. We therefore selected 94 B-cell
acute lymphoblastic leukemia samples with 12625 genes and an objective variable
representing the type of acute lymphoblastic leukemia as our dataset, marked as
ALLb.

Table 1. Summary of Datasets

No.  Dataset      Nominal     Continuous  Instance  Number of
                  attributes  attributes  size      classes
1    Breast       1           9           699       2
2    Chess        36          0           3196      2
3    Credit       9           6           690       2
4    Diabetes     0           8           768       2
5    Heart        7           6           270       2
6    Liver        0           6           345       2
7    wpbc         1           33          198       2
8    wdbc         1           31          569       2
9    German-org   15          7           1000      2
10   Sonar        1           60          208       2
11   Ionosphere   0           34          351       2
12   DiabetesDB   1           46          955       2
13   ALLb         0           12626       94        4

3.2 Experiments on UCI Datasets

Experimental results on the UCI datasets are shown in Table 2 and Table 3, where
"Feature" is the number of features in the optimal feature subset, "Acc" is the
classification accuracy on the test set, and "NA" indicates that no experimental
result is available in the referenced literature. Here, the L value in RF&SVMFS
is set to 1. Table 2 and Table 3 show that, on datasets 1, 3 and 5, RF&SVMFS
selected an optimal feature subset smaller than or equal in size to those of
CBFS [19] and AMGA [20], with higher classification accuracy. On datasets 7, 8,
9, 10 and 11, the number of features selected by RF&SVMFS is similar to that of
the other algorithms, but the classification accuracy is higher than that of
ACAHFS [21] and GA-cull [21]. In addition, the higher the dimension of the
dataset, the better RF&SVMFS performs. In short, the proposed algorithm
outperformed the existing methods in the literature with respect to both the
quality of the feature subset and the classification accuracy.

Table 2. The Performance Comparison among RF&SVMFS, CBFS and AMGA

No.  CBFS             AMGA             RF&SVMFS
     Feature  Acc     Feature  Acc     Feature  Acc
1    6        0.943   NA       NA      5        0.985
2    22       0.954   5        0.991   22       0.967
3    5        0.848   NA       NA      5        0.885
4    4        0.670   NA       NA      4        0.79
5    9        0.791   5        0.804   3        0.963
6    3        0.628   8        0.915   3        0.768

Table 3. The Performance Comparison among RF&SVMFS, ACAHFS and GA-cull

No.  ACAHFS           GA-cull          RF&SVMFS
     Feature  Acc     Feature  Acc     Feature  Acc
7    3        0.796   10       0.694   6        0.854
8    5        0.979   2        0.979   9        0.983
9    17       0.732   2        0.692   14       0.793
10   16       0.824   2        0.827   17       0.918
11   6        0.965   6        0.937   6        0.972

3.3 Experiments on Real Diabetes Dataset

For the DiabetesDB dataset, we ran the RF, SVM and RF&SVMFS algorithms
separately. To ensure the stability of the results, we adopted 10-fold
cross-validation for each algorithm and computed the average classification
accuracy over the 10 test sets. The results are shown in Table 4. The
classification accuracy of RF&SVMFS is considerably higher than that of

the original RF and SVM algorithms. In addition, the method based on random
forests provides a variable importance score for each feature. The larger the
importance score, the greater the impact of the feature on the target variable.
This can help medical experts to understand the results of data mining.
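
The 10-fold cross-validation used for these accuracy figures can be sketched as follows. This is an illustrative sketch, not the authors' code: `k_fold_indices` and `cross_val_accuracy` are hypothetical names, and `fit_predict` is an assumed callable that trains any classifier on the training indices and returns predicted labels for the test indices. The sketch assumes at least k samples so that no fold is empty.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(labels, fit_predict, k=10, seed=0):
    """Average test accuracy over k folds; each fold is the test set once."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on the other k - 1 folds.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        predictions = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[j] for p, j in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

With 955 records and k = 10, five folds hold 96 records and five hold 95, so every record is used for testing exactly once.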

Table 4. Performance of RF, SVM and RF&SVMFS on DiabetesDB Dataset

Algorithm  Accuracy  Features Set
RF         0.742     All Features
SVM        0.723     All Features
RF&SVMFS   0.832     Age CH ALT INS30 INS60 AST LDL-C Cr
                     CP60 FCP CP30 INS120

We also studied the risk factors of Peripheral Arterial Disease. The top 10 risk
factors are shown in Figure 1. Age is the primary risk factor for Peripheral
Arterial Disease, and smoking history ranks second. These results are consistent
with previous findings. ALT is the third risk factor. Recent studies have shown
that ALT is a marker of liver damage, which is associated with atherosclerosis,
so this finding is also consistent with previous research. We also made a new
observation: INS30 and INS60 are important risk factors for Peripheral Arterial
Disease, and their impacts are similar. According to medical knowledge, insulin
is a potent growth factor that can increase collagen synthesis and stimulate
vascular smooth muscle cell proliferation. This is part of the process of
atherosclerosis, and insulin levels therefore reflect, to some extent,
atherosclerosis of the lower limbs. Overall, the results of this study are
highly consistent with previous studies, which suggests that the proposed
RF&SVMFS algorithm is reasonable.

Fig. 1. Top-10 risk factors of Peripheral Arterial Disease

3.4 Experiments on Acute Lymphoblastic Leukemia Dataset

In this section, we studied the capability of different classifiers used as feature subset
evaluating function and compared the performance of the proposed RF&SVMFS with

some popular feature selection methods using the acute lymphoblastic leukemia
dataset (ALLb) [18]. ALLb is a microarray gene expression dataset that includes
94 B-cell acute lymphoblastic leukemia samples with 12625 genes. The objective
variable has four categories: ALL/AF4, BCR/ABL, E2A/PBX1 and NEG. Because the
dimension of this dataset is very large, feature selection was performed before
the prediction model was trained. First, we used the interquartile range (IQR)
to filter genes based on the distribution of gene expression levels, and all
genes whose variability is less than 1/5 of the overall IQR were eliminated.
This reduced the number of genes from 12625 to 3970. Next, we applied analysis
of variance (ANOVA), filter feature selection based on random forest (FFSRF),
filter feature selection based on combined feature clustering (FFSFC) [18] and
our proposed method. Because KNN, randomForest and SVM were each used as the
feature evaluation function here, we denote the proposed feature selection
method by WFSRF, which is different from FFSRF. As a result, ANOVA selected 752
genes; FFSRF and FFSFC each selected the top 30 genes; and RF&SVMFS selected the
top 50 and the top 20 genes. Finally, the KNN, randomForest and SVM algorithms
were run on these feature subsets, and the classification accuracy of each case
is shown in Table 5. As Table 5 shows, the proposed WFSRF methods are overall
superior to ANOVA, FFSRF and FFSFC with respect to classification accuracy. When
KNN, randomForest and SVM are each used as the feature subset evaluation
function, randomForest and SVM are evenly matched, and both are superior to KNN.
This supports the validity of our proposed method.
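
The IQR pre-filtering step described above can be sketched as follows. This is a sketch under assumptions, not the procedure actually run in the paper: `iqr` and `iqr_filter` are hypothetical names, "overall IQR" is taken to mean the IQR of all expression values pooled across genes, and the `inclusive` quantile method of Python's `statistics` module is an arbitrary choice that may differ from the quantile definition the authors used.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) of a sequence of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expr, fraction=0.2):
    """Keep genes whose own IQR is at least `fraction` of the overall IQR.

    expr maps a gene name to its list of expression values across samples.
    """
    pooled = [v for row in expr.values() for v in row]
    threshold = fraction * iqr(pooled)
    return [gene for gene, row in expr.items() if iqr(row) >= threshold]


# Hypothetical toy data: g1 is constant across samples and is filtered out.
toy = {"g1": [1, 1, 1, 1], "g2": [0, 5, 10, 15], "g3": [4, 5, 6, 7]}
kept = iqr_filter(toy)
```

Genes that barely vary across samples carry little information for separating classes, which is why a variability filter is a cheap first step before the more expensive wrapper search.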

Table 5. Classification Accuracy of Different Methods

Feature selection  Classifier
                   KNN     randomForest  SVM
ANOVA.752          0.8298  0.7979        0.8617
FFSRF.30           0.8830  0.8723        0.8617
FFSFC.30           0.8617  0.8298        0.8511
WFSRF.20           0.8421  0.8947        0.8947
WFSRF.50           0.8947  0.9475        0.9475

3.5 Discussion on L Value

In this section, we explore the setting of the L value in the generalized
sequential backward search strategy. Clearly, a larger L value means that more
features are removed at each iteration and the algorithm runs faster, but some
important features may be deleted prematurely. A smaller L value provides more
fine-grained deletion but increases computing time. We designed a set of
experiments on the 11 UCI datasets and ran RF&SVMFS with L set to 1, 2, 3, 4
and 5. The results are shown in Figure 2. On the Sonar dataset, RF&SVMFS obtains
its best classification accuracy of 94.23% when L is set to 3. On the other 10
datasets, the classification accuracy shows a downward trend as L increases;
that is, RF&SVMFS gives the best results when L is set to 1. This may be
because, when the dimension of the

dataset is small, deleting one feature at a time can effectively eliminate
redundant and irrelevant features, whereas a larger L value causes some
important features to be removed together with unrelated ones. However, when the
dimension of the dataset is very large, as in ALLb, a larger L value makes it
possible to remove redundant and irrelevant features quickly and to improve the
classification performance, as shown for the Sonar dataset in this paper. As a
rule of thumb, when the dimension of the dataset is very large, L should be set
to √N, where N is the number of features of the dataset. For the ALLb dataset in
this paper, we adopted a combined method: when the dimension of the dataset is
larger than 50, we set L to 50; when the dimension is smaller than 50, we set L
to 5.
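
The two ways of choosing L described above can be written down directly. This is a small sketch with hypothetical function names; the text does not say what happens when exactly 50 features remain, so treating that case like the smaller-than-50 case is an assumption.

```python
import math


def rule_of_thumb_L(n_features):
    """L = sqrt(N), rounded to the nearest integer and at least 1."""
    return max(1, round(math.sqrt(n_features)))


def allb_schedule_L(n_remaining):
    """Combined schedule used for ALLb: large steps first, then small ones.

    Assumption: exactly 50 remaining features falls into the L = 5 branch.
    """
    return 50 if n_remaining > 50 else 5


# For the 3970 genes left after IQR filtering:
L_sqrt = rule_of_thumb_L(3970)   # square-root rule
L_first = allb_schedule_L(3970)  # first step of the combined schedule
```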

Fig. 2. Performance comparison of SVM classification while L adopts different values

4 Conclusions
Due to the high-dimensional feature space and high feature redundancy in
biomedical and bioinformatics datasets, existing machine learning algorithms
have not been adequate for data mining tasks in these fields. The random forest
algorithm can analyze complex interactions among features and provides variable
importance scores, which can be used as a convenient tool for feature selection.
This paper proposed a new Wrapper feature selection algorithm based on random
forest variable importance measures and a support vector machine. The proposed
method combines the generalized sequential backward search strategy with the
sequential forward search strategy for feature selection. Experimental results
show that the proposed feature selection algorithm is capable of finding the
optimal feature

subset and can effectively improve the classification accuracy. At the same
time, the algorithm provides a variable importance score for each feature in the
optimal feature subset, which makes the data mining results easier to interpret.
In addition, we studied the capability of different classification algorithms
used as the feature subset evaluation function, and the experiments show that
SVM is evenly matched with random forest but superior to KNN on the ALLb
dataset. Experimental validation and further study on more datasets are the next
directions of this research.

Acknowledgements. This work is sponsored by the National Natural Science
Foundation of China (No.61370083, No.61073043, and No.61073041), the National
Research Foundation for the Doctoral Program of Higher Education of China
(No.20112304110011, No.20122304110012), the Natural Science Foundation of
Heilongjiang Province (No.F200901, No.F201313), the Harbin Outstanding Academic
Leader Foundation of Heilongjiang Province of China (No.2011RFXXG015), the Harbin
Special Funds for Technological Innovation Research of Heilongjiang Province of China
(No.2013RFQXJ114), and the Foundation of Heilongjiang Province Educational
Committee (No.12511233).

References
1. Qi, Y.: Random Forest for Bioinformatics. In: Ensemble Machine Learning, pp. 307–323
(2012)
2. Inza, I., Larranaga, P., Blanco, R.: Filter versus wrapper gene selection approaches in
DNA microarray domains. Artificial Intelligence in Medicine 31(2), 91–103 (2008)
3. Tsymbal, A., Puuronen, S.: Ensemble feature selection with the simple Bayesian
classification. Information Fusion 4(2), 87–100 (2010)
4. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
5. Bishop, C.M.: Bootstrap. Pattern Recognition and Machine Learning. Springer, Singapore
(2006)
6. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
7. Breiman, L., Friedman, J.H., Olshen, R.A., et al.: Classification and Regression Trees.
Chapman&Hall (1993)
8. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable
importance for random forests. BMC Bioinformatics 9, 307 (2008)
9. Verikas, A., Gelzinis, A., Bacauskiene, M.: Mining data with random forests: A survey
and results of new tests. Pattern Recognition 44, 330–349 (2011)
10. Liu, H., Li, J.: A comparative study on feature selection and classification methods using
gene expression profiles and proteomic patterns. Genome Informatics 13, 51–60 (2012)
11. Wang, A., Wan, G., Cheng, Z., et al.: Incremental Learning Extremely Random Forest
Classifier for Online Learning. Journal of Software 22(9), 2059–2074 (2011)
12. Díaz-Uriarte, R., de Andrés, S.A.: Gene selection and classification of microarray data
using random forest. BMC Bioinformatics 7, 3 (2006)
13. Pang, H., George, S.L., Hui, K., Tong, T.: Gene Selection Using Iterative Feature
Elimination Random Forests for Survival Outcomes. IEEE/ACM Transactions on
Computational Biology and Bioinformatics 9(5), 1422–1431 (2012)

14. Dessì, N., Milia, G., Pes, B.: Pre-filtering Features in Random Forests for Microarray Data
Classification. In: New Frontiers in Mining Complex Patterns (NFMCP 2012). vol. 60
(2012)
15. Anaissi, A., Kennedy, P.J., Goyal, M., Catchpoole, D.R.: A balanced iterative random
forest for gene selection from microarray data. BMC Bioinformatics 14, 261 (2013)
16. Yi, C., Li, J., Zhu, C.: A kind of feature selection based on classification accuracy of SVM.
Journal of Shandong University 45(7), 119–124 (2010)
17. UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/
18. Torgo, L.: Data Mining with R: Learning with Case Studies. Chapman & Hall/CRC
(2010)
19. Jiang, S., Zheng, Q., Zhang, Q.: Clustering-Based Feature Selection. Acta Electronica
Sinica 36(12), 157–160 (2008)
20. Liu, Y., Wang, G., Zhu, X.: Feature selection based on adaptive multi-population genetic
algorithm. Journal of Jilin University 41(6), 1690–1693 (2011)
21. Zhang, J., He, Z., Wang, J.: Hybrid Feature Selection Algorithm Based on Adaptive Ant
Colony Algorithm. Journal of System Simulation 21(6), 1605–1614 (2009)
Phylogenetic Bias in the Likelihood Method
Caused by Missing Data Coupled with Among-Site Rate
Variation: An Analytical Approach

Xuhua Xia

Department of Biology and Center for Advanced Research in Environmental Genomics,
University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A,
Ottawa, Ontario, Canada, K1N 6N5
xxia@uottawa.ca

Abstract. More and more researchers in phylogenetics are concatenating gene
sequences to produce supermatrices in the hope that larger data sets will lead to
better phylogenetic resolution. Almost all of these supermatrices contain a high
proportion of missing data which could potentially cause phylogenetic bias.
Previous studies aiming to identify the missing-data-mediated bias in the
maximum likelihood method have noted a bias associated with among-site rate
variation. However, this finding is by sequence simulation and has been
challenged by other simulation studies, with the controversy still unresolved.
Here I illustrate analytically this bias caused by missing data coupled with
among-site rate variation. This approach allows one to see how much the bias can
contribute to likelihood differences among different topologies. The study
highlights the point that, while supermatrices may lead to “robust” trees, such
“robust” trees may be purchased with illegal phylogenetic currency.

Keywords: missing data, pruning algorithm, likelihood, phylogenetic bias,
supermatrix.

1 Introduction
Many supermatrices have been compiled in recent years by concatenating sequences
from many different genes [1-4]. Such concatenated genes typically have few
shared sites among all included species. For example, while Regier et al. [3]
claimed to have 41 kilobases of aligned DNA sequences, the actual number of
sites that are completely unambiguous among all 80 species amounts to only 705
sites. Some genes are completely missing in nearly half of the 80 species. While
the potential problems involving such “?”-laden supermatrices have been
suspected before [5], specific biases associated with such missing data have not
been well studied, especially not in the likelihood framework which has been the
gold standard in phylogenetic reconstruction.
Previous studies [6-11] attempted to identify bias associated with missing data
either by sequence simulation or by selectively eliminating sites in a real
sequence alignment. While most publications suggest that phylogenetic
reconstruction is not sensitive to missing data or that the benefit of including
taxa with missing data

M. Basu, Y. Pan, and J. Wang (Eds.): ISBRA 2014, LNBI 8492, pp. 12–23, 2014.
© Springer International Publishing Switzerland 2014

outweighs the cost of their exclusion [6, 8-11], a recent study [7] suggested a
significant bias associated with missing data coupled with among-site rate
variation. However, such simulation-based findings often cannot pinpoint where
the bias arises and consequently have been challenged by others on both
empirical [6, 9, 11] and theoretical grounds [9], although these latter
publications did not explicitly test the claimed bias [7] associated with
among-site variation. Roure et al. [9] noted that, if sequences contain similar
phylogenetic information, then phylogenetic reconstruction is not sensitive to
missing data. However, they also noted, based on extensive data analysis, that
heterogeneous data could lead to phylogenetic bias.
Here I demonstrate analytically the bias associated with missing data coupled
with among-site rate variation. The pruning algorithm [12, 13, 14, pp. 253-255]
is briefly outlined, in conjunction with the conventional missing data handling
by the likelihood method, so that the reader can verify the claimed bias
introduced by missing data. I first illustrate the “bias” shown by Lemmon et
al. [7] when branch lengths are not allowed to be zero, by using both the JC69
[15] and F84 [16] models. Such a “bias” can be easily avoided by simply allowing
branch lengths to be zero and should not be considered as estimation bias in the
likelihood method. However, the bias due to missing data associated with
among-site rate variation [7] is real. This bias can lead either to an increased
tendency (and confidence) to group together OTUs (operational taxonomic units)
that share the same stretches of missing sites, or to a bias in the opposite
direction. The results suggest that blindly concatenating sequence data to
generate a supermatrix with many pieces of missing data will generate false
confidence in phylogenetic resolution and should be avoided.

2 Missing Data Handling and the Pruning Algorithm

The likelihood approach features a convenient way to handle missing data, which
is best illustrated with the pruning algorithm. Suppose we have four OTUs with
sequence data in Fig. 1a, and with the last two sequences being entirely missing
(represented by ‘?’). Obviously, we can only estimate the distance between S1
and S2 but not the evolutionary relationships involving OTUs S3 or S4. The
maximum likelihood distance between S1 and S2, based on the JC69 model, is given
by

$$L = \frac{8!}{4!\,4!}\, P_{ii}^{4}\, P_{ij}^{4} \qquad (1)$$
which, when maximized, leads to a distance of 0.8239592165.
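
The value 0.8239592165 can be checked numerically. The sketch below assumes, from the exponents in Eq. (1), that 4 of the 8 compared sites are identical and 4 differ (Fig. 1a itself is not reproduced here), and uses the standard JC69 transition probabilities P_ii(d) = 1/4 + (3/4)e^(-4d/3) and P_ij(d) = 1/4 - (1/4)e^(-4d/3). The function names are hypothetical.

```python
import math


def jc69_p_same(d):
    """JC69 probability that a site is unchanged after distance d."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def jc69_p_diff(d):
    """JC69 probability of one specific different nucleotide after d."""
    return 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)


def log_likelihood(d, n_same=4, n_diff=4):
    """Log of Eq. (1): binomial coefficient times P_ii^n_same P_ij^n_diff."""
    return (math.log(math.comb(n_same + n_diff, n_same))
            + n_same * math.log(jc69_p_same(d))
            + n_diff * math.log(jc69_p_diff(d)))


def jc69_distance(n_same, n_diff):
    """Closed-form maximizer: d = -(3/4) ln(1 - 4p/3), p = fraction differing."""
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d_hat = jc69_distance(4, 4)  # equals (3/4) ln 3
```

With p = 1/2 the closed form reduces to (3/4) ln 3, and evaluating `log_likelihood` slightly to either side of `d_hat` gives smaller values, as expected at a maximum.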
Fig. 2 illustrates the computation of the likelihood by the pruning algorithm, given
the first site of the aligned nucleotide sequence (Fig. 1a) and topology T1 in Fig. 1. I
included the numerical illustration here to facilitate the verification of subsequent
claims that the maximum likelihood method does exhibit a true and identifiable bias
in phylogenetic reconstruction involving missing data coupled with among-site rate
variation.
Another random document with
no related content on Scribd:
many instances the lack of teachers is greater in those
provinces which are most thickly populated and whose people
are most highly civilized. …

"While most of the small towns have one teacher of each sex,
in the larger towns and cities no adequate provision is made
for the increased teaching force necessary; so that places of
30,000 or 40,000 inhabitants are often no better off as
regards number of teachers than are other places in the same
province of but 1,500 or 2,000 souls. The hardship thus
involved for children desiring a primary education will be
better understood if one stops to consider the nature of the
Philippine 'pueblo,' which is really a township, often
containing within its limits a considerable number of distinct
and important villages or towns, from the most important of
which the township takes its name. The others, under distinct
names, are known as 'barrios,' or wards. It is often quite
impossible for small children to attend school at the
particular town which gives its name to the township on
account of their distance from it. …

"The character and amount of the instruction which has
heretofore been furnished is also worthy of careful
consideration. The regulations for primary schools were as
follows: 'Instruction in schools for natives shall for the
present be reduced to elementary primary instruction and shall
consist of—

1. Christian doctrine and principles of morality and sacred
history suitable for children.

2. Reading.

3. Writing.

4. Practical instruction in Spanish, including grammar
and orthography.

5. Principles of arithmetic, comprising the four rules for
figures, common fractions, decimal fractions, and instruction
in the metric system with its equivalents in ordinary weights
and measures.

6. Instruction in general geography and Spanish history.

7. Instruction in practical agriculture as applied to the
products of the country.

8. Rules of deportment.

9. Vocal music.'

"It will be noted that education in Christian doctrine is
placed before reading and writing, and, if the natives are to
be believed, in many of the more remote districts instruction
began and ended with this subject and was imparted in the
local native dialect at that. It is further and persistently
charged that the instruction in Spanish was in very many cases
purely imaginary, because the local friars, who were formerly
'ex officio' school inspectors, not only prohibited it, but
took active measures to enforce their dictum. … Ability to
read and write a little of the local native language was
comparatively common. Instruction in geography was extremely
superficial. As a rule no maps or charts were available, and
such information as was imparted orally was left to the memory
of the pupil, unaided by any graphic method of presentation.
The only history ever taught was that of Spain, and that under
conventional censorship. The history of other nations was a
closed volume to the average Filipino. … The course as above
outlined was that prescribed for boys. Girls were not given
instruction in geography, history, or agriculture, but in
place of these subjects were supposed to receive instruction
'in employments suitable to their sex.'

"It should be understood that the criticisms which have been
here made apply to the provincial schools. The primary
instruction given at the Ateneo Municipal at Manila, under the
direction of the Jesuits, fulfilled the requirements of the
law, and in some particulars exceeded them. … The only
official institution for secondary education in the
Philippines was the College of San Juan de Letran, which was
in charge of the Dominican Friars and was under the control of
the university authorities. Secondary education was also given in
the Ateneo Municipal of Manila, by the Jesuit Fathers, and
this institution was better and more modern in its methods
than any other in the archipelago. But although the Jesuits
provided the instruction, the Dominicans held the
examinations. … There are two normal schools in Manila, one
for the education of male and the other for the education of
female teachers. … The only institutions for higher education
in the Philippines have been the Royal and Pontifical
University of Santo Tomas, and the Royal College of San José,
which has for the past twenty-five years been under the
direction of the university authorities."

Report of the Philippine Commission,
January 1, 1900, volume 1, part 3.

EDUCATION: Porto Rico: A. D. 1898.
Spanish schools and teachers.

See (in this volume)
PORTO RICO: A. D. 1898-1899 (AUGUST-JULY).

EDUCATION: Porto Rico: A. D. 1900.
First steps in the creation of a public school system.

See (in this volume)
PORTO RICO: A. D. 1900 (AUGUST-OCTOBER).

EDUCATION: Russia:
Student troubles in the universities.

See (in this volume)
RUSSIA: A. D. 1899 (FEBRUARY-JUNE); 1900; and 1901.

EDUCATION: Tunis:
Schools under the French Protectorate.

See (in this volume)
TUNIS: A. D. 1881-1898.

EDUCATION: United States:
Indian schools.

See (in this volume)
INDIANS, AMERICAN: A. D. 1899-1900.

EDUCATION: United States: A. D. 1896.
Princeton University.

The one hundred and fiftieth anniversary of the founding of
the institution at Princeton, New Jersey, which had borne the
name of "The College of New Jersey," was celebrated on the
20th, 21st, and 22d of October, 1896, with ceremonies in which
many representatives from famous seats of learning in Europe
and America took part. The proceedings included a formal
change of name, to Princeton University.

{195}

EDUCATION: United States: A. D. 1900.
Women as students and as teachers.

See (in this volume)
NINETEENTH CENTURY: THE WOMAN'S CENTURY.

EDWARD VII., King of England.
Accession.
English estimate of his character.

See (in this volume)
ENGLAND: A. D. 1901 (JANUARY-FEBRUARY).

EDWARD VII., King of England.
Opening of his first Parliament.
The Royal Test Oath.

See (in this volume)
ENGLAND: A. D. 1901 (FEBRUARY).

----------EGYPT: Start--------

EGYPT:
Recent Archæological Explorations and their result.
Discovery of prehistoric remains.
Light on the first dynasties.

See (in this volume)
ARCHÆOLOGICAL RESEARCH: EGYPT: RESULTS.

EGYPT: A. D. 1885-1896.
Abandonment of the Egyptian Sudan to the Dervishes.
Death of the Mahdi and reign of the Khalifa.
Beginning of a new Anglo-Egyptian movement for
the recovery of the Sudan.
The expedition to Dongola.

After the failure to rescue General Gordon from the Mahdists
at Khartoum (see, in volume 1, EGYPT: A. D. 1884-1885), the
British government, embarrassed in other quarters, felt
compelled to evacuate the Sudan. Before it did so the Mahdi
had finished his career, having died of smallpox in June,
1885, and one of his three chief commanders, styled khalifas,
had acquired authority over the Dervish army and reigned in
his place. This was the Khalifa Abdullah, a chieftain of the
Baggara tribe. Khartoum had been destroyed, and Omdurman, on
the opposite side of the river, became his capital. The rule
of the Khalifa was soon made so cruelly despotic, and so much
in the interest of his own tribe, that incessant rebellions in
many parts of his dominions restrained him from any vigorous
undertaking of the conquest of Egypt, which was the great
object of Dervish desire. But his able and energetic
lieutenant in the Eastern Sudan, Osman Digna, was a serious
menace to the Egyptian forces holding Suakin, where Major
Watson, at first, and afterwards Colonel Kitchener, were
holding command, under General Grenfell, who was then the
Egyptian Sirdar, or military chief. Osman Digna, however, was
defeated in all his attempts. At the same time the Khalifa was
desperately at war with the Negus, John, of Abyssinia, who
fell in a great battle at Galabat (March, 1889), and whose
death at the crisis of the battle threw his army into
confusion and caused its defeat. Menelek, king of the
feudatory state of Shoa, acquired the Abyssinian crown, and
war with the Dervishes was stopped. Then they began an advance
down the Nile, and suffered a great defeat from the British
and Egyptian troops, at Toski, on the 17th of August, 1889.
From that time, for several years, "there was no real menace
to Egypt," and little was heard of the Khalifa. "His
territories were threatened on all sides: on the north by the
British in Egypt; on the south by the British in Uganda; on
the west by the Belgians in the Congo Free State, and by the
French in the Western Soudan; whilst the Italians held Kassala
on the east; so that the Khalifa preferred to husband his
resources until the inevitable day should arrive when he would
have to fight for his position."

A crisis in the situation came in 1896. The Egyptian army,
organized and commanded by British officers, had become a
strong fighting force, on which its leaders could depend. Its
Sirdar was now Major-General Sir Herbert Kitchener, who
succeeded General Grenfell in 1892. Suddenly there came news,
early in March, 1896, of the serious reverse which the
Italians had suffered at Adowa, in their war with the
Abyssinians (see, in this volume, ITALY: A. D. 1895-1896).
"The consternation felt in England and Egypt at this disaster
deepened when it became known that Kassala, which was held by
the Italian forces, was hemmed in, and seriously threatened by
10,000 Dervishes, and that Osman Digna was marching there with
reinforcements. If Kassala fell into the hands of the
Dervishes, the latter would be let loose to overrun the Nile
valley on the frontier of Egypt, and threaten that country
itself. As if in anticipation of these reinforcements, the
Dervishes suddenly assumed an offensive attitude, and it was
rumoured that a large body of Dervishes were contemplating an
immediate advance on Egypt. … A totally new situation was now
created, and immediate action was rendered imperative.
Everything was ripe for an expedition up the Nile. Whilst
creating a diversion in favour of the Italians besieged at
Kassala, it afforded an opportunity of creating a stronger
barrier than the Wady Halfa boundary between Egypt and the
Dervishes, and it would moreover be an important step towards
the long-wished-for recovery of the Soudan. The announcement
of the contemplated expedition was made in the House of
Commons on the 17th of March, 1896, by Mr. Curzon,
Under-Secretary for Foreign Affairs. It came as a great
surprise to the whole country, which, having heard so little
of the Dervishes of late years, was not prepared for a
recrudescence of the Soudan question. [But a vote of censure
on the Egyptian policy of the government, moved by Mr. John
Morley in the House of Commons, was rejected by 288 to 145.] …

"An unexpected difficulty arose in connection with the
financing of the expedition. This is explained very plainly
and concisely in the 'Annual Register,' 1896, which we quote
at length:—

'In order to defray the cost of the undertaking, it being
obviously desirable to impose as little strain as possible on
the slowly recovering finances of Egypt, it was determined by
the Egyptian Government to apply for an advance of £500,000
from the General Reserve Fund of the Caisse de la Dette, and
the authorities of the Caisse obligingly handed over the
money. … However, the French and Russian members of the Caisse
de la Dette protested against the loan which the Caisse had
made. … In December (1896) the International Court of Appeal
required the Egyptian Government to refund to the Caisse the
£500,000 which they had secured. The very next day Lord Cromer
offered an English loan to make good the advance. The Egyptian
Government accepted his offer, and repaid immediately the
£500,000 to the Caisse, and the result of this somewhat absurd
transaction is that England has thus strengthened her hold in
another small point on the Government of Egypt.'"

H. S. L. Alford, and W. D. Sword,
The Egyptian Soudan, its Loss and Recovery,
chapter 4 (London: Macmillan & Company).

{196}

On the 21st of March, the Sirdar left Cairo for Assouan and
Wady Halfa, and various Egyptian battalions were hurried up
the river. Meantime, the forces already on the frontier had
moved forward and taken the advanced post of the Dervishes, at
Akasheh. From that point the Sirdar was ready to begin his
advance early in June, and did so with two columns, a River
Column and a Desert Column, the latter including a camel corps
and a squadron of infantry mounted on camels, besides cavalry,
horse artillery and Maxim guns. Ferket, on the east bank of
the Nile, 16 miles from Akasheh, was taken after hard fighting
on the 7th of June, many of the Dervishes refusing quarter and
resisting to the death. They lost, it was estimated, 1,000
killed and wounded, and 500 were taken prisoners. The Egyptian
loss was slight. The Dervishes fell back some fifty miles, and
the Sirdar halted at Suarda during three months, while the
railroad was pushed forward, steamers dragged up the cataracts
and stores concentrated, the army suffering greatly, meantime,
from an alarming epidemic of cholera and from exhausting
labors in a season of terrific heat. In the middle of
September the advance was resumed, and, on the 23d, Dongola
was reached. Seeing themselves outnumbered, the enemy there
retreated, and the town, or its ruins, was taken with only a
few shots from the steamers on the river. "As a consequence of
the fall of Dongola every Dervish fled for his life from the
province. The mounted men made off across the desert direct to
Omdurman, and the foot soldiers took the Nile route to Berber,
always being careful to keep out of range of the gunboats,
which were prevented by the Fourth Cataract from pursuing them
beyond Merawi."

C. Royle,
The Egyptian Campaigns, new and revised edition,
to December, 1899, chapter 70-71.

The Emir who commanded at Dongola was a comparatively young
man, Mohammed Wad el Bishara, who seems to have been possessed
by a very genuinely religious spirit, as shown in the
following letter, which he had written to the Dervish
commander at Ferket, just before the battle there, and which
was found by the British officers when they entered that
place:

"You are, thank God, of good understanding, and are thoroughly
acquainted with those rules of religion which enjoin love and
unison. Thanks be to God that I hear but good reports of you.
But you are now close to the enemy of God, and have with you,
with the help of God, a sufficient number of men. I therefore
request you to unite together, to have the heart of a single
man founded on love and unity. Consult with one another, and
thus you will insure good results, which will strengthen the
religion and vex the heathen, the enemies of God. Do not move
without consulting one another, and such others, also, in the
army who are full of sense and wisdom. Employ their plans and
tricks of war, in the general fight more especially. Your
army, thank God, is large; if you unite and act as one hand,
your action will be regular: you will, with the help of God,
defeat the enemies of God and set at ease the mind of the
Khalifa, peace be on him! Follow this advice, and do not allow
any intrigues to come between you. Rely on God in all your
doings; be bold in all your dealings with the enemy; let them
find no flaw in your disposition for the fight. But be ever
most vigilant, for these enemies of God are cunning, may God
destroy them! Our brethren, Mohamed Koku, with two others,
bring you this letter; on their return they will inform me
whether you work in unison or not. Let them find you as
ordered in religion, in good spirits, doing your utmost to
insure the victory of religion. Remember, my brethren, that
what moves me to urge on you to love each other and to unite
is my love for you and my desire for your good. This is a
trial of war; so for us love and amity are of utmost
necessity. You were of the supporters of the Mahdi, peace be
on him! You were as one spirit occupying one body. When the
enemy know that you are quite united they will be much
provoked. Strive, therefore, to provoke these enemies of
religion. May God bless you and render you successful."

EGYPT: A. D. 1895.
New anti-slavery law.

A convention to establish a more effective anti-slavery law in
Egypt was signed on the 21st of November, 1895, "by the
Minister of Foreign Affairs, representing the Khedival
Government, and the British diplomatic agent and
consul-general. … This new convention will supplant that of
August 4, 1877, which … was found to be defective, inasmuch as
it provided no penalty for the purchaser of a slave, but for
the seller only. An Egyptian notable, Ali Pasha Cherif, at
that time president of the Legislative Council, was tried for
buying slaves for his household, but escaped punishment
through a technicality of the law hitherto escaping notice. …
Under the existing regulations, every slave in the Egyptian
dominions has the right to complete freedom, and may demand
his certificate of manumission whenever he chooses. Thus, all
domestic slaves, of whom there are thousands in Cairo,
Alexandria, and the large towns, may call upon their masters
to set them free. Many choose to remain in nominal bondage,
preferring the certainty of food and shelter to the hardships
and uncertainty of looking after themselves."

United States, Consular Reports,
March, 1896, page 370.

EGYPT: A. D. 1897.
Italian evacuation of Kassala, in the eastern Sudan.

See (in this volume)
ITALY: A. D. 1897.

EGYPT: A. D. 1897 (June).
Census.

A census of Egypt, taken on the 1st of June, 1897, showed a
population of 9,700,000, the area being Egypt up to Wady
Halfa. In 1882 an imperfect census gave six and three-quarter
millions. Twelve per cent. of the males can write, the rest
are totally illiterate. There are, it is said, about 40,000
persons not really Egyptians, but who come from other parts of
the Ottoman Empire. The Bedouin number 570,000, but of these
only 89,000 are really nomads, the rest being semi-sedentary.
Of foreign residents there are 112,500, of whom the Greeks,
the most numerous, number 38,000. Then come the Italians, with
24,500. The British (including 6,500 Maltese and 5,000 of the
Army of Occupation) are 19,500; and the French (including
4,000 Algerians and Tunisians), 14,000. The Germans only
number 1,300.
{197}
The classification according to religion shows nearly
9,000,000 Moslems, 730,000 Christians, and 25,000 Jews. The
Christians include the Coptic race, numbering about 608,000.
Only a very small proportion profess the Roman Catholic and
Protestant faiths. Amongst the town populations Cairo contains
570,000, Alexandria 320,000.

EGYPT: A. D. 1897-1898.
The final campaigns of the Anglo-Egyptian conquest
of the Eastern Sudan.
Desperate battles of the Atbara and of Omdurman.

"The winter of 1896-1897 was passed, undisturbed by the enemy.
The extended and open front of the Egyptian army imperatively
called for fresh guarantees against a Dervish invasion. The
important strategic position of Abu Hamed was then held by the
enemy, to dislodge whom was the objective of the 1897 campaign.
The railway was boldly launched into the Nubian Desert; the
rail-head crept rapidly and surely towards the Dervish post,
until within striking distance of Abu Hamed: when the
river-column, by a forced march, through difficult country,
delivered an attack on 7th August. Abu Hamed was taken by the
Egyptian army under Major-General (now Sir Archibald) Hunter,
with trifling loss: and the effect of this victory caused the
precipitate evacuation of Berber. The Dervishes withdrew: the
Egyptians—not to lose so favourable an opening—advanced.
Berber, the key to the Sudan, was promptly re-occupied. The
railway was hastened forward; reinforcements were detrained,
before the close of the year, at a short distance from Berber:
and the Anglo-Egyptian authorities gathered force for the last
heat. British troops were called up. In this final struggle
[1898] nothing could be risked. An Egyptian reverse would have
redoubled the task on the accomplishment of which, having
deliberately accepted it, we had pledged our honour. Mahmud,
the Dervish emir, and that ubiquitous rascal Osman Digna, with
their united forces, were marching on Berber. They, however,
held up at the confluence of the Atbara, and comfortably
intrenched themselves in a 'zariba.' Here the Sirdar came out
to have a look at them. The Dervish force numbered about
19,000 men. The Anglo-Egyptian army was composed of 13,000
men. The odds were good enough for the Sirdar: and he went for
them. Under the demoralization created by some sharp artillery
practice, the Anglo-Egyptians stormed the 'zariba,' killed
three-fourths of the defenders, and chased the remainder away.
This victory [April 8, 1898], which cost over 500 men in
killed and wounded, broke the Dervish power for offence and
seriously damaged the Khalifa's prestige. With reinforcements,
bringing his army up to 22,000 men, including some picked
British regiments, the Sirdar then advanced slowly up the
river. It was a pilgrimage to the Mahdi's tomb, in sight of
which Cross and Crescent combined to overthrow the false
prophet. This sanguinary and decisive engagement [before
Omdurman] took place on 2nd September, 1898. The Khalifa was
put to flight; his forces were scattered and ridden down. On
the same evening, the Sirdar entered Omdurman, and released
the European captives. Subsequently, the British and Egyptian
flags were hoisted together at Khartum; and divine service was
celebrated at the spot where Gordon fell."

A. S. White,
The Expansion of Egypt,
pages 383-384
(New York: New Amsterdam Book Company).

"The honour of the fight [at Omdurman] must still go with the
men who died. Our men were perfect, but the dervishes were
superb—beyond perfection. It was their largest, best, and
bravest army that ever fought against us for Mahdism, and it
died worthily of the huge empire that Mahdism won and kept so
long. Their riflemen, mangled by every kind of death and
torment that man can devise, clung round the black flag and
the green, emptying their poor, rotten, homemade cartridges
dauntlessly. Their spearmen charged death at every minute
hopelessly. Their horsemen led each attack, riding into the
bullets till nothing was left but three horses trotting up to
our line, heads down, saying, 'For goodness' sake, let us in
out of this.' Not one rush, or two, or ten—but rush on rush,
company on company, never stopping, though all their view that
was not unshaken enemy was the bodies of the men who had
rushed before them. A dusky line got up and stormed forward:
it bent, broke up, fell apart, and disappeared. Before the
smoke had cleared, another line was bending and storming
forward in the same track.

"It was over. The avenging squadrons of the Egyptian cavalry
swept over the field. The Khalifa and the Sheikh-ed-Din had
galloped back to Omdurman. Ali Wad Helu was borne away on an
angareb with a bullet through his thigh-bone. Yakub lay dead
under his brother's banner. From the green army there now came
only death-enamoured desperadoes, strolling one by one towards
the rifles, pausing to shake a spear, turning aside to
recognise a corpse, then, caught by a sudden jet of fury,
bounding forward, checking, sinking limply to the ground. Now
under the black flag in a ring of bodies stood only three men,
facing the three thousand of the Third Brigade. They folded
their arms about the staff and gazed steadily forward. Two
fell. The last dervish stood up and filled his chest; he
shouted the name of his God and hurled his spear. Then he
stood quite still, waiting. It took him full; he quivered,
gave at the knees, and toppled with his head on his arms and
his face towards the legions of his conquerors. Over 11,000
killed, 16,000 wounded, 4,000 prisoners,—that was the
astounding bill of dervish casualties officially presented
after the battle of Omdurman. Some people had estimated the
whole dervish army at 1,000 less than this total: few had put
it above 50,000. The Anglo-Egyptian army on the day of battle
numbered, perhaps, 22,000 men: if the Allies had done the same
proportional execution at Waterloo, not one Frenchman would
have escaped. … The dervish army was killed out as hardly an
army has been killed out in the history of war. It will shock
you, but it was simply unavoidable. Not a man was killed
except resisting—very few except attacking. Many wounded were
killed, it is true, but that again was absolutely unavoidable.
… It was impossible not to kill the dervishes: they refused to
go back alive."

{198}

The same brilliant writer gives the following description of
Omdurman, as the British found it on entering the town after
the victory: "It began just like any other town or village of
the mean Sudan. Half the huts seemed left unfinished, the
other half to have been deserted and fallen to pieces. There
were no streets, no doors or windows except holes, usually no
roofs. As for a garden, a tree, a steading for a beast—any
evidence of thrift or intelligence, any attempt at comfort or
amenity or common cleanliness,—not a single trace of any of
it. Omdurman was just planless confusion of blind walls and
gaping holes, shiftless stupidity, contented filth and
beastliness. But that, we said, was only the outskirts: when
we come farther in we shall surely find this mass of
population manifesting some small symbols of a great dominion.
And presently we came indeed into a broader way than the
rest—something with the rude semblance of a street. Only it
was paved with dead donkeys, and here and there it disappeared
in a cullender of deep holes where green water festered. …
Omdurman was a rabbit-warren—a threadless labyrinth of tiny
huts or shelters, too flimsy for the name of sheds.
Oppression, stagnation, degradation, were stamped deep on
every yard of miserable Omdurman.

"But the people! We could hardly see the place for the people.
We could hardly hear our own voices for their shrieks of
welcome. We could hardly move for their importunate greetings.
They tumbled over each other like ants from every mud heap, from
behind every dung-hill, from under every mat. … They had been
trying to kill us three hours before. But they salaamed, none
the less, and volleyed, 'Peace be with you' in our track. All
the miscellaneous tribes of Arabs whom Abdullahi's fears or
suspicions had congregated in his capital, all the blacks his
captains had gathered together into franker
slavery—indiscriminate, half-naked, grinning the grin of the
sycophant, they held out their hands and asked for backsheesh.
Yet more wonderful were the women. The multitude of women whom
concupiscence had harried from every recess of Africa and
mewed up in Baggara harems came out to salute their new
masters. There were at least three of them to every man. Black
women from Equatoria and almost white women from Egypt,
plum-skinned Arabs and a strange yellow type with square, bony
faces and tightly-ringleted black hair, … the whole city was a
huge harem, a museum of African races, a monstrosity of
African lust."

G. W. Steevens,
With Kitchener to Khartum,
chapter 32-34
(copyright, Dodd, Mead & Company, quoted with permission).

"Anyone who has not served in the Sudan cannot conceive the
state of devastation and misery to which that unfortunate
country has been brought under Dervish rule. Miles and miles
of formerly richly cultivated country lies waste; villages are
deserted; the population has disappeared. Thousands of women
are without homes or families. Years must elapse before the
Sudan can recover from the results of its abandonment to
Dervish tyranny; but it is to be hoped and may be confidently
expected, that in course of time, under just and upright
government, the Sudan may be restored to prosperity; and the
great battle of September will be remembered as having
established peace, without which prosperity would have been
impossible; and from which thousands of misguided and wretched
people will reap the benefits of civilization."

E. S. Wortley,
With the Sirdar
(Scribner's Magazine, January, 1899).

EGYPT: A. D. 1898.
The country and its people after 15 years
of British occupation.

"The British occupation has now lasted for over fifteen years.
During the first five, comparatively little was accomplished,
owing to the uncertain and provisional character of our
tenure. The work done has been done in the main in the last
ten years, and was only commenced in earnest when the British
authorities began to realise that, whether we liked it or not,
we had got to stay; and the Egyptians themselves came to the
conclusion that we intended to stay. … Under our occupation
Egypt has been rendered solvent and prosperous; taxes have
been largely reduced; her population has increased by nearly
50 per cent.; the value and the productiveness of her soil has
been greatly improved; a regular and permanent system of
irrigation has been introduced into Lower Egypt, and is now in
the course of introduction into Upper Egypt; trade and
industry have made giant strides; the use of the Kurbash
[bastinado] has been forbidden; the Corvée has been
suppressed; regularity in the collection of taxes has been
made the rule, and not the exception; wholesale corruption has
been abolished; the Fellaheen can now keep the money they
earn, and are better off than they were before; the landowners
are all richer owing to the fresh supply of water, with the
consequent rapid increase in the saleable price of land;
justice is administered with an approach to impartiality;
barbarous punishments have been mitigated, if not abolished;
and the extraordinary conversion of Cairo into a fair
semblance of a civilised European capital has been repeated on
a smaller scale in all the chief centres of Egypt. To put the
matter briefly, if our occupation were to cease to-morrow, we
should leave Egypt and the Egyptians far better off than they
were when our occupation commenced.

"If, however, I am asked whether we have succeeded in the
alleged aim of our policy, that of rendering Egypt fit for
self-government, I should be obliged honestly to answer that
in my opinion we have made little or no progress towards the
achievement of this aim. The one certain result of our
interference in the internal administration of Egypt has been
to impair, if not to destroy, the authority of the Khedive; of
the Mudirs, who, as the nominees of the Effendina, rule over
the provinces; and of the Sheiks, who, in virtue of the favour
of the Mudirs, govern the villages. We have undoubtedly
trained a school of native officials who have learnt that it
is to their interest to administer the country more or less in
accordance with British ideas. Here and there we may have
converted an individual official to a genuine belief in these
ideas. But I am convinced that if our troops were withdrawn,
and our place in Egypt was not taken by any other civilised
European Power, the old state of things would revive at once,
and Egypt would be governed once more by the old system of
Baksheesh and Kurbash."

E. Dicey,
Egypt, 1881 to 1897
(Fortnightly Review, May, 1898).

{199}

Reviewing the report, for 1898, of Lord Cromer, the British
Agent and Consul-General, who is practically the director of
the government of Egypt, "The Spectator" (London) has noted
the principle of Lord Cromer's administration to be that of
"using English heads but Egyptian hands. In practice this
means the policy of never putting an Englishman into any post
which could be just as well filled by a native. In other
words, the Englishman is only used in the administration where
he is indispensable. Where he is not, the native, as is only just
and right, is employed. The outcome of this is that Lord
Cromer's work in Egypt has been carried out by 'a body of
officials who certainly do not exceed one hundred in number,
and might possibly, if the figures were rigorously examined,
be somewhat lower.' Lord Cromer adds, however, that 'these
hundred have been selected with the greatest care.' In fact,
the principle has been,—never employ an Englishman unless it
is necessary in the interests of good government to do so, but
then employ a first-class man. The result is that the
inspiring force in every Department of the Egyptian State is a
first-class English brain, and yet the natives are not
depressed by being deprived of their share of the
administration. The Egyptians, that is, do not feel the
legitimate grievance that is felt by the Tunisians and
Algerians when they see even little posts of a couple of
hundred a year filled by Frenchmen."

Spectator,
April 15, 1899.

EGYPT: A. D. 1898 (September-November).
The French expedition of M. Marchand at Fashoda.

On the 10th of September, eight days after destroying the
power of the Khalifa at Omdurman, the Sirdar, Lord Kitchener,
left that fallen capital with five gunboats and a considerable
force of Highlanders, Sudanese and Egyptians, to take possession
of the Upper Nile. At Fashoda, in the Shilluk country, a
little north of the junction of the Sobat with the White Nile,
he found a party of eight French officers and about a hundred
Senegalese troops, commanded by M. Marchand, entrenched at the
old government buildings in that place and claiming occupation
of the country. It had been known for some time that M.
Marchand was leading an expedition from the French Congo
towards the Nile, and the British government had been seeking
an explanation of its objects from the government of France,
uttering warnings, at the same time, that England would
recognize no rights in any part of the Nile Valley except the
rights of Egypt, which the evacuation of the Egyptian Sudan,
consequent on the conquests of the Mahdi and the Dervishes,
had not extinguished. Even long before the movements of M.
Marchand were known, it had been suspected that France
entertained the design of extending her great possessions in
West Africa eastward, to connect with the Nile, and, as early
as the spring of 1895, Sir Edward Grey, speaking for the
British Foreign Office, in reply to a question then asked in
the House of Commons, concerning rumors that a French
expedition from West Africa was approaching the Nile, said
with unmistakable meaning: "After all I have explained about
the claims we consider we have under past Agreements, and the
claims which we consider Egypt may have in the Nile Valley,
and adding to that the fact that those claims and the view of
the Government with regard to them are fully and clearly known
to the French Government, I cannot think it is possible that
these rumors deserve credence, because the advance of a French
expedition under secret instructions right from the other side
of Africa into a territory over which our claims have been
known for so long would be not merely an inconsistent and
unexpected act, but it must be perfectly well known to the
French Government that it would be an unfriendly act, and
would be so viewed by England." In December, 1897, the British
Ambassador at Paris had called the attention of the French
government to Sir Edward Grey's declaration, adding that "Her
Majesty's present Government entirely adhere to the language
that was on this occasion employed by their predecessors."

As between the two governments, then, such was the critical
situation of affairs when the Sirdar, who had been already
instructed how to act if he found intruders in the Nile
Valley, came upon M. Marchand and his little party at Fashoda.
The circumstances and the results of the meeting were reported by
him promptly as follows: "On reaching the old Government
buildings, over which the French flag was flying, M. Marchand,
accompanied by Captain Germain, came on board. After
complimenting them on their long and arduous journey, I
proceeded at once to inform M. Marchand that I was authorized
to state that the presence of the French at Fashoda and in the
