© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Y.Saeys et al.
Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice for a technique suited to the goals and resources of practitioners in the field.

Filter: univariate; multivariate
Wrapper: deterministic
Embedded (search in the combined space of feature subsets and hypotheses, FS ∪ hypothesis space):
  Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies
  Disadvantage: classifier-dependent selection
  Examples: decision trees; weighted naive Bayes (Duda et al., 2001); feature selection using the weight vector of SVM (Guyon et al., 2002; Weston et al., 2003)
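As a toy sketch of the embedded examples in the table — ranking features by the weight vector of a linear SVM, in the spirit of Guyon et al. (2002) — the following assumes scikit-learn; the data and dimensions are invented for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic two-class data: 100 samples, 20 features, of which
# only features 0-2 determine the class label (an assumption of
# this toy example, not of any cited study).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Embedded selection: fit the classifier once and read feature
# relevance directly off the model's weight vector |w|.
svm = LinearSVC(C=1.0, dual=False).fit(X, y)
relevance = np.abs(svm.coef_[0])

# Keep the five features with the largest absolute weights.
top5 = np.argsort(relevance)[::-1][:5]
print(sorted(top5.tolist()))
```

Because the selection falls out of model fitting itself, it interacts with the classifier and captures multivariate effects, at the price of being tied to that classifier — the trade-off summarized in the table.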
incorporate this search in the added space of feature subsets in the model selection.

In the context of classification, feature selection techniques can be organized into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods. Table 1 provides a common taxonomy of feature selection methods, showing for each technique the most prominent advantages and disadvantages, as well as some examples of the most influential techniques.

Filter techniques assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, this subset of features is presented as input to the classification algorithm. Advantages of filter techniques are that they easily scale to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, and then different classifiers can be evaluated.
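A minimal sketch of this filter pipeline — score each feature independently, discard the low-scoring ones, then hand the reduced matrix to any classifier — assuming SciPy and invented toy data:

```python
import numpy as np
from scipy import stats

# Toy dataset: 40 samples x 50 features, two classes; features
# 0-4 are shifted in class 1, the rest are pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 50))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0

# Univariate filter: a two-sample t-test scores every feature on
# its own, looking only at intrinsic properties of the data.
t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
selected = np.sort(np.argsort(np.abs(t))[::-1][:5])
print(selected.tolist())

# X[:, selected] can now be passed to any classifier; since no
# classifier was consulted, the same subset can be reused to
# evaluate several different classifiers.
```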
A review of feature selection techniques
A common disadvantage of filter methods is that they ignore the interaction with the classifier (the search in the feature subset space is separated from the search in the hypothesis space), and that most proposed techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. In order to overcome the problem of ignoring feature dependencies, a number of multivariate filter techniques were introduced, aiming at the incorporation of feature dependencies to some degree.

number growing exponentially with the pattern length k. As many of them will be irrelevant or redundant, feature selection techniques are then applied to focus on the subset of relevant variables.

3.1.1 Content analysis  The prediction of subsequences that code for proteins (coding potential prediction) has been a focus of interest since the early days of bioinformatics. Because many features can be extracted from a sequence, and most dependencies occur between adjacent positions, many variations of Markov models were developed. To deal with the high
Table 2. Key references for each type of feature selection technique in the microarray domain.

Filter, parametric: t-test (Jafari and Azuaje, 2006)
Filter, model-free: Wilcoxon rank sum (Thomas et al., 2001); bivariate (Bø and Jonassen, 2002)
Wrapper: sequential search (Inza et al., 2004; Xiong et al., 2001)
Embedded: random forest (Díaz-Uriarte and Alvarez de Andrés, 2006; …)
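As an illustration of the model-free entries in the table, a Wilcoxon rank sum filter can be sketched as follows (SciPy assumed; the expression matrix and group sizes are invented):

```python
import numpy as np
from scipy.stats import ranksums

# Toy microarray: 30 samples x 100 genes; genes 0-2 are
# differentially expressed between the two phenotype groups.
rng = np.random.default_rng(2)
expr = rng.normal(size=(30, 100))
labels = np.array([0] * 15 + [1] * 15)
expr[labels == 1, :3] += 2.0

# Model-free filter: the rank sum test makes no distributional
# assumption about the expression values of each gene.
pvals = np.array([
    ranksums(expr[labels == 0, g], expr[labels == 1, g]).pvalue
    for g in range(expr.shape[1])
])

# Rank genes by p-value; differentially expressed genes should
# appear near the top of the list.
ranking = np.argsort(pvals)
print(ranking[:5].tolist())
```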
Another line of research is performed in the context of the gene prediction setting, where structural elements such as the translation initiation site (TIS) and splice sites are modelled as specific classification problems. The problem of feature selection for structural element recognition was pioneered in Degroeve et al. (2002) for the problem of splice site prediction, combining a sequential backward method with an embedded SVM evaluation criterion to assess feature relevance. In Saeys et al. (2004), an estimation of distribution algorithm (EDA, a generalization of genetic algorithms) was used to gain more insight into the relevant features for splice site prediction. Similarly, the prediction of TIS is a suitable problem for applying feature selection techniques. In Liu et al. (2004), the authors demonstrate the advantages of using feature selection for this problem, using the feature-class entropy as a filter measure to remove irrelevant features.

In future research, FS techniques can be expected to be useful for a number of challenging prediction tasks, such as identifying relevant features related to alternative splice sites and alternative TIS.

3.2 Feature selection for microarray analysis

During the last decade, the advent of microarray datasets stimulated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques, because of their large dimensionality (up to several tens of thousands of genes) and their small sample sizes (Somorjai et al., 2003). Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain.

In order to deal with these particular characteristics of microarray data, the obvious need for dimension reduction techniques was realized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and soon their application became a de facto standard in the field. Whereas in 2001 the field of microarray analysis was still claimed to be in its infancy (Efron et al., 2001), a considerable and valuable effort has since been made to contribute new FS methodologies and to adapt known ones (Jafari and Azuaje, 2006). A general overview of the most influential techniques, organized according to the general FS taxonomy of Section 2, is shown in Table 2.

3.2.1 The univariate filter paradigm: simple yet efficient  Because of the high dimensionality of most microarray analyses, fast and efficient FS techniques such as univariate filter methods have attracted the most attention. The prevalence of these univariate techniques has dominated the field, and up to now comparative evaluations of different classification and FS techniques over DNA microarray datasets have focused only on the univariate case (Dudoit et al., 2002; Lee et al., 2005; Li et al., 2004; Statnikov et al., 2005). This domination of the univariate approach can be explained by a number of reasons:

– the output provided by univariate feature rankings is intuitive and easy to understand;
– the gene ranking output could fulfill the objectives and expectations that bio-domain experts have when wanting to subsequently validate the results by laboratory techniques or to explore literature searches; such experts might not feel the need for selection techniques that take gene interactions into account;

– the possible unawareness among subgroups of gene expression domain experts of the existence of data analysis techniques to select genes in a multivariate way;

– the extra computation time needed by multivariate gene

alleviates the problem of small sample sizes in microarray studies, enhancing the robustness against outliers.

We also mention promising types of non-parametric metrics which, instead of trying to identify differentially expressed genes at the whole population level (e.g. comparison of sample means), are able to capture genes which are significantly dysregulated in only a subset of samples (Lyons-Weiler et al., 2004; Pavlidis and Poirazi, 2006). These types of methods offer a more patient-specific approach for the identification of markers, and can select genes exhibiting complex patterns that are missed by metrics that work under the classical comparison
Table 3. Key references for each type of feature selection technique in the domain of mass spectrometry.

Filter, parametric: t-test (Liu et al., 2002; Wu et al., 2003); F-test (Bhanot et al., 2006)
Filter, model-free: Peak Probability Contrast (Tibshirani et al., 2004); Kolmogorov–Smirnov test (Yu et al., 2005)
Filter, multivariate: CFS (Liu et al., 2002); Relief-F (Prados et al., 2004)
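To illustrate the kind of non-parametric score listed in the table, a per-feature Kolmogorov–Smirnov statistic over toy spectra might look as follows (SciPy assumed; the sample counts, feature counts and informative m/z index are invented):

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy spectra: 60 samples x 100 m/z intensity features; the
# intensity distribution at feature 10 differs between groups.
rng = np.random.default_rng(3)
spectra = rng.lognormal(mean=0.0, sigma=1.0, size=(60, 100))
status = np.array([0] * 30 + [1] * 30)
spectra[status == 1, 10] *= 10.0

# KS statistic per feature: compares the two empirical intensity
# distributions without assuming normality.
ks = np.array([
    ks_2samp(spectra[status == 0, f], spectra[status == 1, f]).statistic
    for f in range(spectra.shape[1])
])

best = int(np.argmax(ks))
print(best)
```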
evaluation measure, especially suited to the demand for screening different types of errors in many biomedical scenarios.

The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests (a classifier that combines many single decision trees) in an embedded way to calculate the importance of each gene (Díaz-Uriarte and Alvarez de Andrés, 2006; Jiang et al., 2004). Another line of embedded FS techniques uses the weights of each feature in linear classifiers, such as SVMs (Guyon et al., 2002) and logistic regression (Ma and Huang, 2005). These weights reflect the relevance of each gene in a multivariate way, and thus allow for the removal of genes with very small weights.

Partially due to the higher computational complexity of wrapper and, to a lesser degree, embedded approaches, these techniques have not received as much interest as filter proposals. However, an advisable practice is to pre-reduce the search space using a univariate filter method, and only then apply wrapper or embedded methods, thereby fitting the computation time to the available resources.

3.3 Mass spectra analysis

Mass spectrometry technology (MS) is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15 500 data points in the spectrum between 500 and 20 000 m/z, and the number of points grows even further with higher resolution instruments.

For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity. As Somorjai et al. (2003) explain, the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets. Although the amount of publications on mass spectrometry based data mining is not comparable to the level of maturity reached in the microarray analysis domain, an interesting collection of methods has been presented in the last 4–5 years (see Hilario et al., 2006; Shin and Markey, 2006 for recent reviews) since the pioneering work of Petricoin et al. (2002).

Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the crucial next step is to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a predictive feature, thus applying FS techniques over initial huge pools of about 15 000 variables (Li et al., 2004; Petricoin et al., 2002), up to around 100 000 variables (Ball et al., 2002). On the other hand, many current studies perform aggressive feature extraction procedures using elaborate peak detection and alignment techniques (see Coombes et al., 2007; Hilario et al., 2006; Shin and Markey, 2006 for a detailed description of these techniques). These procedures tend to reduce the dimensionality from which supervised FS techniques start their work to fewer than 500 variables (Bhanot et al., 2006; Ressom et al., 2007; Tibshirani et al., 2004). A feature extraction step is thus advisable to bring the computational costs of many FS techniques down to a feasible size in these MS scenarios. Table 3 presents an overview of FS techniques used in the domain of mass spectrometry. Similar to the domain of microarray analysis, univariate filter techniques appear to be the most commonly used, although the use of embedded techniques is certainly emerging as an alternative. Although the t-test maintains a high level of popularity (Liu et al., 2002; Wu et al., 2003), other parametric measures, such as the F-test (Bhanot et al., 2006), and a notable variety of non-parametric scores (Tibshirani et al., 2004; Yu et al., 2005) have also been used in several MS studies. Multivariate filter techniques, on the other hand, are still somewhat underrepresented (Liu et al., 2002; Prados et al., 2004).

Wrapper approaches have demonstrated their usefulness in MS studies through a group of influential works. Different types of population-based randomized heuristics are used as search engines in the major part of these papers: genetic algorithms (Li et al., 2004; Petricoin et al., 2002), particle swarm
optimization (Ressom et al., 2005) and ant colony procedures (Ressom et al., 2007). It is worth noting that while the first two references start the search procedure in 15 000 dimensions by considering each m/z ratio as an initial predictive feature, aggressive peak detection and alignment processes reduce the initial dimension to about 300 variables in the last two references (Ressom et al., 2005; Ressom et al., 2007).

An increasing number of papers use the embedded capacity of several classifiers to discard input features. Variations of the popular method originally proposed for gene expression domains by Guyon et al. (2002), using the weights of the

approaches such as boosting have been adapted to improve the robustness and stability of final, discriminative methods (Ben-Dor et al., 2000; Dudoit et al., 2002).

Novel ensemble techniques in the microarray and mass spectrometry domains include averaging over multiple single feature subsets (Levner, 2005; Li and Yang, 2002), integrating a collection of univariate differential gene expression purpose statistics via a distance synthesis scheme (Yang et al., 2005), using different runs of a genetic algorithm to assess the relative importance of each feature (Li et al., 2001, 2004), computing the Kolmogorov–Smirnov test in different bootstrap samples
When the haplotype structure in the target region is unknown, a widely used approach is to choose markers at regular intervals (Lee and Kang, 2004), given either the number of SNPs to choose or the desired interval. In Li et al. (2005), an ensemble approach is successfully applied to the identification of relevant SNPs for alcoholism, while Gong et al. (2005) propose a robust feature selection technique based on a hybrid between a genetic algorithm and an SVM. The Relief-F feature selection algorithm, in conjunction with three classification algorithms (k-NN, SVM and naive Bayes), has been proposed in Wang et al. (2006). Genetic algorithms have been applied to the search for the best subset of SNPs, evaluating candidate subsets with a multivariate filter (CFS), and also in a wrapper manner, with a decision tree as the supervised classification paradigm (Shah and Kusiak, 2004). The multiple linear regression SNP prediction algorithm (He and Zelikovsky, 2006) predicts a complete genotype based on the values of its informative SNPs (selected with a stepwise tag selection algorithm), their positions among all SNPs, and a sample of complete genotypes. In Sham et al. (2007), the tag SNP selection method allows the user to specify variable tagging thresholds, based on correlations, for different SNPs.

5.2 Text and literature mining

Text and literature mining is emerging as a promising area for data mining in biology (Cohen and Hersch, 2005; Jensen et al., 2006). One important representation of text and documents is the so-called bag-of-words (BOW) representation, where each word in the text represents one variable, and its value consists of the frequency of that specific word in the text. It goes without saying that such a representation of the text may lead to very high-dimensional datasets, pointing out the need for feature selection techniques.

Although the application of feature selection techniques is common in the field of text classification (see e.g. Forman, 2003 for a review), their application in the biomedical domain is still in its infancy. Some examples of FS techniques in the biomedical domain include the work of Dobrokhotov et al. (2003), who use the Kullback–Leibler divergence as a univariate filter method to find discriminating words in a medical annotation task, the work of Eom and Zhang (2000), who use symmetrical uncertainty (an entropy-based filter method) for identifying relevant features for protein interaction discovery, and the work of Han et al. (2006), which
discusses the use of feature selection for a document classification task.

It can be expected that, for tasks such as biomedical document clustering and classification, the large number of feature selection techniques that were already developed in the text mining community will be of practical use for researchers in biomedical literature mining (Cohen and Hersch, 2005).

6 FS SOFTWARE PACKAGES

In order to provide the interested reader with some pointers to

To conclude, we would like to note that, in order to maintain an appropriate size of the article, we had to limit the number of referenced studies. We therefore apologize to the authors of papers that were not cited in this work.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their constructive comments, which significantly improved the quality of this review. This work was supported by BOF grant 01P10306 from Ghent University to Y.S., and the
In Proceedings of the 14th European Conference on Machine Learning (ECML-2003), pp. 84–95.
Daly,M., et al. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229–232.
Dean,N. and Raftery,A. (2005) Normal uniform mixture differential gene expression detection in cDNA microarrays. BMC Bioinformatics, 6, 173.
Degroeve,S., et al. (2002) Feature subset selection for splice site prediction. Bioinformatics, 18 (Suppl. 2), 75–83.
Delcher,A., et al. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.
Díaz-Uriarte,R. and Alvarez de Andrés,S. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3.
Ding,C. and Peng,H. (2003) Minimum redundancy feature selection from
Inza,I., et al. (2000) Feature subset selection by Bayesian networks based optimization. Artif. Intell., 123, 157–184.
Inza,I., et al. (2004) Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med., 31, 91–103.
Jafari,P. and Azuaje,F. (2006) An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med. Inform. Decis. Mak., 6, 27.
Jensen,L., et al. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet., 7, 119–129.
Jiang,H., et al. (2004) Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5, 81.
Jirapech-Umpai,T. and Aitken,S. (2005) Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive
Mamitsuka,H. (2006) Selecting features in microarray classification using ROC curves. Pattern Recognit., 39, 2393–2404.
Ma,S. and Huang,J. (2005) Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics, 21, 4356–4362.
Medina,I., et al. (2007) Prophet, a web-based tool for class prediction using microarray data. Bioinformatics, 23, 390–391.
Molinaro,A., et al. (2005) Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21, 3301–3307.
Newton,M., et al. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol., 8, 37–52.
Ooi,C. and Tan,P. (2003) Genetic algorithms applied to multi-class prediction for
Smyth,G. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3, Article 3.
Somorjai,R., et al. (2003) Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics, 19, 1484–1491.
Statnikov,A., et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643.
Storey,J. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64, 479–498.
Su,Y., et al. (2003) RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19, 1587–1579.
Tadesse,M., et al. (2004) Identification of DNA regulatory motifs using Bayesian