Feature Selection

Feature selection
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or
variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use
in model construction. Feature selection techniques are used for several reasons:
simplification of models to make them easier to interpret by researchers/users,[1]

shorter training times,[2]
to avoid the curse of dimensionality,[3]
improve data's compatibility with a learning model class,[4]
encode inherent symmetries present in the input space.[5][6][7][8]
The central premise when using a feature selection technique is that the data contains some features that are
either redundant or irrelevant, and can thus be removed without incurring much loss of information.[9]
Redundant and irrelevant are two distinct notions, since one relevant feature may be redundant in the presence
of another relevant feature with which it is strongly correlated.[10]
Feature selection techniques should be distinguished from feature extraction.[11] Feature extraction creates new
features from functions of the original features, whereas feature selection returns a subset of the features.
Feature selection techniques are often used in domains where there are many features and comparatively few
samples (or data points). Archetypal cases for the application of feature selection include the analysis of written
texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of
samples.
Introduction
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature
subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is
to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive
search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of
evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between
the three main categories of feature selection algorithms: wrappers, filters and embedded methods.[10]
Wrapper methods use a predictive model to score feature subsets. Each new subset is used to
train a model, which is tested on a hold-out set. Counting the number of mistakes made on that
hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods
train a new model for each subset, they are very computationally intensive, but usually provide
the best performing feature set for that particular type of model or typical problem.
Filter methods use a proxy measure instead of the error rate to score a feature subset. This
measure is chosen to be fast to compute, while still capturing the usefulness of the feature set.
Common measures include the mutual information,[10] the pointwise mutual information,[12]
Pearson product-moment correlation coefficient, Relief-based algorithms,[13] and inter/intra
class distance or the scores of significance tests for each class/feature combinations.[12][14]
Filters are usually less computationally intensive than wrappers, but they produce a feature set
which is not tuned to a specific type of predictive model.[15] This lack of tuning means a feature
set from a filter is more general than the set from a wrapper, usually giving lower prediction
performance than a wrapper. However the feature set doesn't contain the assumptions of a
prediction model, and so is more useful for exposing the relationships between the features.
Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off
point in the ranking is chosen via cross-validation. Filter methods have also been used as a
preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems.
One other popular approach is the Recursive Feature Elimination algorithm,[16] commonly used
with Support Vector Machines to repeatedly construct a model and remove features with low
weights.
Embedded methods are a catch-all group of techniques which perform feature selection as part
of the model construction process. The exemplar of this approach is the LASSO method for
constructing a linear model, which penalizes the regression coefficients with an L1 penalty,
shrinking many of them to zero. Any features which have non-zero regression coefficients are
'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which
bootstraps samples;[17] Elastic net regularization, which combines the L1 penalty of LASSO
with the L2 penalty of ridge regression; and FeaLect which scores all the features based on
combinatorial analysis of regression coefficients.[18] AEFS further extends LASSO to nonlinear
scenario with autoencoders.[19] These approaches tend to be between filters and wrappers in
terms of computational complexity.
In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a
wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each
round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically
done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting.
More robust methods have been explored, such as branch and bound and piecewise linear network.
Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be
broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through
the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be
computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the
search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques
are embedded in, and specific to, a model.
Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of
features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of
the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical,
so at some implementor (or operator) defined stopping point, the subset of features with the highest score
discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by
algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time
has been surpassed, etc.
Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional
projections of the data that score highly: the features that have the largest projections in the lower-dimensional
space are then selected.
Search approaches include:
Exhaustive[20]
Best first
Simulated annealing
Genetic algorithm[21]
Greedy forward selection[22][23][24]
Greedy backward elimination
Particle swarm optimization[25]
Targeted projection pursuit
Scatter search[26][27]
Variable neighborhood search[28][29]
Two popular filter metrics for classification problems are correlation and mutual information, although neither
are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality
and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are
computed between a candidate feature (or set of features) and the desired output category. There are, however,
true metrics that are a simple function of the mutual information;[30] see here.
Other available filter metrics include:
Class separability
Error probability
Inter-class distance
Probabilistic distance
Entropy
Consistency-based feature selection
Correlation-based feature selection
Optimality criteria
The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many
common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples
include Akaike information criterion (AIC) and Mallows's Cp , which have a penalty of 2 for each added
feature. AIC is based on information theory, and is effectively derived via the maximum entropy
principle.[31][32]
Other criteria are Bayesian information criterion (BIC), which uses a penalty of for each added
feature, minimum description length (MDL) which asymptotically uses , Bonferroni / RIC which use
, maximum dependency feature selection, and a variety of new criteria that are motivated by false
discovery rate (FDR), which use something close to . A maximum entropy rate criterion may also be
used to select the most relevant subset of features.[33]
Structure learning
Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection
finds the relevant feature set for a specific target variable whereas structure learning finds the relationships
between all the variables, usually by expressing these relationships as a graph. The most common structure
learning algorithms assume the data is generated by a Bayesian Network, and so the structure is a directed
graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target
node, and in a Bayesian Network, there is a unique Markov Blanket for each node.[34]
Information Theory Based Feature Selection Mechanisms

There are different Feature Selection mechanisms around that utilize mutual information for scoring the
different features. They usually use all the same algorithm:
1. Calculate the mutual information as score for between all features ( ) and the target class
(c)
2. Select the feature with the largest score (e.g. ) and add it to the set of selected
features (S )
3. Calculate the score which might be derived from the mutual information
4. Select the feature with the largest score and add it to the set of select features (e.g.
)
5. Repeat 3. and 4. until a certain number of features is selected (e.g. )
The simplest approach uses the mutual information as the "derived" score.[35]
However, there are different approaches, that try to reduce the redundancy between features.
Minimum-redundancy-maximum-relevance (mRMR) feature selection
Peng et al.[36] proposed a feature selection method that can use either mutual information, correlation, or
distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the
presence of the other selected features. The relevance of a feature set S for the class c is defined by the average
value of all mutual information values between the individual feature fi and the class c as follows:
The redundancy of all features in the set S is the average value of all mutual information values between the
feature fi and the feature fj:
The mRMR criterion is a combination of two measures given above and is defined as follows:
Suppose that there are n full-set features. Let xi be the set membership indicator function for feature fi, so that
xi=1 indicates presence and xi=0 indicates absence of the feature fi in the globally optimal feature set. Let
and . The above may then be written as an optimization problem:
The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection
algorithm that maximizes the mutual information between the joint distribution of the selected features and the
classification variable. As mRMR approximates the combinatorial estimation problem with a series of much
smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are
more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to
measure interactions between features which can increase relevancy. This can lead to poor performance[35]
when the features are individually useless, but are useful when combined (a pathological case is found when
the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of
data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little
pairwise redundancy.
mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in
different ways.[35][37]
Quadratic programming feature selection
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been
selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to
reduce some features, it might also be reformulated as a global quadratic programming optimization problem as
follows:[38]
where is the vector of feature relevancy assuming there are n features in

total, is the matrix of feature pairwise redundancy, and represents relative
feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards
features with smaller entropy,[39] due to its placement of the feature self redundancy term on the
diagonal of H.
Conditional mutual information
Another score derived for the mutual information is based on the conditional relevancy:[39]
where and .
An advantage of SPECCMI is that it can be solved simply via finding the dominant eigenvector of Q, thus is
very scalable. SPECCMI also handles second-order feature interaction.
Joint mutual information
In a study of different scores Brown et al.[35] recommended the joint mutual information[40] as a good score for
feature selection. The score tries to find the feature, that adds the most new information to the already selected
features, in order to avoid redundancy. The score is formulated as follows:
The score uses the conditional mutual information and the mutual information to estimate the redundancy
between the already selected features ( ) and the feature under investigation ( ).
Hilbert-Schmidt Independence Criterion Lasso based feature

selection
For high-dimensional and small sample data (e.g., dimensionality > 105 and the number of samples < 103 ), the
Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful.[41] HSIC Lasso optimization problem
is given as
where is a kernel-based independence measure called the (empirical) Hilbert-

Schmidt independence criterion (HSIC), denotes the trace, is the regularization parameter,
and are input and output centered Gram matrices, and
are Gram matrices, and are kernel functions, is the

centering matrix, is the m-dimensional identity matrix (m: the number of samples), is the m-
dimensional vector with all ones, and is the -norm. HSIC always takes a non-negative value, and is
zero if and only if two random variables are statistically independent when a universal reproducing kernel such
as the Gaussian kernel is used.
The HSIC Lasso can be written as
where is the Frobenius norm. The optimization problem is a Lasso problem, and thus it can be
efficiently solved with a state-of-the-art Lasso solver such as the dual augmented Lagrangian method.
Correlation feature selection

The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following
hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to
each other".[42][43] The following equation gives the merit of a feature subset S consisting of k features:
Here, is the average value of all feature-classification correlations, and is the average value of all
feature-feature correlations. The CFS criterion is defined as follows:
The and variables are referred to as correlations, but are not necessarily Pearson's correlation
coefficient or Spearman's ρ. Hall's dissertation uses neither of these, but uses three different measures of
relatedness, minimum description length (MDL), symmetrical uncertainty, and relief.
Let xi be the set membership indicator function for feature fi; then the above can be rewritten as an optimization
problem:
The combinatorial problems above are, in fact, mixed 0–1 linear programming problems that can be solved by
using branch-and-bound algorithms.[44]
Regularized trees
The features from a decision tree or a tree ensemble are shown to be redundant. A recent method called
regularized tree[45] can be used for feature subset selection. Regularized trees penalize using a variable similar
to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build
one tree model (or one tree ensemble model) and thus are computationally efficient.
Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are
invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as
normalization. Regularized random forest (RRF)[46] is one type of regularized trees. The guided RRF is an
enhanced RRF which is guided by the importance scores from an ordinary random forest.
Overview on metaheuristics methods

A metaheuristic is a general description of an algorithm dedicated to solve difficult (typically NP-hard problem)
optimization problems for which there is no classical solving methods. Generally, a metaheuristic is a stochastic
algorithm tending to reach a global optimum. There are many metaheuristics, from a simple local search to a
complex global search algorithm.
Main principles
The feature selection methods are typically presented in three classes based on how they combine the selection
algorithm and the model building.
Filter method
Filter type methods select variables regardless of the

model. They are based only on general features like the Filter Method for feature selection
correlation with the variable to predict. Filter methods
suppress the least interesting variables. The other variables
will be part of a classification or a regression model used to classify or to predict data. These methods are
particularly effective in computation time and robust to overfitting.[47]
Filter methods tend to select redundant variables when they do not consider the relationships between
variables. However, more elaborate features try to minimize this problem by removing variables highly
correlated to each other, such as the Fast Correlation Based Filter (FCBF) algorithm.[48]
Wrapper method
Wrapper methods evaluate subsets of variables which

allows, unlike filter approaches, to detect the possible
interactions amongst variables.[49] The two main
disadvantages of these methods are:
Wrapper Method for Feature selection
The increasing overfitting risk when the number
of observations is insufficient.
The significant computation time when the number of variables is large.
Embedded method
Embedded methods have been recently proposed that try

to combine the advantages of both previous methods. A
learning algorithm takes advantage of its own variable
selection process and performs feature selection and
classification simultaneously, such as the FRMT
Embedded method for Feature selection
algorithm.[50]
Application of feature selection metaheuristics
This is a survey of the application of feature selection metaheuristics lately used in the literature. This survey
was realized by J. Hammon in her 2013 thesis.[47]
Evaluation
Application Algorithm Approach Classifier Reference
Function
Feature Selection using Phuong

SNPs Filter r2
Feature Similarity 2005[49]
Classification
SNPs Genetic algorithm Wrapper Decision Tree
accuracy (10-fold) Shah 2004[51]
Filter + Predicted residual

SNPs Hill climbing
Wrapper
Naive Bayesian
sum of squares Long 2007[52]
Classification Ustunkar
SNPs Simulated annealing Naive bayesian
accuracy (5-fold) 2011[53]
Segments Artificial Neural
Ant colony Wrapper MSE Al-ani 2005
parole Network
Marketing Simulated annealing Wrapper Regression AIC, r2 Meiri 2006[54]
Simulated annealing, Kapetanios

Economics Wrapper Regression BIC
genetic algorithm 2007[55]
Multiple Linear
Spectral Regression, root-mean-square Broadhurst et
Genetic algorithm Wrapper
Mass Partial Least error of prediction al. 1997[56]
Squares
Zhang
Spam Binary PSO + Mutation Wrapper Decision tree weighted cost
2014[25]
Support Vector Chuang
Microarray Tabu search + PSO Wrapper Machine, K Euclidean Distance
Nearest Neighbors 2009[57]
Support Vector Classification

Microarray PSO + Genetic algorithm Wrapper
Machine accuracy (10-fold) Alba 2007[58]
Genetic algorithm + Support Vector Classification Duval

Microarray Embedded
Iterated Local Search Machine accuracy (10-fold) 2009[59]
Posterior
Microarray Iterated local search Wrapper Regression
Probability Hans 2007[60]
Classification Jirapech-
K Nearest accuracy (Leave- Umpai
Microarray Genetic algorithm Wrapper
Neighbors one-out cross-
validation) 2005[61]
Classification
K Nearest accuracy (Leave-
Microarray Hybrid genetic algorithm Wrapper
Neighbors one-out cross- Oh 2004[62]
validation)
Support Vector Sensitivity and
Machine specificity Xuan 2011[63]
Classification
All paired Support accuracy (Leave-
Vector Machine one-out cross- Peng 2003[64]
validation)
Support Vector Classification Hernandez

Microarray Genetic algorithm Embedded
Machine accuracy (10-fold) 2007[65]
Classification
Support Vector accuracy (Leave- Huerta
Microarray Genetic algorithm Hybrid
Machine one-out cross- 2006[66]
validation)
Support Vector Classification
Microarray Genetic algorithm
Machine accuracy (10-fold) Muni 2006[67]
Support Vector Jourdan

Microarray Genetic algorithm Wrapper EH-DIALL, CLUMP
Machine 2005[68]
Alzheimer's Support vector Classification Zhang

Welch's t-test Filter
disease machine accuracy (10-fold) 2015[69]
Infinite Feature Selection
(https://it.mathworks.com/
Computer Average Precision,
vision
matlabcentral/fileexchang Filter Independent
ROC AUC Roffo 2015[70]
e/56937-feature-selection-
library)
Eigenvector Centrality FS
(https://it.mathworks.com/ Average Precision,
Roffo & Melzi
Microarrays matlabcentral/fileexchang Filter Independent Accuracy, ROC
e/56937-feature-selection- AUC 2016[71]
library)
Structural Shaharanee
Accuracy,
XML Symmetrical Tau (ST) Filter Associative & Hadzic
Coverage
Classification 2014
Feature selection embedded in learning algorithms

Some learning algorithms perform feature selection as part of their overall operation. These include:
-regularization techniques, such as sparse regression, LASSO, and -SVM

Regularized trees,[45] e.g. regularized random forest implemented in the RRF package[46]
Decision tree[72]
Memetic algorithm
Random multinomial logit (RMNL)
Auto-encoding networks with a bottleneck-layer
Submodular feature selection[73][74][75]
Local learning based feature selection.[76] Compared with traditional methods, it does not
involve any heuristic search, can easily handle multi-class problems, and works for both linear
and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric
experiments showed that the method can achieve a close-to-optimal solution even when data
contains >1M irrelevant features.
Recommender system based on feature selection.[77] The feature selection methods are
introduced into recommender system research.
See also
Cluster analysis
Data mining
Dimensionality reduction
Feature extraction
Hyperparameter optimization
Model selection
Relief (feature selection)
References
1. Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to
Statistical Learning (http://www-bcf.usc.edu/~gareth/ISL/). Springer. p. 204.
2. Brank, Janez; Mladenić, Dunja; Grobelnik, Marko; Liu, Huan; Mladenić, Dunja; Flach, Peter A.;
Garriga, Gemma C.; Toivonen, Hannu; Toivonen, Hannu (2011), "Feature Selection" (http://link.
springer.com/10.1007/978-0-387-30164-8_306), in Sammut, Claude; Webb, Geoffrey I. (eds.),
Encyclopedia of Machine Learning, Boston, MA: Springer US, pp. 402–406, doi:10.1007/978-0-
387-30164-8_306 (https://doi.org/10.1007%2F978-0-387-30164-8_306), ISBN 978-0-387-
30768-8, retrieved 2021-07-13
3. Kramer, Mark A. (1991). "Nonlinear principal component analysis using autoassociative neural
networks" (https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.690370209). AIChE
Journal. 37 (2): 233–243. doi:10.1002/aic.690370209 (https://doi.org/10.1002%2Faic.69037020
9). ISSN 1547-5905 (https://www.worldcat.org/issn/1547-5905).
4. Kratsios, Anastasis; Hyndman, Cody (2021). "NEU: A Meta-Algorithm for Universal UAP-
Invariant Feature Representation" (http://jmlr.org/papers/v22/18-803.html). Journal of Machine
Learning Research. 22 (92): 1–51. ISSN 1533-7928 (https://www.worldcat.org/issn/1533-7928).
5. Persello, Claudio; Bruzzone, Lorenzo (July 2014). "Relevant and invariant feature selection of
hyperspectral images for domain generalization" (https://dx.doi.org/10.1109/igarss.2014.694725
2). 2014 IEEE Geoscience and Remote Sensing Symposium (https://ris.utwente.nl/ws/files/122
945513/Persello2014relevant.pdf) (PDF). IEEE. pp. 3562–3565.
doi:10.1109/igarss.2014.6947252 (https://doi.org/10.1109%2Figarss.2014.6947252). ISBN 978-
1-4799-5775-0. S2CID 8368258 (https://api.semanticscholar.org/CorpusID:8368258).
6. Hinkle, Jacob; Muralidharan, Prasanna; Fletcher, P. Thomas; Joshi, Sarang (2012). Fitzgibbon,
Andrew; Lazebnik, Svetlana; Perona, Pietro; Sato, Yoichi; Schmid, Cordelia (eds.). "Polynomial
Regression on Riemannian Manifolds" (https://link.springer.com/chapter/10.1007/978-3-642-33
712-3_1). Computer Vision – ECCV 2012. Lecture Notes in Computer Science. Berlin,
Heidelberg: Springer. 7574: 1–14. arXiv:1201.2395 (https://arxiv.org/abs/1201.2395).
doi:10.1007/978-3-642-33712-3_1 (https://doi.org/10.1007%2F978-3-642-33712-3_1).
ISBN 978-3-642-33712-3. S2CID 8849753 (https://api.semanticscholar.org/CorpusID:8849753).
7. Yarotsky, Dmitry (2021-04-30). "Universal Approximations of Invariant Maps by Neural
Networks" (https://doi.org/10.1007/s00365-021-09546-1). Constructive Approximation. 55: 407–
474. arXiv:1804.10306 (https://arxiv.org/abs/1804.10306). doi:10.1007/s00365-021-09546-1 (htt
ps://doi.org/10.1007%2Fs00365-021-09546-1). ISSN 1432-0940 (https://www.worldcat.org/issn/
1432-0940). S2CID 13745401 (https://api.semanticscholar.org/CorpusID:13745401).
8. Hauberg, Søren; Lauze, François; Pedersen, Kim Steenstrup (2013-05-01). "Unscented
Kalman Filtering on Riemannian Manifolds" (https://doi.org/10.1007/s10851-012-0372-9).
Journal of Mathematical Imaging and Vision. 46 (1): 103–120. doi:10.1007/s10851-012-0372-9
(https://doi.org/10.1007%2Fs10851-012-0372-9). ISSN 1573-7683 (https://www.worldcat.org/iss
n/1573-7683). S2CID 8501814 (https://api.semanticscholar.org/CorpusID:8501814).
9. Kratsios, Anastasis; Hyndman, Cody (June 8, 2021). "NEU: A Meta-Algorithm for Universal
UAP-Invariant Feature Representation" (https://jmlr.org/papers/v22/18-803.html). Journal of
Machine Learning Research. 22: 10312. Bibcode:2015NatSR...510312B (https://ui.adsabs.harv
ard.edu/abs/2015NatSR...510312B). doi:10.1038/srep10312 (https://doi.org/10.1038%2Fsrep10
312). PMC 4437376 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4437376). PMID 25988841
(https://pubmed.ncbi.nlm.nih.gov/25988841).
10. Guyon, Isabelle; Elisseeff, André (2003). "An Introduction to Variable and Feature Selection" (htt
p://jmlr.csail.mit.edu/papers/v3/guyon03a.html). JMLR. 3.
11. Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-
driven filterbank for automatic speaker verification". Digital Signal Processing. 104: 102795.
arXiv:2007.10729 (https://arxiv.org/abs/2007.10729). doi:10.1016/j.dsp.2020.102795 (https://doi.
org/10.1016%2Fj.dsp.2020.102795). S2CID 220665533 (https://api.semanticscholar.org/Corpu
sID:220665533).
12. Yang, Yiming; Pedersen, Jan O. (1997). A comparative study on feature selection in text
categorization (http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/yang97com
parative.pdf) (PDF). ICML.
13. Urbanowicz, Ryan J.; Meeker, Melissa; LaCava, William; Olson, Randal S.; Moore, Jason H.
(2018). "Relief-Based Feature Selection: Introduction and Review" (https://www.ncbi.nlm.nih.go
v/pmc/articles/PMC6299836). Journal of Biomedical Informatics. 85: 189–203.
arXiv:1711.08421 (https://arxiv.org/abs/1711.08421). doi:10.1016/j.jbi.2018.07.014 (https://doi.or
g/10.1016%2Fj.jbi.2018.07.014). PMC 6299836 (https://www.ncbi.nlm.nih.gov/pmc/articles/PM
C6299836). PMID 30031057 (https://pubmed.ncbi.nlm.nih.gov/30031057).
14. Forman, George (2003). "An extensive empirical study of feature selection metrics for text
classification" (http://www.jmlr.org/papers/volume3/forman03a/forman03a.pdf) (PDF). Journal of
Machine Learning Research. 3: 1289–1305.
15. Yishi Zhang; Shujuan Li; Teng Wang; Zigang Zhang (2013). "Divergence-based feature
selection for separate classes". Neurocomputing. 101 (4): 32–42.
doi:10.1016/j.neucom.2012.06.036 (https://doi.org/10.1016%2Fj.neucom.2012.06.036).
16. Guyon I.; Weston J.; Barnhill S.; Vapnik V. (2002). "Gene selection for cancer classification
using support vector machines" (https://doi.org/10.1023%2FA%3A1012487302797). Machine
Learning. 46 (1–3): 389–422. doi:10.1023/A:1012487302797 (https://doi.org/10.1023%2FA%3A
1012487302797).
17. Bach, Francis R (2008). Bolasso: model consistent lasso estimation through the bootstrap.
Proceedings of the 25th International Conference on Machine Learning. pp. 33–40.
doi:10.1145/1390156.1390161 (https://doi.org/10.1145%2F1390156.1390161).
ISBN 9781605582054. S2CID 609778 (https://api.semanticscholar.org/CorpusID:609778).
18. Zare, Habil (2013). "Scoring relevancy of features based on combinatorial analysis of Lasso
with application to lymphoma diagnosis" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35498
10). BMC Genomics. 14 (Suppl 1): S14. doi:10.1186/1471-2164-14-S1-S14 (https://doi.org/10.1
186%2F1471-2164-14-S1-S14). PMC 3549810 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC
3549810). PMID 23369194 (https://pubmed.ncbi.nlm.nih.gov/23369194).
19. Kai Han; Yunhe Wang; Chao Zhang; Chao Li; Chao Xu (2018). Autoencoder inspired
unsupervised feature selection. IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP).
20. Hazimeh, Hussein; Mazumder, Rahul; Saab, Ali (2020). "Sparse Regression at Scale: Branch-
and-Bound rooted in First-Order Optimization". arXiv:2004.06152 (https://arxiv.org/abs/2004.061
52) [stat.CO (https://arxiv.org/archive/stat.CO)].
21. Soufan, Othman; Kleftogiannis, Dimitrios; Kalnis, Panos; Bajic, Vladimir B. (2015-02-26).
"DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm" (https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC4342225). PLOS ONE. 10 (2): e0117988.
Bibcode:2015PLoSO..1017988S (https://ui.adsabs.harvard.edu/abs/2015PLoSO..1017988S).
doi:10.1371/journal.pone.0117988 (https://doi.org/10.1371%2Fjournal.pone.0117988).
ISSN 1932-6203 (https://www.worldcat.org/issn/1932-6203). PMC 4342225 (https://www.ncbi.nl
m.nih.gov/pmc/articles/PMC4342225). PMID 25719748 (https://pubmed.ncbi.nlm.nih.gov/25719
748).
22. Figueroa, Alejandro (2015). "Exploring effective features for recognizing the user intent behind
web queries" (https://www.researchgate.net/publication/271911317). Computers in Industry. 68:
162–169. doi:10.1016/j.compind.2015.01.005 (https://doi.org/10.1016%2Fj.compind.2015.01.00
5).
23. Figueroa, Alejandro; Guenter Neumann (2013). Learning to Rank Effective Paraphrases from
Query Logs for Community Question Answering (https://www.researchgate.net/publication/2591
74469). AAAI.
24. Figueroa, Alejandro; Guenter Neumann (2014). "Category-specific models for ranking effective
paraphrases in community Question Answering" (https://www.researchgate.net/publication/260
519271). Expert Systems with Applications. 41 (10): 4730–4742.
doi:10.1016/j.eswa.2014.02.004 (https://doi.org/10.1016%2Fj.eswa.2014.02.004).
hdl:10533/196878 (https://hdl.handle.net/10533%2F196878).
25. Zhang, Y.; Wang, S.; Phillips, P. (2014). "Binary PSO with Mutation Operator for Feature
Selection using Decision Tree applied to Spam Detection". Knowledge-Based Systems. 64:
22–31. doi:10.1016/j.knosys.2014.03.015 (https://doi.org/10.1016%2Fj.knosys.2014.03.015).
26. F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving
feature subset selection problem by a Parallel Scatter Search (https://pdfs.semanticscholar.org/
ea5d/770e97b9330032e8713b0c105b523750a7c3.pdf), European Journal of Operational
Research, vol. 169, no. 2, pp. 477–489, 2006.
27. García-Torres, Miguel; Gómez-Vela, Francisco; Divina, Federico; Pinto-Roa, Diego P.; Noguera,
José Luis Vázquez; Román, Julio C. Mello (2021). "Scatter search for high-dimensional feature
selection using feature grouping" (https://dl.acm.org/doi/abs/10.1145/3449726.3459481).
Proceedings of the Genetic and Evolutionary Computation Conference Companion. pp. 149–
150. doi:10.1145/3449726.3459481 (https://doi.org/10.1145%2F3449726.3459481).
ISBN 9781450383516. S2CID 235770316 (https://api.semanticscholar.org/CorpusID:23577031
6).
28. F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving
Feature Subset Selection Problem by a Hybrid Metaheuristic (https://web.archive.org/web/2019
0830132140/https://pdfs.semanticscholar.org/9428/2985d2c2ea4eb9f49846bedc12003a47db4
9.pdf). In First International Workshop on Hybrid Metaheuristics, pp. 59–68, 2004.
29. M. Garcia-Torres, F. Gomez-Vela, B. Melian, J.M. Moreno-Vega. High-dimensional feature
selection via feature grouping: A Variable Neighborhood Search approach (https://www.researc
hgate.net/profile/Miguel_Garcia_Torres/publication/229763203_Parallel_Scatter_Search/links/
5b2788a00f7e9be8bdaeb0d0/Parallel-Scatter-Search.pdf), Information Sciences, vol. 326, pp.
102-118, 2016.
30. Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G; Grassberger, Peter (2003).
"Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039 (https://arxiv.org/ab
s/q-bio/0311039). Bibcode:2003q.bio....11039K (https://ui.adsabs.harvard.edu/abs/2003q.bio....
11039K).
31. Akaike, H. (1985), "Prediction and entropy", in Atkinson, A. C.; Fienberg, S. E. (eds.), A
Celebration of Statistics (https://apps.dtic.mil/dtic/tr/fulltext/u2/a120956.pdf) (PDF), Springer,
pp. 1–24, archived (https://web.archive.org/web/20190830132141/https://apps.dtic.mil/dtic/tr/fullt
ext/u2/a120956.pdf) (PDF) from the original on August 30, 2019.
32. Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A practical
information-theoretic approach (https://books.google.com/books?id=fT1Iu-h6E-oC) (2nd ed.),
Springer-Verlag, ISBN 9780387953649.
33. Einicke, G. A. (2018). "Maximum-Entropy Rate Selection of Features for Classifying Changes in
Knee and Ankle Dynamics During Running". IEEE Journal of Biomedical and Health
Informatics. 28 (4): 1097–1103. doi:10.1109/JBHI.2017.2711487 (https://doi.org/10.1109%2FJB
HI.2017.2711487). PMID 29969403 (https://pubmed.ncbi.nlm.nih.gov/29969403).
S2CID 49555941 (https://api.semanticscholar.org/CorpusID:49555941).
34. Aliferis, Constantin (2010). "Local causal and markov blanket induction for causal discovery
and feature selection for classification part I: Algorithms and empirical evaluation" (http://jmlr.or
g/papers/volume11/aliferis10a/aliferis10a.pdf) (PDF). Journal of Machine Learning Research.
11: 171–234.
35. Brown, Gavin; Pocock, Adam; Zhao, Ming-Jie; Luján, Mikel (2012). "Conditional Likelihood
Maximisation: A Unifying Framework for Information Theoretic Feature Selection" (http://dl.acm.
org/citation.cfm?id=2188385.2188387). Journal of Machine Learning Research. 13: 27–66.[1] (h
ttp://www.jmlr.org/papers/volume13/brown12a/brown12a.pdf)
36. Peng, H. C.; Long, F.; Ding, C. (2005). "Feature selection based on mutual information: criteria
of max-dependency, max-relevance, and min-redundancy". IEEE Transactions on Pattern
Analysis and Machine Intelligence. 27 (8): 1226–1238. CiteSeerX 10.1.1.63.5765 (https://citese
erx.ist.psu.edu/viewdoc/summary?doi=10.1.1.63.5765). doi:10.1109/TPAMI.2005.159 (https://do
i.org/10.1109%2FTPAMI.2005.159). PMID 16119262 (https://pubmed.ncbi.nlm.nih.gov/1611926
2). S2CID 206764015 (https://api.semanticscholar.org/CorpusID:206764015). Program (http://ho
me.penglab.com/proj/mRMR/index.htm)
37. Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for
Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul,
Turkey. [2] (https://www.researchgate.net/publication/220928649_Towards_a_Generic_Feature-
Selection_Measure_for_Intrusion_Detection?ev=prf_pub)
38. Rodriguez-Lujan, I.; Huerta, R.; Elkan, C.; Santa Cruz, C. (2010). "Quadratic programming
feature selection" (http://jmlr.csail.mit.edu/papers/volume11/rodriguez-lujan10a/rodriguez-lujan1
0a.pdf) (PDF). JMLR. 11: 1491–1516.
39. Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global
Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New
York City, 2014. "[3] (http://people.eng.unimelb.edu.au/baileyj/papers/frp0038-Vinh.pdf)"
40. Yang, Howard Hua; Moody, John (2000). "Data visualization and feature selection: New
algorithms for nongaussian data" (https://papers.nips.cc/paper/1779-data-visualization-and-feat
ure-selection-new-algorithms-for-nongaussian-data.pdf) (PDF). Advances in Neural Information
Processing Systems: 687–693.
41. Yamada, M.; Jitkrittum, W.; Sigal, L.; Xing, E. P.; Sugiyama, M. (2014). "High-Dimensional
Feature Selection by Feature-Wise Non-Linear Lasso". Neural Computation. 26 (1): 185–207.
arXiv:1202.0515 (https://arxiv.org/abs/1202.0515). doi:10.1162/NECO_a_00537 (https://doi.org/
10.1162%2FNECO_a_00537). PMID 24102126 (https://pubmed.ncbi.nlm.nih.gov/24102126).
42. Hall, M. (1999). Correlation-based Feature Selection for Machine Learning (https://www.cs.waik
ato.ac.nz/~mhall/thesis.pdf) (PDF) (PhD thesis). University of Waikato.
43. Senliol, Baris; et al. (2008). "Fast Correlation Based Filter (FCBF) with a different search
strategy". 2008 23rd International Symposium on Computer and Information Sciences. pp. 1–4.
doi:10.1109/ISCIS.2008.4717949 (https://doi.org/10.1109%2FISCIS.2008.4717949). ISBN 978-
1-4244-2880-9. S2CID 8398495 (https://api.semanticscholar.org/CorpusID:8398495).
44. Nguyen, Hai; Franke, Katrin; Petrovic, Slobodan (December 2009). "Optimizing a class of
feature selection measures" (https://www.researchgate.net/publication/231175763).
Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning:
Submodularity, Sparsity & Polyhedra (DISCML). Vancouver, Canada.
45. H. Deng, G. Runger, "Feature Selection via Regularized Trees (https://arxiv.org/abs/1201.158
7)", Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE,
2012
46. RRF: Regularized Random Forest (https://cran.r-project.org/web/packages/RRF/index.html), R
package on CRAN
47. Hamon, Julie (November 2013). Optimisation combinatoire pour la sélection de variables en
régression en grande dimension: Application en génétique animale (https://tel.archives-ouverte
s.fr/tel-00920205) (Thesis) (in French). Lille University of Science and Technology.
48. Yu, Lei; Liu, Huan (August 2003). "Feature selection for high-dimensional data: a fast
correlation-based filter solution" (https://www.aaai.org/Papers/ICML/2003/ICML03-111.pdf)
(PDF). ICML'03: Proceedings of the Twentieth International Conference on International
Conference on Machine Learning: 856–863.
49. T. M. Phuong, Z. Lin et R. B. Altman. Choosing SNPs using feature selection. (http://htsnp.stanfo
rd.edu/FSFS/TaggingSNP.pdf) Archived (https://web.archive.org/web/20160913211229/http://ht
snp.stanford.edu/FSFS/TaggingSNP.pdf) 2016-09-13 at the Wayback Machine Proceedings /
IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems
Bioinformatics Conference, pages 301-309, 2005. PMID 16447987 (https://pubmed.ncbi.nlm.ni
h.gov/16447987).
50. Saghapour, E.; Kermani, S.; Sehhati, M. (2017). "A novel feature ranking method for prediction
of cancer stages using proteomics data" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC56082
17). PLOS ONE. 12 (9): e0184203. Bibcode:2017PLoSO..1284203S (https://ui.adsabs.harvard.
edu/abs/2017PLoSO..1284203S). doi:10.1371/journal.pone.0184203 (https://doi.org/10.1371%
2Fjournal.pone.0184203). PMC 5608217 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5608
51. Shah, S. C.; Kusiak, A. (2004). "Data mining and genetic algorithm based gene/SNP selection".
Artificial Intelligence in Medicine. 31 (3): 183–196. doi:10.1016/j.artmed.2004.04.002 (https://do
i.org/10.1016%2Fj.artmed.2004.04.002). PMID 15302085 (https://pubmed.ncbi.nlm.nih.gov/153
02085).
52. Long, N.; Gianola, D.; Weigel, K. A (2011). "Dimension reduction and variable selection for
genomic selection: application to predicting milk yield in Holsteins". Journal of Animal Breeding
and Genetics. 128 (4): 247–257. doi:10.1111/j.1439-0388.2011.00917.x (https://doi.org/10.111
1%2Fj.1439-0388.2011.00917.x). PMID 21749471 (https://pubmed.ncbi.nlm.nih.gov/2174947
1).
53. Üstünkar, Gürkan; Özöğür-Akyüz, Süreyya; Weber, Gerhard W.; Friedrich, Christoph M.; Aydın
Son, Yeşim (2012). "Selection of representative SNP sets for genome-wide association studies:
A metaheuristic approach". Optimization Letters. 6 (6): 1207–1218. doi:10.1007/s11590-011-
0419-7 (https://doi.org/10.1007%2Fs11590-011-0419-7). S2CID 8075318 (https://api.semantics
cholar.org/CorpusID:8075318).
54. Meiri, R.; Zahavi, J. (2006). "Using simulated annealing to optimize the feature selection
problem in marketing applications". European Journal of Operational Research. 171 (3): 842–
858. doi:10.1016/j.ejor.2004.09.010 (https://doi.org/10.1016%2Fj.ejor.2004.09.010).
55. Kapetanios, G. (2007). "Variable Selection in Regression Models using Nonstandard
Optimisation of Information Criteria". Computational Statistics & Data Analysis. 52 (1): 4–15.
doi:10.1016/j.csda.2007.04.006 (https://doi.org/10.1016%2Fj.csda.2007.04.006).
56. Broadhurst, D.; Goodacre, R.; Jones, A.; Rowland, J. J.; Kell, D. B. (1997). "Genetic algorithms
as a method for variable selection in multiple linear regression and partial least squares
regression, with applications to pyrolysis mass spectrometry". Analytica Chimica Acta. 348 (1–
3): 71–86. doi:10.1016/S0003-2670(97)00065-2 (https://doi.org/10.1016%2FS0003-2670%289
7%2900065-2).
57. Chuang, L.-Y.; Yang, C.-H. (2009). "Tabu search and binary particle swarm optimization for
feature selection using microarray data". Journal of Computational Biology. 16 (12): 1689–
1703. doi:10.1089/cmb.2007.0211 (https://doi.org/10.1089%2Fcmb.2007.0211).
PMID 20047491 (https://pubmed.ncbi.nlm.nih.gov/20047491).
58. E. Alba, J. Garia-Nieto, L. Jourdan et E.-G. Talbi. Gene Selection in Cancer Classification using
PSO-SVM and GA-SVM Hybrid Algorithms. (http://neo.lcc.uma.es/presentacionesCongreso/JM
cec2007.pdf) Congress on Evolutionary Computation, Singapor: Singapore (2007), 2007
59. B. Duval, J.-K. Hao et J. C. Hernandez Hernandez. A memetic algorithm for gene selection and
molecular classification of an cancer. (http://www.info.univ-angers.fr/pub/hao/papers/GECCO09.
pdf) In Proceedings of the 11th Annual conference on Genetic and evolutionary computation,
GECCO '09, pages 201-208, New York, NY, USA, 2009. ACM.
60. C. Hans, A. Dobra et M. West. Shotgun stochastic search for 'large p' regression (https://www.re
searchgate.net/profile/Adrian_Dobra/publication/228388856_Shotgun_Stochastic_Search_for_
Large_p_Regression/links/02bfe5125185997d06000000.pdf). Journal of the American
Statistical Association, 2007.
61. Aitken, S. (2005). "Feature selection and classification for microarray data analysis:
Evolutionary methods for identifying predictive genes" (https://www.ncbi.nlm.nih.gov/pmc/article
s/PMC1181625). BMC Bioinformatics. 6 (1): 148. doi:10.1186/1471-2105-6-148 (https://doi.org/
10.1186%2F1471-2105-6-148). PMC 1181625 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC
62. Oh, I. S.; Moon, B. R. (2004). "Hybrid genetic algorithms for feature selection". IEEE
Transactions on Pattern Analysis and Machine Intelligence. 26 (11): 1424–1437.
CiteSeerX 10.1.1.467.4179 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.467.417
9). doi:10.1109/tpami.2004.105 (https://doi.org/10.1109%2Ftpami.2004.105). PMID 15521491
63. Xuan, P.; Guo, M. Z.; Wang, J.; Liu, X. Y.; Liu, Y. (2011). "Genetic algorithm-based efficient
feature selection for classification of pre-miRNAs" (https://doi.org/10.4238%2Fvol10-2gmr969).
Genetics and Molecular Research. 10 (2): 588–603. doi:10.4238/vol10-2gmr969 (https://doi.org/
10.4238%2Fvol10-2gmr969). PMID 21491369 (https://pubmed.ncbi.nlm.nih.gov/21491369).
64. Peng, S. (2003). "Molecular classification of cancer types from microarray data using the
combination of genetic algorithms and support vector machines" (https://doi.org/10.1016%2Fs0
014-5793%2803%2901275-4). FEBS Letters. 555 (2): 358–362. doi:10.1016/s0014-
5793(03)01275-4 (https://doi.org/10.1016%2Fs0014-5793%2803%2901275-4).
PMID 14644442 (https://pubmed.ncbi.nlm.nih.gov/14644442).
65. Hernandez, J. C. H.; Duval, B.; Hao, J.-K. (2007). "A Genetic Embedded Approach for Gene
Selection and Classification of Microarray Data". Evolutionary Computation,Machine Learning
and Data Mining in Bioinformatics. EvoBIO 2007. Lecture Notes in Computer Science.
Vol. 4447. Berlin: Springer Verlag. pp. 90–101. doi:10.1007/978-3-540-71783-6_9 (https://doi.or
g/10.1007%2F978-3-540-71783-6_9). ISBN 978-3-540-71782-9.
66. Huerta, E. B.; Duval, B.; Hao, J.-K. (2006). "A Hybrid GA/SVM Approach for Gene Selection and
Classification of Microarray Data". Applications of Evolutionary Computing. EvoWorkshops
2006. Lecture Notes in Computer Science. Vol. 3907. pp. 34–44. doi:10.1007/11732242_4 (http
s://doi.org/10.1007%2F11732242_4). ISBN 978-3-540-33237-4.
67. Muni, D. P.; Pal, N. R.; Das, J. (2006). "Genetic programming for simultaneous feature selection
and classifier design". IEEE Transactions on Systems, Man, and Cybernetics - Part B:
Cybernetics. 36 (1): 106–117. doi:10.1109/TSMCB.2005.854499 (https://doi.org/10.1109%2FT
SMCB.2005.854499). PMID 16468570 (https://pubmed.ncbi.nlm.nih.gov/16468570).
68. Jourdan, L.; Dhaenens, C.; Talbi, E.-G. (2005). "Linkage disequilibrium study with a parallel
adaptive GA". International Journal of Foundations of Computer Science. 16 (2): 241–260.
doi:10.1142/S0129054105002978 (https://doi.org/10.1142%2FS0129054105002978).
69. Zhang, Y.; Dong, Z.; Phillips, P.; Wang, S. (2015). "Detection of subjects and brain regions
related to Alzheimer's disease using 3D MRI scans based on eigenbrain and machine learning"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4451357). Frontiers in Computational
Neuroscience. 9: 66. doi:10.3389/fncom.2015.00066 (https://doi.org/10.3389%2Ffncom.2015.00
066). PMC 4451357 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4451357). PMID 26082713
70. Roffo, G.; Melzi, S.; Cristani, M. (2015-12-01). "Infinite Feature Selection". 2015 IEEE
International Conference on Computer Vision (ICCV). pp. 4202–4210.
doi:10.1109/ICCV.2015.478 (https://doi.org/10.1109%2FICCV.2015.478). ISBN 978-1-4673-
8391-2. S2CID 3223980 (https://api.semanticscholar.org/CorpusID:3223980).
71. Roffo, Giorgio; Melzi, Simone (September 2016). "Features Selection via Eigenvector
Centrality" (http://www.di.uniba.it/~loglisci/NFmcp2016/NFmcp2016_paper_13.pdf) (PDF).
NFmcp2016. Retrieved 12 November 2016.
72. R. Kohavi and G. John, "Wrappers for feature subset selection (https://ai.stanford.edu/~ronnyk/w
rappersPrint.pdf)", Artificial intelligence 97.1-2 (1997): 273-324
73. Das, Abhimanyu; Kempe, David (2011). "Submodular meets Spectral: Greedy Algorithms for
Subset Selection, Sparse Approximation and Dictionary Selection". arXiv:1102.3975 (https://arx
iv.org/abs/1102.3975) [stat.ML (https://arxiv.org/archive/stat.ML)].
74. Liu et al., Submodular feature selection for high-dimensional acoustic score spaces (http://melo
di.ee.washington.edu/~bilmes/mypubs/liu-submodfeature2013-icassp.pdf) Archived (https://we
b.archive.org/web/20151017122628/http://melodi.ee.washington.edu/~bilmes/mypubs/liu-subm
odfeature2013-icassp.pdf) 2015-10-17 at the Wayback Machine
75. Zheng et al., Submodular Attribute Selection for Action Recognition in Video (http://papers.nips.
cc/paper/5565-deep-convolutional-neural-network-for-image-deconvolution) Archived (https://w
eb.archive.org/web/20151118001059/http://papers.nips.cc/paper/5565-deep-convolutional-neur
al-network-for-image-deconvolution) 2015-11-18 at the Wayback Machine
76. Sun, Y.; Todorovic, S.; Goodison, S. (2010). "Local-Learning-Based Feature Selection for High-
Dimensional Data Analysis" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445441). IEEE
Transactions on Pattern Analysis and Machine Intelligence. 32 (9): 1610–1626.
doi:10.1109/tpami.2009.190 (https://doi.org/10.1109%2Ftpami.2009.190). PMC 3445441 (http
s://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445441). PMID 20634556 (https://pubmed.ncbi.nl
m.nih.gov/20634556).
77. D.H. Wang, Y.C. Liang, D.Xu, X.Y. Feng, R.C. Guan(2018), "A content-based recommender
system for computer science publications (https://www.sciencedirect.com/science/article/pii/S09
50705118302107)", Knowledge-Based Systems, 157: 1-9
Further reading
Guyon, Isabelle; Elisseeff, Andre (2003). "An Introduction to Variable and Feature Selection" (htt
p://www.jmlr.org/papers/v3/guyon03a.html). Journal of Machine Learning Research. 3: 1157–
1182.
Harrell, F. (2001). Regression Modeling Strategies. Springer. ISBN 0-387-95232-2.
Liu, Huan; Motoda, Hiroshi (1998). Feature Selection for Knowledge Discovery and Data
Mining (https://books.google.com/books?id=aaDbBwAAQBAJ). Springer. ISBN 0-7923-8198-X.
Liu, Huan; Yu, Lei (2005). "Toward Integrating Feature Selection Algorithms for Classification
and Clustering". IEEE Transactions on Knowledge and Data Engineering. 17 (4): 491–502.
doi:10.1109/TKDE.2005.66 (https://doi.org/10.1109%2FTKDE.2005.66). S2CID 1607600 (http
s://api.semanticscholar.org/CorpusID:1607600).
External links
Feature Selection Package, Arizona State University (Matlab Code) (https://web.archive.org/we
b/20120511162342/http://featureselection.asu.edu/software.php)
NIPS challenge 2003 (http://www.clopinet.com/isabelle/Projects/NIPS2003/) (see also NIPS)
Naive Bayes implementation with feature selection in Visual Basic (http://paul.luminos.nl/docu
ments/show_document.php?d=198) (includes executable and source code)
Minimum-redundancy-maximum-relevance (mRMR) feature selection program (http://home.pen
glab.com/proj/mRMR/index.htm)
FEAST (http://mloss.org/software/view/386/) (Open source Feature Selection algorithms in C
and MATLAB)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Feature_selection&oldid=1166744703"

Feature Selection

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Feature Selection

Uploaded by

Copyright:

Available Formats

Feature selection

simplification of models to make them easier to interpret by researchers/users,[1]

Search approaches include:

Other available filter metrics include:

used to select the most relevant subset of features.[33]

Information Theory Based Feature Selection Mechanisms

5. Repeat 3. and 4. until a certain number of features is selected (e.g. )

Minimum-redundancy-maximum-relevance (mRMR) feature selection

Quadratic programming feature selection

where is the vector of feature relevancy assuming there are n features in

Conditional mutual information

Joint mutual information

Hilbert-Schmidt Independence Criterion Lasso based feature

where is a kernel-based independence measure called the (empirical) Hilbert-

are Gram matrices, and are kernel functions, is the

The HSIC Lasso can be written as

Correlation feature selection

Overview on metaheuristics methods

Filter type methods select variables regardless of the

Wrapper methods evaluate subsets of variables which

Embedded methods have been recently proposed that try

Application of feature selection metaheuristics

Feature Selection using Phuong

Filter + Predicted residual

Marketing Simulated annealing Wrapper Regression AIC, r2 Meiri 2006[54]

Simulated annealing, Kapetanios

Support Vector Classification

Genetic algorithm + Support Vector Classification Duval

Support Vector Classification Hernandez

Support Vector Jourdan

Alzheimer's Support vector Classification Zhang

Feature selection embedded in learning algorithms

-regularization techniques, such as sparse regression, LASSO, and -SVM

You might also like