
Feature Selection

Remember
[Diagram: Real World → Sensing (s) → Pre-processing (p) → Feature Extraction/Computation (x ∈ ℝⁿ) → Classification (c); n: number of features, c: class (scalar)]
•  This diagram represents the operation of the system.


•  Before operation, the classifier needs to be trained.
•  Before training, features need to be selected.
•  Feature selection is different from feature extraction (or feature use).
•  Feature selection is different from feature design/creation.

Dataset: EMG Signals
Gesture detection & classification using EMG (electromyography) signals: 8 forearm signals, 7 gestures to be recognized.
●  Measurements of 8 forearm signals using electromyography (EMG), from people performing different gestures with their hands.
●  There are 7 different gestures, but one is discarded due to a lack of data.
●  Contains measurements from 36 people, 2 measurements per person. The last 8 are kept as the test set.
●  Training data: 3317739 rows x 8 columns.

Dataset: EMG Signals

Possible features in each signal, after defining an analysis window:
-  Mean
-  Variance
-  Minimum
-  Maximum
-  Range
-  Square Mean
-  Skew
-  Kurtosis

8 features per signal and 8 signals means 64 features.
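As a hedged illustration of how these per-window features might be computed (assuming NumPy and SciPy are available; the function names are illustrative, not from the course material):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def window_features(window):
    """The 8 features listed above, for one analysis window of one EMG channel."""
    w = np.asarray(window, dtype=float)
    return np.array([
        np.mean(w),              # mean
        np.var(w),               # variance
        np.min(w),               # minimum
        np.max(w),               # maximum
        np.max(w) - np.min(w),   # range
        np.mean(w ** 2),         # square mean
        skew(w),                 # skew
        kurtosis(w),             # kurtosis
    ])

def window_feature_vector(channels):
    """8 channels x 8 features -> one 64-feature vector per analysis window."""
    return np.concatenate([window_features(c) for c in channels])
```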


Can we reduce this number?
Are these features the right ones?

Feature Selection for Classification


Main focus:
•  to select a subset of relevant features
•  to achieve good classification accuracy.
Thus: relevancy -> correct prediction.

Why can’t we use the full original feature set?


•  too computationally expensive to examine all features.
•  not necessary to include all features
(i.e., irrelevant features provide no further information).

Feature/Attribute Selection
•  Selection based on attribute characteristics, e.g. variability or
correlation (filter strategy)
•  Selection using a classifier (wrapper strategy)
•  Embedded methods (classification algorithm may have a built-in
feature selection method)
•  Dimensionality reduction

Image from: https://moredvikas.wordpress.com/2018/10/09/machine-learning-introduction-to-feature-selection-variable-selection-or-attribute-selection-or-dimensionality-reduction/
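A hedged scikit-learn sketch of the four routes above (assuming scikit-learn is available; the estimators, thresholds and placeholder data are illustrative choices, not the methods prescribed here):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # placeholder: 200 windows x 64 features
y = rng.integers(0, 6, size=200)      # placeholder gesture labels

# Filter: keep features based on an attribute characteristic (here, variance).
X_filter = VarianceThreshold(threshold=0.1).fit_transform(X)

# Wrapper: recursive feature elimination guided by a classifier.
X_wrapper = RFE(RandomForestClassifier(), n_features_to_select=16).fit_transform(X, y)

# Embedded: the classifier itself reports feature importances.
importances = RandomForestClassifier().fit(X, y).feature_importances_

# Dimensionality reduction: new features as mixtures of the original ones.
X_pca = PCA(n_components=8).fit_transform(X)
```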

Filter Strategy
Selection based on attribute characteristics, e.g. variability or
correlation (filter strategy)

Filter Strategy
Dependency measure
•  correlation between a feature and a class label.
•  how closely is the feature related to the outcome of the class label?
•  dependence between features = degree of redundancy.
- if a feature is heavily dependent on another, then it is redundant.
But take care: high correlation between variables does not imply the absence of variable complementarity.

Example of how to measure the correlation between a feature and a class label (Pearson correlation):
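A minimal NumPy sketch of this ranking criterion, assuming the class label is numerically encoded; the function name is illustrative:

```python
import numpy as np

def pearson_ranking(X, y):
    """Rank features by the absolute Pearson correlation with the class label.
    X: (n_samples, n_features) feature matrix, y: (n_samples,) numeric labels."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]      # most correlated feature first
    return order, scores
```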

Filter Strategy
Information measure
•  entropy - a measurement of information content.
•  information gain of a feature A (e.g., in the induction of a decision tree):
   gain(A) = I(p, n) - E(A)
   gain(A) = information before branching on A - sum of the information of the nodes after branching.
•  select A over B if gain(A) > gain(B).

Example of how to measure the discrete mutual information between a feature and a class (from Guyon & Elisseeff [1]):

The criteria described in this section extend to the case of binary variables. Forman (2003) presents in this issue an extensive study of such criteria for binary variables with applications in text classification.

2.4 Information Theoretic Ranking Criteria

Several approaches to the variable selection problem using information theoretic criteria have been proposed (as reviewed in this issue by Bekkerman et al., 2003, Dhillon et al., 2003, Forman, 2003, Torkkola, 2003). Many rely on empirical estimates of the mutual information between each variable and the target:

$$ I(i) = \int_{x_i} \int_{y} p(x_i, y)\, \log \frac{p(x_i, y)}{p(x_i)\, p(y)}\; dx\, dy , \qquad (3) $$

where p(x_i) and p(y) are the probability densities of x_i and y, and p(x_i, y) is the joint density. The criterion I(i) is a measure of dependency between the density of variable x_i and the density of the target y.

The difficulty is that the densities p(x_i), p(y) and p(x_i, y) are all unknown and are hard to estimate from data. The case of discrete or nominal variables is probably easiest because the integral becomes a sum:

$$ I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y)\, \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\, P(Y = y)} . \qquad (4) $$

The probabilities are then estimated from frequency counts. For example, in a three-class problem, if a variable takes 4 values, P(Y = y) represents the class prior probabilities (3 frequency counts), P(X = x_i) represents the distribution of the input variable (4 frequency counts), and P(X = x_i, Y = y) is the probability of the joint observations (12 frequency counts). The estimation obviously becomes harder with larger numbers of classes and variable values.

The case of continuous variables (and possibly continuous targets) is the hardest. One can consider discretizing the variables or approximating their densities with a non-parametric method such as Parzen windows (see, e.g., Torkkola, 2003). Using the normal distribution to estimate densities would bring us back to estimating the covariance between X_i and Y, thus giving us a criterion similar to a correlation coefficient.
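A minimal NumPy sketch of equation (4), estimating the probabilities from frequency counts as described above; the function name is illustrative:

```python
import numpy as np

def mutual_information(x, y):
    """Discrete mutual information between one feature x and the class y,
    I = sum_{x,y} P(x,y) * log( P(x,y) / (P(x) P(y)) ), from frequency counts."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint frequency
            p_x = np.mean(x == xv)                 # feature value frequency
            p_y = np.mean(y == yv)                 # class prior
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```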
Filter Strategy
Consistency measure
•  two instances are inconsistent if they have matching feature values
but fall under different class labels.

             f1   f2   class
instance 1   a    b    c1
instance 2   a    b    c2    <- inconsistent
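A hedged sketch of this check, assuming the instances sit in a pandas DataFrame; the column names and helper name are illustrative:

```python
import pandas as pd

def inconsistent_groups(df, feature_cols, class_col):
    """Groups of instances with identical feature values but more than one class label."""
    n_labels = df.groupby(feature_cols)[class_col].nunique()
    return n_labels[n_labels > 1]

# Example with the table above:
df = pd.DataFrame({"f1": ["a", "a"], "f2": ["b", "b"], "class": ["c1", "c2"]})
print(inconsistent_groups(df, ["f1", "f2"], "class"))   # (a, b) is inconsistent
```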

Wrapper Strategy

•  “It consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables” [1].

•  “In forward selection, variables are progressively incorporated into larger and larger subsets, whereas in backward elimination one starts with the set of all variables and progressively eliminates the least promising ones” [1].

[1] Guyon & Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003) 1157-1182.
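A sketch of forward selection / backward elimination in practice, assuming scikit-learn is available; the estimator, the 8-feature budget and the synthetic placeholder data are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))       # placeholder: 200 windows x 64 features
y = rng.integers(0, 6, size=200)     # placeholder gesture labels

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=8,     # assumed budget
    direction="forward",        # "backward" starts from all features and removes
    cv=3,
)
X_selected = selector.fit_transform(X, y)
chosen = np.flatnonzero(selector.get_support())   # indices of the kept features
```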

Wrapper Strategy
Four main steps in a feature selection method.

[Flow diagram: Original feature set → Generation → subset of features → Evaluation → Stopping criterion; if the criterion is not met (no), the process returns to Generation; if it is met (yes), the selected subset of features goes to Validation.]

Generation = select a candidate feature subset.
Evaluation = evaluate the candidate subset (e.g., using the classification error).
Stopping criterion = determine whether the current subset is relevant, i.e., whether the search should stop.
Validation = verify the validity of the selected subset.


Wrapper Strategy
Generation
•  select a candidate subset of features for evaluation.
•  Start = no features, all features, or a random feature subset.
•  Subsequent steps = add, remove, or add/remove features.
•  feature selection methods can be categorised by how they generate candidate feature subsets.
•  3 ways in which the feature space can be examined:
- Complete
- Heuristic
- Random

Wrapper Strategy
Complete/exhaustive
•  examine all combinations of features:
{f1,f2,f3} => { {f1},{f2},{f3},{f1,f2},{f1,f3},{f2,f3},{f1,f2,f3} }
•  the order of the search space is O(2^p), where p = number of features.
•  the optimal subset is achievable.
•  too expensive if the feature space is large (see the enumeration sketch below).
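A small sketch of the exhaustive enumeration (standard library only); it reproduces the {f1,f2,f3} example above:

```python
from itertools import combinations

def all_subsets(features):
    """Every non-empty subset of the feature set: 2^p - 1 candidates."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)

print(list(all_subsets(["f1", "f2", "f3"])))   # 7 candidate subsets
```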

Heuristic
•  selection is directed by a certain guideline
- e.g., selected features are removed one at a time; not every combination of features is tried.
- candidates = { {f1,f2,f3}, {f2,f3}, {f3} }
•  incremental generation of subsets (a greedy forward-selection sketch follows below).
•  the search space is smaller and results are produced faster.
•  may miss features involved in high-order relations (the parity problem).
- some relevant feature subsets may be omitted, e.g. {f1,f2}.
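A greedy forward-selection sketch of the heuristic idea, assuming scikit-learn is available; the estimator and the n_keep budget are illustrative assumptions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, n_keep, estimator=None, cv=3):
    """At each step add the feature whose inclusion gives the best
    cross-validated accuracy, until n_keep features have been selected."""
    estimator = estimator or KNeighborsClassifier()
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        best_score, best_j = max(
            (cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean(), j)
            for j in remaining
        )
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```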

Wrapper Strategy
Random
•  no predefined way to select feature candidates.
•  features are picked at random (i.e., a probabilistic approach).
•  finding the optimal subset depends on the number of tries,
- which in turn relies on the available resources.
•  requires more user-defined input parameters.
- the optimality of the result will depend on how these parameters are defined.
- e.g., the number of tries (a random-search sketch follows below).
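A sketch of random generation of candidates, assuming scikit-learn is available; the number of tries and the estimator are illustrative user-defined parameters:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def random_subset_search(X, y, n_tries=50, estimator=None, cv=3, seed=0):
    """Evaluate randomly drawn feature subsets and keep the best one found.
    The quality of the result depends on n_tries (the available budget)."""
    rng = np.random.default_rng(seed)
    estimator = estimator or KNeighborsClassifier()
    best_score, best_subset = -np.inf, None
    p = X.shape[1]
    for _ in range(n_tries):
        size = rng.integers(1, p + 1)                      # random subset size
        subset = rng.choice(p, size=size, replace=False)   # random features
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```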

Wrapper Strategy
Recall the four main steps in a feature selection method: Generation → Evaluation → Stopping criterion → Validation (see the diagram above).

Dimensionality Reduction
•  Features/attributes can be correlated and therefore reduced
using algorithms such as PCA

•  New features are created, which correspond to mixtures of the original ones.

Image taken from https://towardsdatascience.com/feature-extraction-using-principal-component-analysis-a-simplified-visual-demo-e5592ced100a

PCA - Principal Component Analysis

PCA - Principal Component Analysis

Covariance matrix; eigenvalues and eigenvectors:

$$ C = U^T C' U $$

Representation of the input vectors in the new coordinates.
Projection of the input vectors onto the new coordinates.
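A minimal NumPy sketch of these steps (covariance matrix, eigendecomposition, projection onto the new coordinates); function and variable names are illustrative:

```python
import numpy as np

def pca_fit(X, m):
    """PCA via eigendecomposition of the covariance matrix.
    Returns the top-m axes U, all eigenvalues, and the projection of the
    (centered) input vectors onto the new coordinates."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]       # decreasing eigenvalue order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :m]                      # new coordinate axes
    Z = Xc @ U                              # projection onto the new coordinates
    return U, eigvals, Z
```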

PCA - Principal Component Analysis

The normalized Residual Mean Square Error (RMSE) can be used as a criterion for selecting the appropriate number of axes (the reduced space dimension) to be employed. Let m be the number of selected axes; the RMSE is then given by:

$$ \mathrm{RMSE}(m) = \frac{\sum_{j=m+1}^{n} \lambda_j}{\sum_{j=1}^{n} \lambda_j} $$
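A small sketch of using this criterion to choose m from the eigenvalues λ_j, assuming an illustrative 5% threshold:

```python
import numpy as np

def choose_m(eigvals, max_rmse=0.05):
    """Smallest m whose RMSE(m) = sum_{j>m} lambda_j / sum_j lambda_j <= max_rmse."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # decreasing order
    total = lam.sum()
    for m in range(1, len(lam) + 1):
        if lam[m:].sum() / total <= max_rmse:
            return m
    return len(lam)
```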

Remember:

PCA - Principal Component Analysis

Image taken from https://towardsdatascience.com/feature-extraction-using-principal-component-analysis-a-simplified-visual-demo-e5592ced100a
