Remember:
Pattern-recognition pipeline (figure): Real World → Sensing (s) → Preprocessing (p) → Feature Extraction/Computation → x ∈ ℝ^n → Classification → c
n: number of features; c: class (scalar)
Dataset: EMG Signals
Gesture Detection & Classification using EMG (electromyography) signals
● Measurements of 8 forearm signals using electromyography (EMG), from people performing different gestures with their hands.
● There are 7 different gestures to be recognized, but one is discarded for lack of data.
● Contains measurements from 36 people, 2 measurements per person. The last 8 are left out as the test set.
● Training data: 3317739 rows x 8 columns.
Feature/Attribute Selection
• Selection based on attribute characteristics, e.g. variability or
correlation (filter strategy)
• Selection using a classifier (wrapper strategy)
• Embedded methods (classification algorithm may have a built-in
feature selection method)
• Dimensionality reduction
Filter Strategy
Selection based on attribute characteristics, e.g. variability or
correlation (filter strategy)
Filter Strategy
Dependency measure
• correlation between a feature and a class label.
• how closely is the feature related to the outcome of the class label?
• dependence between features = degree of redundancy.
  - if a feature depends heavily on another, then it is redundant.
    But take care: high correlation between variables does not mean
    absence of variable complementarity.
Example of how to measure the correlation between a feature and a class label (Pearson correlation):
The criteria described in this section extend to the case of binary variables. Forman (2003)
presents in this issue an extensive study of such criteria for binary variables with applications in text
classification.
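The Pearson criterion can be sketched numerically. A minimal illustration (the function `pearson_score` and the toy arrays below are ours, not from the slides; it assumes numeric features and a numeric encoding of the class label):

```python
import numpy as np

def pearson_score(x, y):
    """Filter-strategy dependency measure: Pearson correlation between
    one feature column x and a numeric class label y,
    cov(X_i, Y) / sqrt(var(X_i) * var(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Toy data: feature f1 tracks the label, f2 is noise-like.
y  = np.array([0, 0, 0, 1, 1, 1])
f1 = np.array([0.1, 0.2, 0.0, 0.9, 1.1, 1.0])
f2 = np.array([0.5, 0.9, 0.1, 0.4, 0.8, 0.2])
print(pearson_score(f1, y) > pearson_score(f2, y))  # → True: f1 ranks higher
```

A filter would rank all features by |R(i)| and keep the top-scoring ones without ever training a classifier.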
Filter Strategy
Information measure
• entropy - measurement of information content.
• information gain of a feature (e.g. induction of a decision tree):
  gain(A) = I(p,n) - E(A)
  gain(A) = entropy before branching on A - weighted sum of the entropies of the nodes after branching
• select A if gain(A) > gain(B).

From Guyon & Elisseeff [1]:

"Several approaches to the variable selection problem using information theoretic criteria have been proposed (as reviewed in this issue by Bekkerman et al., 2003, Dhillon et al., 2003, Forman, 2003, Torkkola, 2003). Many rely on empirical estimates of the mutual information between each variable and the target:

  I(i) = ∫∫ p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ] dx dy ,   (3)

where p(x_i) and p(y) are the probability densities of x_i and y, and p(x_i, y) is the joint density. The criterion I(i) is a measure of dependency between the density of variable x_i and the density of the target y.

The difficulty is that the densities p(x_i), p(y) and p(x_i, y) are all unknown and are hard to estimate from data. The case of discrete or nominal variables is probably easiest because the integral becomes a sum:

  I(i) = Σ_{x_i} Σ_y P(X = x_i, Y = y) log [ P(X = x_i, Y = y) / (P(X = x_i) P(Y = y)) ] .   (4)

(Example of how to measure the discrete mutual information between a feature and a class.)

The probabilities are then estimated from frequency counts. For example, in a three-class problem, if a variable takes 4 values, P(Y = y) represents the class prior probabilities (3 frequency counts), P(X = x_i) represents the distribution of the input variable (4 frequency counts), and P(X = x_i, Y = y) is the probability of the joint observations (12 frequency counts). The estimation obviously becomes harder with larger numbers of classes and variable values.

The case of continuous variables (and possibly continuous targets) is the hardest. One can consider discretizing the variables or approximating their densities with a non-parametric method such as Parzen windows (see, e.g., Torkkola, 2003). Using the normal distribution to estimate densities would bring us back to estimating the covariance between X_i and Y, thus giving us a criterion similar to a correlation coefficient."
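The frequency-count estimate of the discrete mutual information (Eq. 4) is short to implement. A minimal sketch (the function `mutual_information` and the toy data are ours, for illustration; it returns values in nats):

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(i) between a discrete feature x and
    class labels y, with all probabilities estimated from frequency counts."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint frequency counts
    px = Counter(x)            # feature-value frequency counts
    py = Counter(y)            # class prior frequency counts
    mi = 0.0
    for (xi, yi), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((px[xi] / n) * (py[yi] / n)))
    return mi

# A feature identical to a balanced binary label carries I = log 2 nats.
print(mutual_information(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # → ≈ 0.693
```

An independent feature, e.g. `['a', 'b', 'a', 'b']` against the same labels, scores 0, so ranking by I(i) discards it.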
Filter Strategy
Consistency measure
• two instances are inconsistent if they have matching feature values
  but fall under different class labels.

             f1   f2   class
instance 1    a    b    c1    } inconsistent
instance 2    a    b    c2
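The consistency check amounts to grouping instances by their feature values and flagging groups that carry more than one class label. A minimal sketch (the function `inconsistent_groups` is ours, for illustration):

```python
from collections import defaultdict

def inconsistent_groups(rows, labels):
    """Return the feature-value tuples that appear under more than one
    class label, i.e. the inconsistent groups."""
    seen = defaultdict(set)
    for features, label in zip(rows, labels):
        seen[tuple(features)].add(label)
    return [f for f, classes in seen.items() if len(classes) > 1]

# The slide's example: both instances have (f1, f2) = (a, b) but different classes.
rows   = [('a', 'b'), ('a', 'b')]
labels = ['c1', 'c2']
print(inconsistent_groups(rows, labels))  # → [('a', 'b')]
```

A consistency-based filter would search for the smallest feature subset whose inconsistency count stays below a threshold.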
Wrapper Strategy
[1] Guyon&Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003) 1157-1182.
Wrapper Strategy
Four main steps in a feature selection method (figure):

Original feature set → Generation → subset of features → Evaluation → Stopping criterion:
  - no  → back to Generation (repeat the process)
  - yes → Validation → selected subset of features
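The generation → evaluation → stopping loop can be sketched as greedy forward selection. In the sketch below, `evaluate` stands in for a classifier's validation score (that is what makes it a wrapper), and `toy_eval` is a made-up scoring function for illustration only:

```python
def forward_selection(features, evaluate):
    """Greedy wrapper loop: Generation (try adding each remaining feature),
    Evaluation (score the candidate subset), Stopping criterion
    (stop when no candidate improves the best score)."""
    selected = []
    best = evaluate(tuple(selected))
    while True:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        score, f = max((evaluate(tuple(selected + [f])), f) for f in remaining)
        if score <= best:        # stopping criterion: no improvement
            break
        best, selected = score, selected + [f]
    return selected, best

# Toy evaluation: rewards f1 and f3, with a small penalty per feature.
def toy_eval(subset):
    return 0.5 + 0.2 * ('f1' in subset) + 0.3 * ('f3' in subset) - 0.05 * len(subset)

print(forward_selection(['f1', 'f2', 'f3'], toy_eval))
```

In a real wrapper, `evaluate` would train the classifier on the candidate subset and return its validation accuracy, which is what makes the strategy expensive.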
Wrapper Strategy
Generation
• select a candidate subset of features for evaluation.
• Start = no features, all features, or a random feature subset.
• Subsequent = add, remove, or both add and remove features.
• feature selection methods are categorised by how they generate feature
  subset candidates.
• 3 ways in which the feature space is examined:
- Complete
- Heuristic
- Random
Wrapper Strategy
Complete/exhaustive
• examine all combinations of feature subsets.
  {f1,f2,f3} => { {f1},{f2},{f3},{f1,f2},{f1,f3},{f2,f3},{f1,f2,f3} }
• order of the search space is O(2^p), p = # features.
• the optimal subset is achievable.
• too expensive if the feature space is large.
Heuristic
• selection is directed by a certain guideline
  - selected features are taken out one at a time; not all combinations are tried.
  - candidates = { {f1,f2,f3}, {f2,f3}, {f3} }
• incremental generation of subsets.
• the search space is smaller and results are produced faster.
• may miss features involved in high-order relations (parity problem).
  - some relevant feature subsets may be omitted, e.g. {f1,f2}.
Wrapper Strategy
Random
• no predefined way to select feature candidates.
• pick features at random (i.e. a probabilistic approach).
• the optimality of the subset depends on the number of tries
  - which in turn relies on the available resources.
• requires more user-defined input parameters.
  - result optimality will depend on how these parameters are defined.
  - e.g. the number of tries
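Random generation can be sketched in a few lines; `tries` is exactly the user-defined parameter the slide mentions. The function below is an illustration (including every feature with probability 0.5 per try is one common choice, not prescribed by the slides):

```python
import random

def random_generation(features, tries, rng=None):
    """Random generation: each try includes every feature independently
    with probability 0.5 — no predefined order over candidates."""
    rng = rng or random.Random()
    candidates = []
    for _ in range(tries):
        subset = tuple(f for f in features if rng.random() < 0.5)
        if subset:                       # skip the empty subset
            candidates.append(subset)
    return candidates

cands = random_generation(['f1', 'f2', 'f3'], tries=20, rng=random.Random(0))
print(all(set(c) <= {'f1', 'f2', 'f3'} for c in cands))  # → True
```

Each candidate would then go through the same evaluation and stopping-criterion steps as in the other generation schemes; more tries raise the chance of hitting a good subset at the cost of more classifier trainings.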
Dimensionality Reduction
• Features/attributes can be correlated and therefore reduced
using algorithms such as PCA
PCA - Principal Component Analysis
Covariance matrix: C = U^T C' U

Projection of the input vectors onto the new coordinates, and representation of the input vectors in the new coordinates (formulas in figure).

Reconstruction error when keeping only the first m principal components:

  RMSE(m) = ( Σ_{j=m+1}^{n} λ_j ) / ( Σ_{j=1}^{n} λ_j )
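The RMSE(m) formula can be checked numerically from the eigenvalues of the covariance matrix. A minimal sketch (the function `pca_rmse` and the synthetic data are ours, for illustration):

```python
import numpy as np

def pca_rmse(X, m):
    """PCA via eigendecomposition of the covariance matrix:
    RMSE(m) = sum of the eigenvalues beyond the first m, divided by
    the sum of all eigenvalues — the fraction of variance lost."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(C)[::-1]   # sort descending
    return eigvals[m:].sum() / eigvals.sum()

rng = np.random.default_rng(0)
# Two strongly correlated columns plus one independent column:
# two principal components capture almost all the variance.
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
print(pca_rmse(X, 2) < 1e-3)  # → True: the third eigenvalue is tiny
```

Choosing m is then a matter of picking the smallest m for which RMSE(m) falls below an acceptable error, which is how PCA reduces correlated features as described above.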