Problem setting for classification
• We have many variables recorded from different people (e.g. patients). How can we build a useful system that uses those variables as input and generates outputs that tell us to which class they belong (e.g. disease class, healthy class)?
https://doi.org/10.3389/fnbot.2017.00019
Features can be obtained in many different ways from processing the signals, for example:
• summary measures
• results from mathematical transforms or projections onto lower dimensions
• descriptors of underlying shape or structure
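As an illustration, here is a minimal sketch (with a synthetic signal and made-up feature choices, not prescribed by the lecture) of one feature of each kind, computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(500)

# 1) Summary measures
mean, std, peak = signal.mean(), signal.std(), np.abs(signal).max()

# 2) A mathematical transform: dominant frequency bin of the FFT magnitude
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum[1:]) + 1)  # skip the DC component

# 3) A simple descriptor of shape: number of zero crossings
zero_crossings = int(np.sum(np.diff(np.sign(signal)) != 0))

features = np.array([mean, std, peak, dominant_bin, zero_crossings])
print(features)
```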
Feature Selection
Reduce the feature set even further in size: select those features that are most relevant to the classification task at hand.
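One possible way to do this, sketched with scikit-learn's SelectKBest on synthetic data (the lecture does not prescribe a particular selection method):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(100, 20)          # 100 observations, 20 extracted features
y = np.random.randint(0, 2, 100)     # binary class labels

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 most relevant
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)               # (100, 5)
```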
Feature engineering
• This process (feature extraction and selection) is referred to as feature
engineering. It can be done manually or (semi)-automated. Often using a
combination of pre-existing knowledge, common sense, creativity, and trial and
error. It is application and data-source dependent.
• It is a critical step, since after doing this, the information that describes, e.g., your
patient, is solely covered by the features you chose. If they were chosen poorly,
you are stuck with that, and you get garbage-in garbage-out in your classifiers.
• The choice of features is often much more important than the flavour of AI/ML
method you use.
• In some neural-network research circles, feature engineering is sometimes said to be obsolete, since a neural network may ‘learn all needed information itself’. However, this requires lots of relevant example data and acceptance of black-box approaches. Image recognition and speech recognition are notable examples where this may work. In many higher-level healthcare applications where decisions are to be made, it is not an option.
07/11/2023 | 9
Combining features into Feature Vectors
• A combination of features is used to effectively describe an observation of raw data (a ’pattern’); the combination of features can be written as a feature vector (feature_1, feature_2, …, feature_n), e.g.
x = (heartrate, systolic_blood_pressure, body_weight, EEGamplitude)
which could be (72, 123, 89, 12) for an observation from a given person.
• It is a reduced-dimensional representation of the raw data, used for classifying and interpreting the observations.
• So we reduce the total amount of data we had, but the reduced data set is (hopefully) more effective in describing the essential properties of the data sources.
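The same example feature vector, written as a NumPy array (values taken from the slide's example observation):

```python
import numpy as np

# x = (heartrate, systolic_blood_pressure, body_weight, EEGamplitude)
x = np.array([72, 123, 89, 12])
print(x.shape)  # (4,) — one observation in a 4-dimensional feature space
```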
Classification
(sometimes also called pattern recognition)
Identify the class C to which an observation (pattern) x, from which a selected feature vector was calculated, belongs.
Patterns that are quantitatively similar may belong to the same class; in such cases clustering techniques or matching of patterns against prototypes can be used.
However, patterns that bear no obvious similarity to one another can also belong to the same class.
Statistical Pattern Recognition Techniques
A set of N features extracted from a pattern can be treated as
an N-dimensional vector x in an N-dimensional space. Such a
vector is called a feature vector and the N-dimensional space
is called feature space. Usually, the feature space is R^N, but it can also be a subspace of R^N.
For two classes, a linear discriminant function g(x) = w^T x separates them:
w^T x > 0 if x ∈ C1
w^T x < 0 if x ∈ C2
For an M-class separation problem we’ll need M − 1 such linear discriminant functions.
Decision Functions*
In the general case the decision region to which a feature vector belongs is determined by a decision function (which is not necessarily linear). A decision function g_i(x) is defined for each decision region i. Often, the feature vector x is assigned to the region m for which the relationship

g_m(x) > g_i(x) for all i ≠ m

holds.
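A minimal sketch of this assignment rule, using two made-up linear decision functions g_i(x) = w_i^T x:

```python
import numpy as np

W = np.array([[1.0, -0.5],    # w_1 for region 1
              [-0.3, 0.8]])   # w_2 for region 2

def classify(x):
    g = W @ x                  # g_i(x) = w_i^T x for each region i
    return int(np.argmax(g))   # region m with g_m(x) >= g_i(x) for all i

print(classify(np.array([2.0, 1.0])))  # -> 0 (region 1)
```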
Estimation of discriminant functions (2/2)
2. Minimize the expected error the classifier will make
http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Minimum Distance Classifier
An example of a feature space with two decision regions, their decision functions being:

d_1(x) = 1 / ||x − x_1||   and   d_2(x) = 1 / ||x − x_2||
The feature vector is assigned to the decision region with the largest decision
function value; this is a minimum distance classifier.
[Figure: prototypes x_1 (class C1) and x_2 (class C2) in feature space; the decision boundary lies where d_1 = d_2.]
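A minimal sketch of such a classifier in NumPy; the prototype positions x_1 and x_2 are made-up example values:

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],   # x_1, prototype of class C1
                       [4.0, 3.0]])  # x_2, prototype of class C2

def min_distance_classify(x):
    # Assign x to the nearest prototype (equivalently, largest d_i(x) = 1/||x - x_i||)
    distances = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(distances))

print(min_distance_classify(np.array([1.0, 1.0])))  # -> 0 (class C1)
```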
k-Nearest Neighbour Classifiers
Suppose we have N sample patterns {s1, s2, ..., sN} that we know belong
to one of M classes {C1, C2, ..., CM}.
For a new pattern x that we want to classify, we calculate the distances D(s_i, x) and assign x to the class of the sample pattern that is closest to x (the ’nearest neighbour’). For k > 1, x is assigned to the class most common among its k nearest neighbours.
• http://vision.stanford.edu/teaching/cs231n-demos/knn/
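A minimal sketch with NumPy; the sample patterns, labels, and k are made-up example values:

```python
import numpy as np

samples = np.array([[0.0, 0.0], [0.5, 0.5], [4.0, 4.0], [4.5, 3.5]])
labels  = np.array([0, 0, 1, 1])   # classes C1 = 0, C2 = 1

def knn_classify(x, k=3):
    distances = np.linalg.norm(samples - x, axis=1)   # D(s_i, x)
    nearest = labels[np.argsort(distances)[:k]]       # labels of the k nearest
    return int(np.bincount(nearest).argmax())         # majority vote

print(knn_classify(np.array([1.0, 1.0])))  # -> 0 (closest to the C1 samples)
```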
Clustering Methods
• Sometimes we don’t want (or cannot) assign classes beforehand. We
may then examine possible formation of inherent groups or clusters of
the data in feature space and (possibly) label the clusters with class
labels.
• This may be relatively easy in 2D, but for higher-dimensional feature spaces this is a difficult task.
• We have to group the feature vectors on the basis of similarity,
dissimilarity or distance measures.
• Examples of commonly used measures are the Euclidean distance, the normalised dot product, or, more sophisticated, the Mahalanobis distance (many more exist).
Distance measures
• Euclidean distance between feature vectors x and z:
  D_E^2 = ||x − z||^2 = (x − z)^T (x − z) = Σ_{i=1}^{n} (x_i − z_i)^2
• Normalised dot product:
  D_d = (x^T z) / (||x|| ||z||)
• Mahalanobis distance:
  D_M^2 = (x − z)^T S^{−1} (x − z), where S is the covariance matrix of the data
Note: for k-means clustering one has to define the number of clusters and the initial cluster centres beforehand.
http://shabal.in/visuals/kmeans/6.html
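A minimal sketch of the three measures in NumPy (the vectors and the data sample used to estimate the covariance are synthetic):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 4.0])

# Euclidean distance
d_e = np.linalg.norm(x - z)

# Normalised dot product (a similarity measure; 1 means same direction)
d_d = (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

# Mahalanobis distance: needs the covariance of the data the vectors come from
X = np.random.default_rng(0).normal(size=(100, 3))   # synthetic data sample
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_m = np.sqrt((x - z) @ S_inv @ (x - z))

print(d_e, d_d, d_m)
```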
Mapping to higher dimensions
• As we saw, feature extraction is typically done to go from high-dimensional data to more manageable and effective lower-dimensional data.
• However, it may sometimes be useful to go in the other direction, to higher dimensions.
• Samples that are not linearly separable in a low dimension may become so when transformed to a higher-dimensional space.
• A classic approach that uses this idea successfully is the Support Vector Machine (SVM) classification algorithm.
Data that are not linearly separable in the original space may become linearly separable when projected to a higher-dimensional space.
https://medium.com/analytics-vidhya/how-to-classify-non-linear-data-to-linear-data-bb2df1a6b781
A short look ahead to neural networks…
ANN (Artificial Neural Network)-based classifiers
Many artificial neural networks operate similarly to methods used in the statistical
approach, i.e., they try to estimate decision regions in feature space. We can
distinguish:
• probabilistic classifiers that use a priori information about distributions and use supervised learning to estimate their parameters
• hyperplane classifiers that construct hyperplane boundaries by calculating a weighted sum of inputs and passing this through a non-linear function (see the sketch after this list)
• kernel (receptive-field) classifiers that establish receptive fields across feature space; if an input is near the centre of a receptive field of a processing unit, the output of that unit will be high
• exemplar classifiers that use the distance between input pattern and training exemplars to come to classifications
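A minimal sketch of a single processing unit of such a hyperplane classifier; the weights and bias here are made-up (in practice they are learned from data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.8, -1.2, 0.5])   # hypothetical learned weights
b = 0.1                          # hypothetical bias

def ann_unit(x):
    # Weighted sum of inputs passed through a non-linear function;
    # the output is near 1 on one side of the hyperplane w^T x + b = 0
    return sigmoid(w @ x + b)

x = np.array([1.0, 0.5, 2.0])
print(ann_unit(x))               # ~0.79 -> assign to class 1 if > 0.5
```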
Use for Recognition and Classification
generate suitable decision regions in feature space by
looking for the right weight configuration
[Figure: (a) decision regions realised by a linear classifier; (b) decision regions realised by a non-linear classifier.]
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Curse of dimensionality
A nice explanation (with cats and dogs) can be found here:
https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
The higher-dimensional our feature space gets, the sparser its contents become (the number of available measured data points per ‘feature-space hypervolume unit’ decreases). This increases the risk of overfitting.
One way to deal with this issue is to reduce the dimensionality of the feature space. We will continue with this in Lecture 7.
Features can be obtained in different ways from processing the signals, for example:
• summary measures
• results from mathematical transforms or projections onto lower dimensions
• descriptors of underlying shape or structure
https://damassets.autodesk.net/content/dam/autodesk/www/autodesk-reasearch/Publications/pdf/same-stats-different-graphs.pdf