
BBT.HTI.509/NBE-E4080 Decision Support in Healthcare

Lecture 5 – Data-driven approaches – concepts: features and feature space

Mark van Gils


Platforms, tools for Biomedical Research & Decision Support in practice
• healthcare
• research
• citizens

Modelling Approaches (top-down), giving mechanistic predictions & biomarkers:
• Biophysical FE Models
• Pathways
• Biosignals Modelling

Data-driven Approaches (bottom-up), giving data-driven features, biomarkers, indices:
• Statistical associations
• Machine Learning, AI
• Feature generation, indices
Data-driven approaches – traditional uses
• Provided that we have enough data available, we can do, for example:
  • Classification (diagnosis, stratification, etc.)
  • Regression (prediction, uncovering relationships)
  • Time-series analysis (trend and event detection)
  • Clustering (stratification, data reduction)
  • Data exploration, visualization (knowledge discovery, hypothesis generation, sanity checks)

Problem setting for classification
• We have many variables recorded from different people (e.g. patients). How can we build a useful system that uses those variables as input and generates outputs telling us to which class each person belongs (e.g. a disease class or the healthy class)?

• We often start with whatever variables may be expected to be useful and then try to figure out how that variable set (or part of it) can be associated with known example outputs via a mathematical function (a ‘classifier’) that maps inputs to outputs.
From raw data to features
• Raw data (e.g., separate samples in a signal, pixels in an image) describe information at a very detailed level, meaning we have (very) many measurements.
• We usually want to reduce the amount of data we work with for classification purposes:
  • to lessen computing resource requirements
  • to pre-extract useful descriptors from the data that powerfully summarize the main properties of the signal and make building classifiers easier
  • to avoid the curse of dimensionality
  • to lessen the chance of overfitting (fitting on noise)
https://derangedphysiology.com

We can calculate ‘summary descriptors’ that efficiently capture the character, or essential information, from a raw signal.
We call them features.
https://doi.org/10.3389/fnbot.2017.00019
Features can be obtained in many different ways by processing the signals, for example:

Summary measures:
- RMS amplitude
- offset
- mean, median
- standard dev
- range, percentiles
- spectral entropy, bicoherence…

Results from mathematical transforms or projections onto lower dimensions:
- regression coeff
- Fourier transform coefficients
- wavelet transform coefficients
- Principal Components

Descriptors of underlying shape or structure (shape descriptors describing how the signal visually looks when plotted):
- peaks
- slopes
- 'spikyness’, ‘smoothness’
- ‘flatness’
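As an illustrative sketch (not part of the lecture material), the Python snippet below computes a few of the summary measures and transform-based features listed above for a one-dimensional signal; the function name, sampling rate and test signal are this example's own assumptions.

```python
import numpy as np
from scipy.signal import welch

def extract_features(x, fs=250.0):
    """Compute a few simple summary and transform-based features of a 1-D signal."""
    feats = {
        "rms": np.sqrt(np.mean(x ** 2)),      # RMS amplitude
        "mean": np.mean(x),                   # offset / mean level
        "median": np.median(x),
        "std": np.std(x),
        "range": np.ptp(x),
        "p5_p95": np.percentile(x, 95) - np.percentile(x, 5),
    }
    # Spectral feature from Welch's power spectral density estimate
    f, pxx = welch(x, fs=fs, nperseg=min(256, len(x)))
    p = pxx / np.sum(pxx)                     # normalise to a distribution
    feats["spectral_entropy"] = -np.sum(p * np.log2(p + 1e-12))
    # First few Fourier coefficient magnitudes as transform-based features
    feats["fft_mag_1_3"] = np.abs(np.fft.rfft(x))[1:4]
    return feats

# Example: features of 4 seconds of a noisy 10 Hz oscillation
t = np.arange(0, 4, 1 / 250.0)
signal = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)
print(extract_features(signal))
```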
Feature Extraction
reduces the amount of raw data while maintaining (or enhancing) the vital information for the classification task at hand;
features can range from low-level features (typically measurements, e.g., numerical values like signal amplitudes) to high-level features (e.g., a 'symptom' feature)

Feature Selection
reduces the feature set even further in size: select those features that are most relevant to the classification task at hand
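A hedged sketch of automated feature selection (one possible tool, not prescribed by the slides): scikit-learn's SelectKBest keeps the features whose ANOVA F-score against the class labels is highest; the data here are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 200 'patients', 20 candidate features, only 5 informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features whose ANOVA F-score against the class labels is highest
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("original shape:", X.shape)           # (200, 20)
print("selected shape:", X_selected.shape)  # (200, 5)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```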
Feature engineering
• This process (feature extraction and selection) is referred to as feature
engineering. It can be done manually or (semi)-automated. Often using a
combination of pre-existing knowledge, common sense, creativity, and trial and
error. It is application and data-source dependent.
• It is a critical step, since after doing this, the information that describes, e.g., your
patient, is solely covered by the features you chose. If they were chosen poorly,
you are stuck with that, and you get garbage-in garbage-out in your classifiers.
• The choice of features is often much more important than the flavour of AI/ML
method you use.
• In some neural-network research circles, feature engineering may be rumoured
to be obsolete, since a neural network may ‘learn all needed information itself’.
However, this requires lots and lots of relevant example data and acceptance of
black-box approaches. Image recognition and speech recognition are notable
examples where this may work. In many higher-level healthcare applications
where decisions are to be made, it is not an option.
Combining features into Feature Vectors
• a combination of features is used to effectively describe an observation of raw data (a ’pattern’); the combination of features can be written as a feature vector (feature_1, feature_2, …, feature_n), e.g.
  x = (heartrate, systolic_blood_pressure, body_weight, EEGamplitude)
  could be (72, 123, 89, 12) for an observation from a given person
• it is a reduced-dimensional representation of the raw data, used for classifying and interpreting the observations
• So, we reduce the total amount of data we had. But the reduced data set is (hopefully) more effective in describing the essential properties of the data sources.
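To make the notation concrete, here is a minimal sketch (the first row uses the example values above; the other rows and all names are invented) showing how such feature vectors are stacked into the observations-by-features matrix that classifiers usually expect:

```python
import numpy as np

# One observation: (heartrate, systolic_blood_pressure, body_weight, EEGamplitude)
x = np.array([72, 123, 89, 12])

# Several observations stacked row-wise: shape (n_observations, n_features).
# The extra rows are made-up values purely for illustration.
X = np.array([
    [72, 123, 89, 12],
    [88, 141, 95, 20],
    [61, 110, 70,  9],
])
print(X.shape)  # (3, 4): 3 people described by 4 features each
```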
Classification
(sometimes also called pattern recognition)
Identify to which class, C, an observation (pattern) x, from which a selected feature vector was calculated, belongs.

Patterns that are quantitatively similar may belong to the same class; in such cases clustering techniques or matching of patterns against prototypes can be used.
However, patterns that bear no obvious similarity to one another can also belong to the same class.
Statistical Pattern Recognition Techniques
A set of N features extracted from a pattern can be treated as an N-dimensional vector x in an N-dimensional space. Such a vector is called a feature vector and the N-dimensional space is called feature space. Usually, the feature space is ℝ^N, but it can also be a subspace of ℝ^N.

Classifications are based on the determination of decision regions. The feature space is thus divided into decision regions that are associated with classes.
The borders between decision regions are called decision boundaries.
Decision Regions
The classification is performed by determining the decision region in which the applied feature vector falls. The main problem in constructing a good classifier is how to find the boundaries between these regions and implement and use them in an algorithm.

Examples of decision regions and boundaries in a 2-dimensional feature space:
a: decision regions separated by linear decision boundaries;
b: a general division of feature space into decision regions.
Linear Separation of Decision Regions
Take an n-dimensional feature vector x (with x_0 set to 1) and a weight vector w:

$$ y = w_0 \cdot 1 + w_1 x_1 + \dots + w_n x_n $$

2-class separation problem:

$$ d(\mathbf{x}) = \mathbf{w}^T \mathbf{x} \;
\begin{cases} > 0 & \text{if } \mathbf{x} \in C_1 \\ < 0 & \text{if } \mathbf{x} \in C_2 \end{cases} $$

For an M-class separation problem we’ll need M−1 such linear discriminant functions.

[Figure: a 2-dimensional feature space (axes $x_1$, $x_2$) split by the decision boundary $\mathbf{w}^T\mathbf{x} = 0$, with $\mathbf{w}^T\mathbf{x} > 0$ on the $C_1$ side and $\mathbf{w}^T\mathbf{x} < 0$ on the $C_2$ side.]
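A minimal numerical sketch of such a linear discriminant (the weights and the test vector are invented for illustration):

```python
import numpy as np

def linear_discriminant(x, w):
    """d(x) = w^T x, with x augmented by x0 = 1."""
    x_aug = np.concatenate(([1.0], x))   # prepend x0 = 1 so w[0] acts as the offset w0
    return np.dot(w, x_aug)

w = np.array([-1.0, 0.8, -0.5])          # (w0, w1, w2), chosen arbitrarily
x = np.array([2.0, 1.0])                 # a 2-dimensional feature vector

d = linear_discriminant(x, w)
print("class C1" if d > 0 else "class C2", f"(d(x) = {d:.2f})")
```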
Decision Functions*
In the general case the decision region to which a feature vector belongs is determined by a decision function (which is not necessarily linear). A decision function $g_i(\mathbf{x})$ is defined for each decision region i. Often, the feature vector x is assigned to region m for which the relationship

$$ g_m(\mathbf{x}) > g_i(\mathbf{x}) \quad \forall\, i \neq m $$

holds.

The linear discriminant function of the previous slide is a special (but popular) choice.

*Often, also the term ’Discriminant Functions’ is used.
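In code, the assignment rule is simply an argmax over the decision-function values; the sketch below uses linear g_i only as one possible choice, with made-up weights:

```python
import numpy as np

def classify(x, G):
    """Assign x to the region m whose decision function g_m(x) is largest.

    G is a matrix whose i-th row holds the weights of a linear g_i(x) = G[i] @ [1, x].
    """
    x_aug = np.concatenate(([1.0], x))
    scores = G @ x_aug                    # g_i(x) for every region i
    return int(np.argmax(scores))

G = np.array([[ 0.0,  1.0,  0.0],         # g_0: favours large x1
              [ 0.0,  0.0,  1.0],         # g_1: favours large x2
              [ 0.5, -1.0, -1.0]])        # g_2: favours small x1 and x2
print(classify(np.array([2.0, 0.5]), G))  # -> 0
```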


Estimation of discriminant functions (1/2)
Many methods exist to estimate optimal decision regions and discriminant functions. They can be roughly grouped into two approaches:
1. Convert the so-called a priori class probability P(Ci) into an a posteriori probability based on measured patterns x: P(Ci|x). This is called the (naive) Bayesian approach [naive: if we assume the x measurements are independent].
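As a hedged illustration (the slides do not name a specific implementation), scikit-learn's GaussianNB follows this recipe: it combines class priors P(Ci) with per-feature Gaussian likelihoods, assumed independent, into posteriors P(Ci|x):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic 2-class data standing in for patient feature vectors
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

clf = GaussianNB().fit(X, y)              # estimates P(Ci) and per-feature Gaussians
print(clf.class_prior_)                   # learned a priori class probabilities
print(clf.predict_proba(X[:3]))           # a posteriori probabilities P(Ci | x)
print(clf.predict(X[:3]))                 # class with the highest posterior
```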

Estimation of discriminant functions (2/2)
2. Minimize the expected error the classifier will make.

This can be done, e.g., iteratively (try the interactive demo below; a code sketch follows it):

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
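One hedged example of the second approach (not the method behind the linked demo): scikit-learn's SGDClassifier iteratively adjusts the weights of a linear discriminant to reduce a classification loss on the training examples:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=1)

# A linear classifier fitted by stochastic gradient descent: each iteration
# nudges the weights so as to reduce the (expected) classification error.
clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=1).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("learned weights:", clf.coef_, "offset:", clf.intercept_)
```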
Minimum Distance Classifier
An example of a feature space with two decision regions, their decision functions being:

$$ d_1(\mathbf{x}) = \frac{1}{\|\mathbf{x} - \mathbf{x}_1\|} \quad \text{and} \quad d_2(\mathbf{x}) = \frac{1}{\|\mathbf{x} - \mathbf{x}_2\|} $$

The feature vector is assigned to the decision region with the largest decision-function value; this is a minimum distance classifier.

[Figure: a 2-dimensional feature space with prototypes x1 (region C1) and x2 (region C2); the decision boundary is the line where d1 = d2, i.e. equidistant from both prototypes.]
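A minimal numpy sketch of the same rule, with invented prototype locations; note that maximising $d_i(\mathbf{x}) = 1/\|\mathbf{x}-\mathbf{x}_i\|$ is equivalent to minimising the distance to the prototype:

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],    # x1, prototype of class C1
                       [3.0, 3.0]])   # x2, prototype of class C2

def minimum_distance_classify(x, prototypes):
    """Assign x to the class whose prototype is nearest (largest 1/distance)."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(distances))   # equivalently: argmax of 1/distances

print(minimum_distance_classify(np.array([1.0, 0.5]), prototypes))  # -> 0 (C1)
print(minimum_distance_classify(np.array([2.5, 3.5]), prototypes))  # -> 1 (C2)
```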
k-Nearest Neighbour Classifiers
Suppose we have N sample patterns {s1, s2, ..., sN} that we know belong to one of M classes {C1, C2, ..., CM}.

For a new pattern x to be classified, we calculate the distance D(si, x) and assign x to the class of the sample pattern that is closest to x (the ’nearest neighbour’).

This is not a very stable way of classifying, as the classification is made on the basis of one single sample pattern (which may be an outlier).

A more reliable estimate may be obtained by classifying using several sample patterns, the k nearest neighbours, and then using a majority vote. This is the k-nearest-neighbour (k-NN) rule.

[Figure: decision boundaries for k=1 (observe the effect a single blue outlier has on the boundary placement) and for k=3.]

•http://vision.stanford.edu/teaching/cs231n-demos/knn/
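A hedged scikit-learn sketch of the k-NN rule on synthetic data (k = 3, as in the figure; all data are generated for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Majority vote among the 3 nearest training samples (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```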
Clustering Methods
• Sometimes we do not want to (or cannot) assign classes beforehand. We may then examine the possible formation of inherent groups or clusters of the data in feature space and (possibly) label the clusters with class labels.
• This may be relatively easy in 2D, but for higher-dimensional feature spaces it is a difficult task.
• We have to group the feature vectors on the basis of similarity, dissimilarity or distance measures.
• Examples of commonly used measures are the Euclidean distance, the normalised dot product, or, fancier, the Mahalanobis distance (there exist many more).
Distance measures
• Euclidean distance, between feature vectors x and z:

$$ D_E^2 = \|\mathbf{x} - \mathbf{z}\|^2 = (\mathbf{x} - \mathbf{z})^T(\mathbf{x} - \mathbf{z}) = \sum_{i=1}^{n} (x_i - z_i)^2 $$

• Normalised dot product:

$$ D_d = \frac{\mathbf{x}^T \mathbf{z}}{\|\mathbf{x}\|\,\|\mathbf{z}\|} $$

• Mahalanobis distance, where x is a feature vector compared to a pattern class with m as the mean class vector and C as the covariance matrix:

$$ D_M^2 = (\mathbf{x} - \mathbf{m})^T C^{-1} (\mathbf{x} - \mathbf{m}), \qquad C = E\left[(\mathbf{y} - \mathbf{m})(\mathbf{y} - \mathbf{m})^T\right] $$
The expectation, E, is calculated over all vectors y that belong to the class. C
provides covariance of all possible pairs of features. The diagonal contains
variance of individual features. The matrix represents the ’scatter’ of features
belonging to a certain class.
Mahalanobis distance concept
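A minimal numpy/scipy sketch of the three measures above (the vectors and the class sample are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

x = np.array([2.0, 1.0])
z = np.array([0.5, 3.0])

# Euclidean distance
d_euclid = np.linalg.norm(x - z)

# Normalised dot product (cosine similarity of x and z)
d_dot = np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))

# Mahalanobis distance of x to a class described by sample vectors Y
Y = np.random.default_rng(0).normal(size=(100, 2))   # stand-in class members
m = Y.mean(axis=0)                                    # mean class vector
C = np.cov(Y, rowvar=False)                           # covariance matrix
d_mahal = mahalanobis(x, m, np.linalg.inv(C))

print(d_euclid, d_dot, d_mahal)
```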
Example of a clustering algorithm, the K-means algorithm
Goal: (iteratively) minimise the sum of squared distances from all points in a cluster domain to the cluster centre.

1. choose K initial cluster centres, z1(1), z2(1), ..., zK(1)
   (the index in parentheses indicates the iteration number)
2. at iteration k, distribute the samples {x} among the K clusters using

$$ \mathbf{x} \in S_j(k) \quad \text{if} \quad \|\mathbf{x} - \mathbf{z}_j(k)\| < \|\mathbf{x} - \mathbf{z}_i(k)\| \quad \forall\, i = 1,2,\dots,K,\; i \neq j $$

3. from the results of step 2, calculate new cluster centres zj(k+1) so that the sum of squared distances from all points in Sj(k) to the new cluster centre is minimised. The zj(k+1) that minimises this distance is simply the sample mean of Sj(k), so

$$ \mathbf{z}_j(k+1) = \frac{1}{N_j} \sum_{\mathbf{x} \in S_j(k)} \mathbf{x}, \qquad j = 1,2,\dots,K. $$

4. if zj(k+1) = zj(k) for j = 1,2,...,K the algorithm has converged: terminate the iteration, otherwise go to step 2.

Note: one has to define the number of clusters and the initial cluster centres beforehand.
http://shabal.in/visuals/kmeans/6.html
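A minimal numpy sketch that mirrors the four steps above on synthetic 2-D data (K = 2; initialising the centres from randomly chosen samples is just one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2-D clusters
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

K = 2
z = X[rng.choice(len(X), K, replace=False)]          # step 1: initial cluster centres

for _ in range(100):
    # step 2: assign every sample to its nearest cluster centre
    labels = np.argmin(np.linalg.norm(X[:, None, :] - z[None, :, :], axis=2), axis=1)
    # step 3: new centres are the sample means of each cluster
    z_new = np.array([X[labels == j].mean(axis=0) for j in range(K)])
    # step 4: stop when the centres no longer move
    if np.allclose(z_new, z):
        break
    z = z_new

print("cluster centres:\n", z)
```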
Mapping to higher dimensions
• As we saw, feature extraction is typically done to go from high-dimensional data to more manageable and effective lower-dimensional data.
• However, it may sometimes be useful to go in the other direction, to higher dimensions.
• Samples that are not linearly separable in a low dimension may become so when transformed to a higher-dimensional space.
• A classic approach that uses this idea successfully is the Support Vector Machine (SVM) family of classification algorithms.
Data that are not linearly separable in the original space may become linearly separable when projected to a higher-dimensional space.

https://medium.com/analytics-vidhya/how-to-classify-non-linear-data-to-linear-data-bb2df1a6b781
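A hedged scikit-learn sketch: an SVM with an RBF kernel implicitly performs such a mapping, so data that defeat a linear boundary in the original 2-D space (here, two concentric circles) become separable:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)      # implicit mapping to a higher dimension

print("linear kernel accuracy:", linear_svm.score(X, y))   # typically near chance level
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # close to 1.0
```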
A short look ahead to neural networks…

Rosenblatt’s Perceptron (1960)

ANN (Artificial Neural Network)-based classifiers
Many artificial neural networks operate similarly to methods used in the statistical approach, i.e., they try to estimate decision regions in feature space. We can distinguish:

• probabilistic classifiers that use a priori information about distributions and use supervised learning to estimate their parameters
• hyperplane classifiers that construct hyperplane boundaries by calculating a weighted sum of inputs and passing this through a non-linear function
• kernel (receptive-field) classifiers that establish receptive fields across feature space; if an input is near the centre of a receptive field of a processing unit, the output of that unit will be high
• exemplar classifiers that use the distance between input pattern and training exemplars to come to classifications
Use for Recognition and Classification
Generate suitable decision regions in feature space by looking for the right weight configuration.

[Figure: a: decision regions realised by a linear classifier; b: decision regions realised by a non-linear classifier.]

Provided you have enough example data, ANNs can be very powerful in estimating the tricky regions of Fig. b.
https://playground.tensorflow.org/
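As a hedged preview (the playground link above is the interactive version of the same idea), a small multilayer perceptron in scikit-learn can carve out such non-linear decision regions:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Two interleaving half-moons: a non-linear decision boundary is needed
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000,
                    random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```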
Scikit-learn machine learning library

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Curse of dimensionality
A nice explanation (with cats and dogs 🙂) can be found here:
https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

The higher dimensional our feature space gets, the sparser its contents become (available measured data points per ‘feature space hypervolume unit’ decreases). This increases the risk of overfitting.

One way to deal with this issue is reduction of the dimensions of feature space. We will continue with this in Lecture 7.

Features can be obtained in different ways by processing the signals, for example (recap):

Summary measures:
- RMS amplitude
- offset
- mean, median
- standard dev
- range, percentiles
- spectral entropy, bicoherence…

Results from mathematical transforms or projections onto lower dimensions:
- regression coeff
- Fourier tr. coeff
- wavelet tr. coeff
- AR model coeffs
- PCA, ICA, LVQ
- ….

Descriptors of underlying shape or structure (label presence of):
- peaks
- slopes
- 'spikyness'
- ‘flatness’
Examples of the many features that can be
obtained from a ‘simple’ ECG curve
Feature Vectors - recap
•a combination of features is used to effectively describe an
observation of raw data (a ’pattern’); the combination of
features can be considered as a vector
•it is a reduced-dimensional representation of the raw data
used for classifying and interpreting the observations
•mathematically: a projection from a higher dimension to a
lower dimension
• We in effect throw away information, but we may lessen the
Curse of Dimensionality effect
• The projection to a lower dimension may allow us to understand
the problem better
Keep in mind: as we are now projecting from higher
dimensions to lower ones; even if feature vectors are
similar, it does not mean that the original raw data was
similar – visualise raw data before doing anything else!

https://damassets.autodesk.net/content/dam/autodesk/www/autodesk-reasearch/Publications/pdf/same-stats-different-graphs.pdf
