
Preprocessing for Financial Data

Instructor: 翁詠祿
Date: 10/18/2022, 11/01/2022

Department of Electrical Engineering


National Tsing-Hua University, HsinChu, Taiwan
Outline
q Data Preprocessing (Ch. 4, [2])
q Dimensionality Reduction For Data (Ch. 5, [2])
q Financial Data Structure (Ch. 2, [1][3])
q Labeling (Ch. 3, [1][3])

[1] Marcos Lopez de Prado, Advances in Financial Machine Learning, Wiley, 2018
[2] Sebastian Raschka and Vahid Mirjalili, Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition, 2019
[3] Stefan Jansen, Machine Learning for Algorithmic Trading, Second Edition, 2020

Data Preprocessing

Department of Electrical Engineering


National Tsing-Hua University, HsinChu, Taiwan
Outline
q Dealing with missing data
q Handling categorical data
q Feature scaling
q Selecting meaningful features
q Assessing feature importance with random forests

Outline
q Dealing with missing data
Ø Identifying missing values in tabular data
Ø Eliminating training data with missing values
Ø Imputing missing values
q Handling categorical data
q Feature scaling
q Selecting meaningful features
q Assessing feature importance with random forests

Dealing with missing data
q It is not uncommon in real-world applications for training examples to be missing one or more values.

q There could have been an error in the data collection process, certain measurements may not be applicable, or particular fields could simply have been left blank ("NaN" or NULL) in a survey.

q There are two ways to deal with missing data:
1. Removing entries (rows or columns) from the dataset.
2. Imputing missing values from other training examples and features.

Identifying missing values in tabular data
q Example: consider the following data matrix:

q There are two "NaN" entries, which indicate missing data.
Ø Columns C and D each contain one "NaN".
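A minimal pandas sketch of this check (the small DataFrame below is a hypothetical stand-in for the matrix shown on the slide):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 matrix with two NaN entries, mirroring the slide's example
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [10.0, 11.0, 12.0, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

print(df.isnull().sum())   # number of missing values per column (C: 1, D: 1)
```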

Eliminating training data with missing values
q There are five ways to remove missing values:

Original data
1. Remove rows that contain missing values:

2. Remove columns that contain missing values:

Eliminating training data with missing values
3. Only drop rows where all columns are NaN:

4. Drop rows that have fewer than 3 non-NaN values:

5. Only drop rows where NaN appears in specific columns (a pandas sketch covering all five options follows below):
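Assuming the same hypothetical DataFrame df from the sketch above, the five removal options map onto pandas DataFrame.dropna as follows:

```python
# 1. Remove rows that contain missing values
df.dropna(axis=0)

# 2. Remove columns that contain missing values
df.dropna(axis=1)

# 3. Only drop rows where all columns are NaN
df.dropna(how='all')

# 4. Drop rows that have fewer than 3 non-NaN values
df.dropna(thresh=3)

# 5. Only drop rows where NaN appears in specific columns (e.g., 'C')
df.dropna(subset=['C'])
```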

Imputing missing values
q Often, the removal of training examples or the dropping of entire feature columns is simply not feasible, because we might lose too much valuable data.

q One of the most common techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column.

Original data
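A minimal sketch of mean imputation with scikit-learn's SimpleImputer, assuming a small array with the same missing-value pattern:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

# Replace each NaN with the mean of its feature column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The pandas one-liner `df.fillna(df.mean())` achieves the same column-wise mean imputation.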

Outline
q Dealing with missing data
q Handling categorical data
Ø Categorical data encoding
Ø Mapping ordinal features
Ø Encoding class labels
Ø One-hot encoding on nominal features
q Partitioning a dataset
q Feature scaling
q Selecting meaningful features
q Assessing feature importance with random forests

Categorical data encoding
q Categorical data can be further divided into ordinal and nominal features.
Ø Ordinal features: categorical values that can be sorted or ordered.
Ø Nominal features: categorical values that don't imply any order.

q For example, t-shirt size would be an ordinal feature, but t-shirt color would be a nominal feature.
Ø Size defines an order: XL > L > M.
Ø Color does not define an order (e.g., red versus blue).

Categorical data encoding
q Example dataset containing nominal and ordinal features:

q This dataset contains:
Ø A nominal feature (color).
Ø An ordinal feature (size).
Ø A numerical feature (price).

q The class labels are stored in the last column.

Mapping ordinal features
q To make sure that ordinal features are interpreted correctly, we need to convert the categorical string values into integers.
Ø XL → 3, L → 2, M → 1
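A minimal sketch of this mapping, assuming the three-row color/size/price example used in [2]:

```python
import pandas as pd

# Hypothetical dataset with a nominal (color), an ordinal (size) and a numerical (price) feature
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']],
                  columns=['color', 'size', 'price', 'classlabel'])

# Map the ordinal size feature to integers: XL > L > M
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)
```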

Encoding class labels
q Many machine learning libraries require that class labels are encoded as integer values.

q To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously.

q We need to remember that class labels are not ordinal, so it doesn't matter which integer we assign to a particular string label.

Encoding class labels
q Transform the class labels into integers:
Ø class1 → 0 ; class2 → 1
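A minimal sketch using scikit-learn's LabelEncoder, assuming the class-label column from the example above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class-label column from the previous example
y = np.array(['class2', 'class1', 'class2'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)         # class1 -> 0, class2 -> 1
print(y_encoded)                        # [1 0 1]
print(le.inverse_transform(y_encoded))  # back to the string labels
```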

One-hot encoding on nominal features
q If we encode a nominal feature (e.g., color) as plain integers, a learning algorithm will assume an order that does not exist, because nominal features don't have any particular order.

q The idea behind one-hot encoding is to create a new dummy feature for each unique value in the nominal feature column.

q In this example, we need to convert color into a one-hot encoding.

One-hot encoding on nominal features
q Apply a one-hot encoder to the color column:
Ø blue = 0 ; green = 1 ; red = 2

q Transform columns in a multi-feature array:
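A minimal sketch of one-hot encoding the nominal color column; both the scikit-learn and the pandas routes are shown, assuming the same toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([['green', 2, 10.1],
                   ['red', 3, 13.5],
                   ['blue', 1, 15.3]],
                  columns=['color', 'size', 'price'])

# Option 1: one-hot encode only the nominal 'color' column, pass the rest through
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['color'])],
                       remainder='passthrough')
print(ct.fit_transform(df))

# Option 2: pandas shortcut that creates one dummy column per color value
print(pd.get_dummies(df, columns=['color']))
```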

Outline
q Dealing with missing data
q Handling categorical data
q Feature scaling
Ø Normalization and standardization
q Selecting meaningful features
q Assessing feature importance with random forests

Feature scaling
q Feature scaling is a crucial step that can easily be
forgotten.

q Decision trees and random forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling (they are scale-invariant).

q The majority of machine learning and optimization algorithms behave much better if features are on the same scale (e.g., gradient descent optimization).

Normalization and Standardization
q There are two common approaches to bring different
features onto the same scale:
1. Normalization (min-max scaling):

$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}, \quad \text{the range is } [0, 1]$$

2. Standardization:

$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$
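A minimal sketch of both approaches with scikit-learn's MinMaxScaler and StandardScaler on a hypothetical feature matrix (in practice the scalers should be fit on the training data only and then applied to the test data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix (rows = examples, columns = features)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

X_norm = MinMaxScaler().fit_transform(X)   # min-max scaling into [0, 1]
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

print(X_norm)
print(X_std)
```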

Outline
q Dealing with missing data
q Handling categorical data
q Partitioning a dataset
q Feature scaling
q Selecting meaningful features
Ø L1 and L2 regularization as penalties
Ø A geometric interpretation of L2 regularization
Ø Sparse solutions with L1 regularization
Ø Sequential feature selection algorithms
q Assessing feature importance with random forests

Selecting meaningful features
q If a model performs much better on a training dataset
than on the test dataset, this observation is a strong
indicator of overfitting.

q Common solutions to reduce the generalization error are as follows:
Ø Collect more training data
Ø Introduce a penalty for complexity via regularization
Ø Choose a simpler model with fewer parameters
Ø Reduce the dimensionality of the data

L1 and L2 regularization as penalties
q L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights:

$$L2: \; \|\mathbf{w}\|_2^2 = \sum_{j=1}^{m} w_j^2$$

q L1 regularization usually yields sparse feature vectors, and most feature weights will be zero:

$$L1: \; \|\mathbf{w}\|_1 = \sum_{j=1}^{m} |w_j|$$

q Sparsity can be useful in practice if we have a high-dimensional dataset with many features that are irrelevant.

A geometric interpretation of L2 regularization
q Without penalty :

A geometric interpretation of L2 regularization
q With L2 regularization :

Sparse solutions with L1 regularization

Sparse solutions with L1 regularization
q Before applying regularization, we use the Wine dataset to train the model:
Ø 13 different features.
Ø 178 wine examples.
Ø 3 classes with (class 0, class 1, class 2) = (59, 71, 48) examples.

Sparse solutions with L1 regularization
q All feature weights will be zero if we penalize the model with a strong regularization parameter (C < 0.01); C is the inverse of the regularization parameter λ.
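A minimal sketch of this effect on the Wine dataset, assuming an L1-penalized logistic regression where C is varied (parameter names follow scikit-learn):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Smaller C = stronger regularization = more weights pushed to exactly zero
for C in (0.01, 0.1, 1.0, 10.0):
    lr = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    lr.fit(X_std, y)
    n_zero = (lr.coef_ == 0).sum()
    print(f"C={C}: {n_zero} of {lr.coef_.size} weights are zero")
```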

Sequential feature selection algorithms
q Sequential feature selection:
Ø Greedy search algorithms.
• Select a subset of the original features.
• An alternative way to reduce the complexity of the model and avoid overfitting.
• Greedy search algorithms make locally optimal choices.

(k < d; d = dimensionality of the initial feature space; k = dimensionality of the feature subspace)

q The motivation behind feature selection algorithms:
Ø Automatically select a subset of features that are most relevant to the problem.
Ø Improve computational efficiency.
Ø Reduce the generalization error of the model by removing irrelevant features or noise.
Ø This is especially useful for algorithms that don't support regularization.
Sequential feature selection algorithms
q A classic sequential feature selection algorithm is sequential backward selection (SBS).
Ø It aims to reduce the dimensionality of the initial feature subspace with a minimum decay in the performance of the classifier.

q The idea behind SBS:
Ø Remove features sequentially from the full feature subset until the new feature subspace contains the desired number of features.

q In order to determine which feature is to be removed at each stage, we need to define the criterion function, J, that we want to minimize.

Sequential feature selection algorithms
q The feature to be removed at each stage can simply be defined as the feature that maximizes this criterion; or, in simpler terms, at each stage we eliminate the feature that causes the least performance loss after removal.

q The preceding definition of SBS can be outlined in four simple steps (reduce the number of features to the desired number):
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space.
2. For each feature in the current subset, calculate the criterion function J after removing that feature.
3. Remove the feature that maximizes the criterion; the dimensionality becomes k = k − 1.
4. Terminate if k equals the number of desired features; otherwise, go to Step 2.

Sequential feature selection algorithms
q SBS implementation using the KNN classifier:
Ø Dataset: Wine dataset.
Ø KNN with 5 neighbors; reduce the feature set to k = 3 features.

q The accuracy of KNN improved on the validation dataset as we reduced the number of features.
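scikit-learn's SequentialFeatureSelector is a greedy selector in the same spirit as SBS (it scores candidate subsets with cross-validation rather than a single validation split); a minimal sketch on the Wine dataset with a 5-neighbor KNN reduced to k = 3 features:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
sbs = SequentialFeatureSelector(knn, n_features_to_select=3, direction='backward')
sbs.fit(X_train_std, y_train)

cols = sbs.get_support(indices=True)          # indices of the selected features
knn.fit(X_train_std[:, cols], y_train)
print("selected feature indices:", cols)
print("test accuracy:", knn.score(X_test_std[:, cols], y_test))
```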

Outline
q Dealing with missing data
q Handling categorical data
q Partitioning a dataset
q Feature scaling
q Selecting meaningful features
q Assessing feature importance with random forests

Assessing feature importance with random forests
q We can measure feature importance as the average impurity decrease computed from all decision trees in the forest, without making any assumptions about whether our data is linearly separable or not.

q Use the Wine dataset and rank its 13 features:
Ø 500 decision trees.
Ø Each number represents how important the corresponding feature is; the importances sum up to 1.
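A minimal sketch of this ranking on the Wine dataset with a 500-tree random forest:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(data.data, data.target)

importances = forest.feature_importances_        # impurity-based, sums to 1
for idx in np.argsort(importances)[::-1]:
    print(f"{data.feature_names[idx]:<30} {importances[idx]:.4f}")
```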

Assessing feature importance with random forests
q Based on the average impurity decrease in the 500 decision trees, we can conclude that these are the most discriminative features in the dataset:
1. Proline
2. Flavanoids
3. Color intensity
4. OD280/OD315 of diluted wines
5. Alcohol

q If we simplify the model by using only these five features, the prediction accuracy declines only slightly.

Dimensionality Reduction for Data

Department of Electrical Engineering


National Tsing-Hua University, HsinChu, Taiwan
Outline
q Eigen values and Eigen vectors
q Unsupervised dimensionality reduction via PCA
q Supervised data compression via LDA
q Using KPCA for nonlinear mappings

Outline
q Eigen values and Eigen vectors
Ø Properties of eigen values and eigen vectors
Ø Eigen decomposition
q Unsupervised dimensionality reduction via PCA
q Supervised data compression via LDA
q Using KPCA for nonlinear mappings

Eigen values and Eigen vectors
q When the number of features in a dataset is large, data processing can become complex.
Ø PCA and LDA extract features based on this idea of reducing the dimensionality.
• PCA: Principal Component Analysis
• LDA: Linear Discriminant Analysis
Ø Both PCA and LDA require knowledge of eigenvalues and eigenvectors.

Eigen values and Eigen vectors
q Let A be any square matrix. A scalar λ is called an eigenvalue of A if there exists a nonzero (column) vector v (an eigenvector) such that:

$$A\mathbf{v} = \lambda\mathbf{v} \;\Rightarrow\; \lambda\mathbf{v} - A\mathbf{v} = 0 \;\Rightarrow\; (\lambda I - A)\mathbf{v} = 0$$

q Characteristic equation:
Ø It is solved to find a matrix's eigenvalues (the condition for a non-zero solution v):

$$\det(\lambda I - A) = 0$$

q Note: each scalar multiple kv of an eigenvector belonging to λ is also an eigenvector:

$$A(k\mathbf{v}) = k(A\mathbf{v}) = k(\lambda\mathbf{v}) = \lambda(k\mathbf{v})$$

Eigen values and Eigen vectors
q Calculate the two eigenvalues and eigenvectors of

$$A = \begin{bmatrix} 0.8 & 0.3 \\ 0.2 & 0.7 \end{bmatrix}
\;\Rightarrow\;
\det\begin{bmatrix} 0.8-\lambda & 0.3 \\ 0.2 & 0.7-\lambda \end{bmatrix}
= \lambda^2 - \frac{3}{2}\lambda + \frac{1}{2} = (\lambda - 1)\left(\lambda - \frac{1}{2}\right)$$

$$(A - I)\mathbf{x_1} = 0 \;\Rightarrow\; A\mathbf{x_1} = \mathbf{x_1} \;\Rightarrow\; \mathbf{x_1} = (0.6, 0.4)^T$$
$$\left(A - \tfrac{1}{2}I\right)\mathbf{x_2} = 0 \;\Rightarrow\; A\mathbf{x_2} = \tfrac{1}{2}\mathbf{x_2} \;\Rightarrow\; \mathbf{x_2} = (1, -1)^T$$

q Check:
$$A\mathbf{x_1} = \begin{bmatrix} 0.8 & 0.3 \\ 0.2 & 0.7 \end{bmatrix}\begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} = \mathbf{x_1} \;(\lambda = 1), \qquad A^{100}\mathbf{x_1} = A^{99}\mathbf{x_1} = \cdots = \mathbf{x_1}$$
$$A\mathbf{x_2} = \begin{bmatrix} 0.8 & 0.3 \\ 0.2 & 0.7 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \tfrac{1}{2}\mathbf{x_2} \;\left(\lambda = \tfrac{1}{2}\right), \qquad A^{100}\mathbf{x_2} = \left(\tfrac{1}{2}\right)^{100}\mathbf{x_2} \;(\text{a small value})$$

Eigen values and Eigen vectors
q Other vectors do change direction, but all other vectors are combinations of the two eigenvectors.
Ø Decompose the first column of A into the eigenvectors:

$$\begin{bmatrix} 0.8 \\ 0.2 \end{bmatrix} = \mathbf{x_1} + 0.2\,\mathbf{x_2} = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} + \begin{bmatrix} 0.2 \\ -0.2 \end{bmatrix}$$

$$A\begin{bmatrix} 0.8 \\ 0.2 \end{bmatrix} = A\mathbf{x_1} + 0.2\,A\mathbf{x_2} = \mathbf{x_1} + 0.2\cdot\tfrac{1}{2}\,\mathbf{x_2} = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix} = \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix}$$

q First column of A^100:

$$A^{99}\begin{bmatrix} 0.8 \\ 0.2 \end{bmatrix} = \mathbf{x_1} + \left(\tfrac{1}{2}\right)^{99} 0.2\,\mathbf{x_2} = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} + (\text{a very small vector})$$

Eigen values and Eigen vectors
q This shows that:
Ø The eigenvector x₁ is a steady state (since λ = 1).
Ø The eigenvector x₂ is a decaying mode (since λ = 0.5).

q The higher the power of A, the closer its columns approach the steady state.

Properties of eigen values and eigen vectors
q Let A be a square matrix. Then the following statements are equivalent:
1. A scalar λ is an eigenvalue of A.
2. The system (λI − A)v = 0 has nontrivial solutions (λI − A is a singular matrix).
3. There is a nonzero vector v in ℝⁿ such that Av = λv.
4. λ is a solution of the characteristic equation det(λI − A) = 0.

Eigen decomposition
q Some properties used in deriving the eigendecomposition:
Ø Similarity:
• Suppose A and B are square matrices for which there exists an invertible matrix P; then B is said to be obtained from A by a similarity transformation such that:

$$B = P^{-1}AP$$

Ø Linear dependence and independence:
• Let V be a vector space over a field ℱ.
• We say that the vectors v₁, v₂, …, vₘ in V are linearly dependent if there exist scalars a₁, a₂, …, aₘ in ℱ, not all of them 0, such that:

$$a_1\mathbf{v}_1 + a_2\mathbf{v}_2 + \cdots + a_m\mathbf{v}_m = 0$$

• Otherwise, we say that the vectors are linearly independent.

Eigen decomposition
q An n-square matrix A is similar to a diagonal matrix D if and only if A has n linearly independent eigenvectors.
Ø D = P⁻¹AP.
Ø The diagonal elements of D are the corresponding eigenvalues.
Ø P is the matrix whose columns are the eigenvectors (P is invertible).

Proof:
$$AP = A[\mathbf{x_1}, \ldots, \mathbf{x_n}] = [\lambda_1\mathbf{x_1}, \ldots, \lambda_n\mathbf{x_n}]
= [\mathbf{x_1}, \ldots, \mathbf{x_n}]\begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} = PD$$

q Therefore A = PDP⁻¹, and

$$A^k = (PDP^{-1})^k = PD^kP^{-1}$$

Example
q Let $A = \begin{bmatrix} 3 & 1 \\ 2 & 2 \end{bmatrix}$, and let $\mathbf{v_1} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$ and $\mathbf{v_2} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.

$$A\mathbf{v_1} = \begin{bmatrix} 3 & 1 \\ 2 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ -2 \end{bmatrix} = \begin{bmatrix} 1 \\ -2 \end{bmatrix} = \mathbf{v_1}, \qquad
A\mathbf{v_2} = \begin{bmatrix} 3 & 1 \\ 2 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \end{bmatrix} = 4\mathbf{v_2}$$

$$P = \begin{bmatrix} 1 & 1 \\ -2 & 1 \end{bmatrix}, \qquad
P^{-1} = \begin{bmatrix} \tfrac{1}{3} & -\tfrac{1}{3} \\ \tfrac{2}{3} & \tfrac{1}{3} \end{bmatrix}$$

$$D = P^{-1}AP = \begin{bmatrix} \tfrac{1}{3} & -\tfrac{1}{3} \\ \tfrac{2}{3} & \tfrac{1}{3} \end{bmatrix}\begin{bmatrix} 3 & 1 \\ 2 & 2 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}$$

$$A^4 = PD^4P^{-1} = \begin{bmatrix} 1 & 1 \\ -2 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 256 \end{bmatrix}\begin{bmatrix} \tfrac{1}{3} & -\tfrac{1}{3} \\ \tfrac{2}{3} & \tfrac{1}{3} \end{bmatrix} = \begin{bmatrix} 171 & 85 \\ 170 & 86 \end{bmatrix}$$
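The worked example can be checked numerically; a minimal NumPy sketch (np.linalg.eig returns the eigenvalues in no particular order and normalizes the eigenvectors, so they may differ from v₁ and v₂ by a scale factor):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues 1 and 4 (order not guaranteed)
print(eigvals)

P = eigvecs                           # columns are the eigenvectors
D = np.diag(eigvals)
print(np.allclose(A, P @ D @ np.linalg.inv(P)))                  # A = P D P^-1 -> True
print(np.allclose(np.linalg.matrix_power(A, 4),
                  P @ np.linalg.matrix_power(D, 4) @ np.linalg.inv(P)))  # True
```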
Outline
q Eigen values and Eigen vectors
q Unsupervised dimensionality reduction via PCA
Ø The main steps behind PCA
Ø Extracting the PCA
Ø Total and explained variance
Ø Feature transformation
Ø PCA in scikit-learn
q Supervised data compression via LDA
q Using KPCA for nonlinear mappings

The main steps behind PCA
q PCA, which is principal component analysis, is an
unsupervised linear transformation technique that is
widely used across different fields, most prominently for
feature extraction and dimensionality reduction.

q PCA helps us to identify patterns in data based on the


correlation between features.

q In a nutshell, PCA aims to find the directions of maximum


variance in high-dimensional data and projects the data
onto a new subspace with equal or fewer dimensions
than the original one.

The main steps behind PCA
q The orthogonal axes (principal components) of the new
subspace can be interpreted as the directions of maximum
variance given the constraint that the new feature axes are
orthogonal to each other.
More variance à More information

The main steps behind PCA
q If we use PCA for dimensionality reduction, we construct:
Ø A d×k-dimensional transformation matrix W (with k ≪ d).
Ø A mapping of a vector x, the features of a training example, onto a new k-dimensional feature subspace (fewer dimensions than the original):

$$\mathbf{x} = [x_1, x_2, \ldots, x_d], \quad \mathbf{x} \in \mathbb{R}^d$$
$$\mathbf{xW} = \mathbf{z}, \quad \mathbf{W} \in \mathbb{R}^{d \times k}$$
$$\mathbf{z} = [z_1, z_2, \ldots, z_k], \quad \mathbf{z} \in \mathbb{R}^k$$

The main steps behind PCA
q Summarize the PCA algorithm: [1][2]
1. Standardization:
• Standardize the range of the continuous initial variables so that
each one of them contributes equally to the analysis.

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

2. Covariance matrix computation:


• Understand how the variables of the input dataset are varying from
the mean with respect to each other, or in other words, to see if
there is any relationship between them.

$$\begin{bmatrix} Cov(x,x) & Cov(x,y) & Cov(x,z) \\ Cov(y,x) & Cov(y,y) & Cov(y,z) \\ Cov(z,x) & Cov(z,y) & Cov(z,z) \end{bmatrix}$$
(3×3 covariance matrix)

The main steps behind PCA
2. Covariance matrix computation:
• 𝐶𝑜𝑣 𝑎, 𝑎 = 𝑉𝑎𝑟 𝑎 , which 𝑎 is x, y, z.
• The result of covariance:

Ø Positive sign: the two variables increase or decrease together (correlated).
Ø Negative sign: one increases when the other decreases (inversely correlated).
Ø Zero: the two variables are not related (uncorrelated).

q However, the covariance matrix is no more than a table that summarizes the correlations between all possible pairs of variables.

The main steps behind PCA
3. Compute the eigen vectors and eigen values of the covariance
matrix:
• Eigen vectors and eigen values are the linear algebra concepts that
we need to compute from the covariance matrix in order to
determine the principal components of the data.

• Principal components are new variables that are constructed as


linear combinations or mixtures of initial variables.

• These combinations are done in such a way that the new


variables(i.e., principal components) are uncorrelated and most of
the information within the initial variables is squeezed or
compressed into the first components.

The main steps behind PCA
3. Compute the eigen vectors and eigen values of the covariance
matrix:
• 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until something like the scree plot below is obtained.

The main steps behind PCA
3. Compute the eigen vectors and eigen values of the covariance
matrix:
• Example: Suppose the dataset is 2-dimensional with 2 variables (x, y) and that the eigenvectors and eigenvalues of the covariance matrix are as follows:

$$\mathbf{v_1} = \begin{bmatrix} 0.6778736 \\ 0.7351785 \end{bmatrix}, \qquad \mathbf{v_2} = \begin{bmatrix} -0.7351785 \\ 0.6778736 \end{bmatrix}, \qquad \lambda_1 = 1.284028, \; \lambda_2 = 0.04908323$$

Since λ₁ > λ₂, the eigenvector that corresponds to the first principal component (PC1) is v₁ and the one that corresponds to the second component (PC2) is v₂.

The main steps behind PCA
4. Feature vector:
• Choose whether to keep all these components or discard those of lesser significance (low eigenvalues), and form a matrix of vectors with the remaining ones.
• In other words, select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
• Example:

$$\mathbf{v_1} = \begin{bmatrix} 0.6778736 \\ 0.7351785 \end{bmatrix}, \qquad \mathbf{v_2} = \begin{bmatrix} -0.7351785 \\ 0.6778736 \end{bmatrix}, \qquad \lambda_1 = 1.284028, \; \lambda_2 = 0.04908323$$

We can either form a feature vector with both of the eigenvectors or discard the eigenvector v₂, which is the one of lesser significance.

The main steps behind PCA
5. Recast the data along the principal components axes:
• Use the feature vector formed using the eigen vectors of the
covariance matrix, to reorient the data from the original axes to the
ones represented by the principal components.

Example for PCA
q Let the matrix be the score of three students:[3]
Student Math English Art
1 90 60 90
2 90 90 30
3 60 60 60
4 60 60 90
5 30 30 30

1. Calculate the mean values:

$$A = \begin{bmatrix} 90 & 60 & 90 \\ 90 & 90 & 30 \\ 60 & 60 & 60 \\ 60 & 60 & 90 \\ 30 & 30 & 30 \end{bmatrix} \text{ (score matrix)}, \qquad \bar{A} = \begin{bmatrix} 66 & 60 & 60 \end{bmatrix}$$
Example for PCA
2. Compute the covariance matrix of the whole dataset:

$$Cov(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$

$$\text{Covariance matrix} = \begin{bmatrix} 504 & 360 & 180 \\ 360 & 360 & 0 \\ 180 & 0 & 720 \end{bmatrix}$$

3. Compute eigenvectors and eigenvalues:
• Let A be a square matrix, v a vector, and λ a scalar that satisfies Av = λv. The eigenvalues are the solutions of the characteristic equation:

$$\det(A - \lambda I) = 0$$

Example for PCA
3. Compute eigenvectors and eigenvalues:

$$\det(A - \lambda I) = \det\begin{bmatrix} 504-\lambda & 360 & 180 \\ 360 & 360-\lambda & 0 \\ 180 & 0 & 720-\lambda \end{bmatrix}
= -\lambda^3 + 1584\lambda^2 - 641520\lambda + 25660800 = 0$$

$$\lambda_1 \approx 44.81966\ldots, \qquad \lambda_2 \approx 629.11036\ldots, \qquad \lambda_3 \approx 910.06995\ldots$$

$$\mathbf{v_1} = \begin{bmatrix} -3.75100\ldots \\ 4.28441\ldots \\ 1 \end{bmatrix}, \qquad
\mathbf{v_2} = \begin{bmatrix} -0.50494\ldots \\ -0.67548\ldots \\ 1 \end{bmatrix}, \qquad
\mathbf{v_3} = \begin{bmatrix} 1.05594\ldots \\ 0.69108\ldots \\ 1 \end{bmatrix}$$

Example for PCA
4. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d×k-dimensional matrix W.
• If we want to choose a 2-dimensional feature subspace (d = 3): since λ₃ > λ₂ > λ₁,

$$W = [\mathbf{v_3} \;\; \mathbf{v_2}] = \begin{bmatrix} 1.05594 & -0.50494 \\ 0.69108 & -0.67548 \\ 1 & 1 \end{bmatrix}$$

5. Transform the samples onto the new subspace:
• Use the 3×2-dimensional matrix W; for each data point x = [x₁, x₂, …, x_d], x ∈ ℝᵈ:

$$\mathbf{y} = \mathbf{W}^T\mathbf{x}$$

So far, we have computed the two principal components and projected the data points onto the new subspace.
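A minimal NumPy sketch of steps 1-5 on the score matrix above (the data is centered before projecting, and np.linalg.eigh returns unit-length eigenvectors, so they differ from the vectors above only by scaling):

```python
import numpy as np

# Student scores: columns = (Math, English, Art)
A = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

A_centered = A - A.mean(axis=0)              # subtract the column means (66, 60, 60)
cov = A_centered.T @ A_centered / len(A)     # covariance with a 1/n denominator
print(cov)                                   # [[504, 360, 180], [360, 360, 0], [180, 0, 720]]

eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: eigenvalues sorted ascending
W = eigvecs[:, [-1, -2]]                     # two eigenvectors with the largest eigenvalues
Y = A_centered @ W                           # samples projected onto the 2-D subspace
print(eigvals)                               # approx. [44.8, 629.1, 910.1]
print(Y)
```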

Total and explained variance
q Since we want to reduce the dimensionality of the dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors that contains most of the information (variance).

q The eigenvalues define the magnitude of the eigenvectors, so we sort the eigenvalues by decreasing magnitude.

q Explained variance ratio: [4]
Ø The fraction of the variance explained by each of the principal components (eigenvectors):

$$\text{Explained variance ratio} = \frac{\lambda_j}{\sum_{j=1}^{d}\lambda_j}$$

Total and explained variance
q Calculate the cumulative sum of explained variances.
Ø Dataset: Wine dataset.

(The leading components, i.e., those with the largest eigenvalues, carry the most information.)
Feature transformation
q Use the projection matrix to map the data onto the PCA subspace:

$$\mathbf{X}' = \mathbf{XW}$$

Ø The original X is 124×13-dimensional (the Wine training set), which is hard to visualize; projecting each example x' = xW yields 2 dimensions.

q Plot the PCA projection:
Ø Dataset: Wine dataset.
Ø x' = xW (2-dimensional).
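A minimal scikit-learn sketch of this pipeline on the Wine dataset (standardize, project onto 2 principal components, then fit a logistic regression):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

print(pca.explained_variance_ratio_)           # variance captured by PC1 and PC2

clf = LogisticRegression().fit(X_train_pca, y_train)
print("test accuracy:", clf.score(X_test_pca, y_test))
```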

Feature transformation
q We can plot the decision region with two-dimensional
data:
Ø Dataset: Wine dataset.
Ø Classifier: Logistic regression.

training dataset test dataset

Outline
q Eigen values and Eigen vectors
q Unsupervised dimensionality reduction via PCA
q Supervised data compression via LDA
Ø PCA versus LDA
Ø The inner workings of LDA
Ø Example for LDA
Ø Wine dataset with LDA
Ø Selecting linear discriminants
Ø Projecting examples onto new feature space
q Using KPCA for nonlinear mappings

Supervised data compression via LDA
q LDA, which is linear discriminant analysis, can be used as
a technique for feature extraction to increase the
computational efficiency and reduce the degree of
overfitting due to the curse of dimensionality in non-
regularized models.

q The goal in LDA is to find the feature subspace that


optimizes class separability.

PCA versus LDA
q PCA versus LDA :
Ø Similar
• Both of them are linear transformation techniques
(reduce the number of dimensionality)
Ø Different
• PCA: unsupervised algorithm
• LDA: supervised algorithm

q In general, LDA is a superior feature extraction technique for classification tasks compared to PCA. But in certain cases, for example when each class consists of only a small number of training examples, preprocessing via PCA tends to result in better classification. [5]

PCA versus LDA
q The concept of LDA for two-class problems:

q A linear discriminant (LD1) would separate the two normally distributed classes well.

q LD2 (the y-axis) captures a lot of the variance in the dataset, but it would fail as a good linear discriminant since it doesn't capture any of the class-discriminatory information.

PCA versus LDA
q LDA assumption:
1. The data is normally distributed.
2. The classes have identical covariance matrices.
3. The training examples are statistically independent of each
other.

q However, even if one, or more, of those assumptions is


(slightly) violated, LDA for dimensionality reduction can
still work reasonably well.[6]

The inner workings of LDA (Example)
q The main steps required to perform LDA:
1. Standardize the d-dimensional dataset (d is the number of features).
2. For each class, compute the d-dimensional mean vector.
• Take the label information into account (supervised learning):

$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i} \mathbf{x}, \qquad i \in \text{class labels}$$

3. Construct the between-class scatter matrix, S_B, and the within-class scatter matrix, S_W.
• Compute the between-class scatter matrix, S_B:

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$

where n_i is the number of examples in class i and m is the overall mean of the data.
The inner workings of LDA (Example)
3. Construct the between-class scatter matrix, S_B, and the within-class scatter matrix, S_W. [7]
• S_W is calculated by summing up the individual scatter matrices S_i:

$$\Sigma_i = \frac{1}{n_i} S_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T, \qquad
\mathbf{S}_W = \sum_{i=1}^{c} S_i, \quad c \in \text{class labels}$$

Here we suppose the class labels in the training dataset are uniformly distributed.

LDA aims to maximize the between-class scatter matrix and minimize the within-class scatter matrix.

The inner workings of LDA (Example)
4. Compute the eigenvectors and corresponding eigenvalues of the matrix S_W⁻¹S_B.
• S_W⁻¹S_B is derived from Fisher's criterion.
5. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
6. Choose the k eigenvectors that correspond to the k largest eigenvalues to construct a d×k-dimensional transformation matrix W; the eigenvectors are the columns of this matrix.
7. Project the examples onto the new feature subspace using the transformation matrix W.

Steps 4 to 7 of LDA are similar to PCA.

Example for LDA
q Example for LDA: [8]
Ø Classify man or woman by height and weight (d = 2), using the multivariate normal distribution:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-0.5}\exp\!\left(-0.5\,(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$

$$\mathbf{x} = \begin{bmatrix} x_{\text{height}} \\ x_{\text{weight}} \end{bmatrix}$$

$$\boldsymbol{\mu}_{\text{man}} = \begin{bmatrix} \mu_{\text{height}} \\ \mu_{\text{weight}} \end{bmatrix}_{\text{man}}, \qquad
\boldsymbol{\Sigma}_{\text{man}} = \begin{bmatrix} Cov(h, h) & Cov(w, h) \\ Cov(h, w) & Cov(w, w) \end{bmatrix}_{\text{man}}$$

$$\boldsymbol{\mu}_{\text{woman}} = \begin{bmatrix} \mu_{\text{height}} \\ \mu_{\text{weight}} \end{bmatrix}_{\text{woman}}, \qquad
\boldsymbol{\Sigma}_{\text{woman}} = \begin{bmatrix} Cov(h, h) & Cov(w, h) \\ Cov(h, w) & Cov(w, w) \end{bmatrix}_{\text{woman}}$$

(h = height, w = weight)

Example for LDA
q The likelihood functions:
Ø Man:

$$p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{man}}, \boldsymbol{\Sigma}_{\text{man}})
= \frac{1}{2\pi}\,|\boldsymbol{\Sigma}_{\text{man}}|^{-0.5}\exp\!\left(-0.5\,(\mathbf{x} - \boldsymbol{\mu}_{\text{man}})^T\Sigma_{\text{man}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{man}})\right)$$

Ø Woman:

$$p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{woman}}, \boldsymbol{\Sigma}_{\text{woman}})
= \frac{1}{2\pi}\,|\boldsymbol{\Sigma}_{\text{woman}}|^{-0.5}\exp\!\left(-0.5\,(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}})^T\Sigma_{\text{woman}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}})\right)$$

Example for LDA
q Maximum a posteriori (MAP):

$$\mathbf{w}_{MAP} = \arg\max_{c \in \{\text{man}, \text{woman}\}} p(\mathbf{w}_c)\,p(\mathbf{x} \mid \mathbf{w}_c)
= \arg\max_{c \in \{\text{man}, \text{woman}\}} \ln p(\mathbf{x} \mid \mathbf{w}_c)$$

(take the natural logarithm and ignore the prior p(w_c), assumed equal for both classes)

q Result:

$$\ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{man}}, \boldsymbol{\Sigma}_{\text{man}})
= -\ln(2\pi) - 0.5\ln|\boldsymbol{\Sigma}_{\text{man}}| - 0.5\,(\mathbf{x} - \boldsymbol{\mu}_{\text{man}})^T\Sigma_{\text{man}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{man}})$$

$$\ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{woman}}, \boldsymbol{\Sigma}_{\text{woman}})
= -\ln(2\pi) - 0.5\ln|\boldsymbol{\Sigma}_{\text{woman}}| - 0.5\,(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}})^T\Sigma_{\text{woman}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}})$$

Example for LDA
q Classification:

$$\ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{man}}, \boldsymbol{\Sigma}_{\text{man}}) - \ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{woman}}, \boldsymbol{\Sigma}_{\text{woman}}) > 0 \;\rightarrow\; \text{man}$$
$$\ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{man}}, \boldsymbol{\Sigma}_{\text{man}}) - \ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{woman}}, \boldsymbol{\Sigma}_{\text{woman}}) < 0 \;\rightarrow\; \text{woman}$$

where

$$\ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{man}}, \boldsymbol{\Sigma}_{\text{man}}) - \ln p(\mathbf{x} \mid \boldsymbol{\mu}_{\text{woman}}, \boldsymbol{\Sigma}_{\text{woman}})
= 0.5\left(\ln|\boldsymbol{\Sigma}_{\text{woman}}| - \ln|\boldsymbol{\Sigma}_{\text{man}}|\right)
+ 0.5\left[(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}})^T\Sigma_{\text{woman}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{woman}}) - (\mathbf{x} - \boldsymbol{\mu}_{\text{man}})^T\Sigma_{\text{man}}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{\text{man}})\right]$$

Wine dataset with LDA
q Compute the d-dimensional mean vector for each class:
Ø Dataset: Wine dataset (3 classes).
Ø m_i is the mean vector of class i; d = 13.

$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{x}
= \begin{bmatrix} \mu_{i,\text{alcohol}} \\ \mu_{i,\text{malic acid}} \\ \vdots \\ \mu_{i,\text{proline}} \end{bmatrix}, \qquad i \in \{1, 2, 3\}$$

Wine dataset with LDA
q Construct the within-class and between-class scatter matrices.

q Within-class scatter matrix, S_W:

$$\mathbf{S}_W = \sum_{i=1}^{c} S_i, \qquad S_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$

q Between-class scatter matrix, S_B:

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$

Wine dataset with LDA
q Solve the generalized eigenvalue problem of the matrix S_W⁻¹S_B.
Ø This yields 13 eigenvalues (since the Wine dataset has 13 features).

Selecting linear discriminants
q Measure the class-discriminatory information (discriminability) captured by each eigenvalue:
Ø In the Wine example, the first two linear discriminants capture almost all of it; the others approach 0.

q Create the transformation matrix W from the two most discriminative eigenvector columns:

Projecting examples onto new feature space
q Use the following equation to project the examples onto the new feature space:
Ø Example: a two-dimensional feature subspace.

$$\mathbf{X}' = \mathbf{XW}$$
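A minimal scikit-learn sketch of the same projection using LinearDiscriminantAnalysis on the Wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)

# LDA is supervised: it uses y_train to find the discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(scaler.transform(X_train), y_train)
X_test_lda = lda.transform(scaler.transform(X_test))

clf = LogisticRegression().fit(X_train_lda, y_train)
print("test accuracy:", clf.score(X_test_lda, y_test))
```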

LDA result
q We can plot the decision region with two-dimensional
data:
Ø Dataset: Wine dataset.
Ø Classifier: Logistic regression.

training dataset test dataset

Outline
q Eigen values and Eigen vectors
q Unsupervised dimensionality reduction via PCA
q Supervised data compression via LDA
q Using KPCA for nonlinear mappings
Ø Kernel functions and the kernel trick
Ø KPCA example

Using KPCA for nonlinear mappings
q KPCA, which is kernel principal component analysis, is a better choice when the data requires a nonlinear transformation, i.e., when it is not linearly separable.

q KPCA is related to the concepts of kernel SVM.

Using KPCA for nonlinear mappings
q Intuition behind KPCA: [9]
Ø The idea of KPCA relies on the intuition that many datasets which are not linearly separable in their original space can be made linearly separable by projecting them onto a higher-dimensional space.
Ø The added dimensions are just simple arithmetic operations performed on the original data dimensions.

2-d dataset.

Using KPCA for nonlinear mappings
q Intuition behind KPCA: linearly separable after the mapping φ

$$\mathbf{x} = [x_1, x_2]^T \;\xrightarrow{\phi}\; \mathbf{z} = [x_1, x_2, x_1^2 + x_2^2]^T$$

Project the data from a lower-dimensional (2D) space to a higher-dimensional (3D) space.
Kernel functions and the kernel trick
q We can tackle nonlinear problems by projecting them onto a new feature space of higher dimensionality where the classes become linearly separable:

$$\phi : \mathbb{R}^d \rightarrow \mathbb{R}^k \quad (k \gg d)$$

q For example:

$$\mathbf{x} = [x_1, x_2]^T \;\xrightarrow{\phi}\; \mathbf{z} = [x_1^2, \sqrt{2}\,x_1x_2, x_2^2]^T$$

Kernel functions and the kernel trick
q One downside of KPCA is that it’s computationally very
expensive, and this is where we use the kernel trick.

q Using the kernel trick, we can compute the similarity


between two high-dimension feature vectors in the
original feature space.

Kernel functions and the kernel trick
q Review: the covariance matrix definition (Σ):

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}\mathbf{x}^{(i)T}$$

q Covariance between two features, j and k:

$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$$

$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n}x_j^{(i)}x_k^{(i)} \quad \text{after standardization } (\sim N(0, 1))$$

Kernel functions and the kernel trick
q Following Bernhard Schölkopf [10], we can replace the dot products between examples in the original feature space with nonlinear feature combinations via φ:
Ø x^(i) → φ(x^(i))

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}\mathbf{x}^{(i)T}
\;\;\approx\;\; \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}^{(i)})\,\phi(\mathbf{x}^{(i)})^T$$

Kernel functions and the kernel trick
q We use the kernel trick to avoid calculating the pairwise dot products of the examples x under φ explicitly, by using a kernel function κ:

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})$$

q In other words, with the kernel trick we can omit the explicit high-dimensional mapping, and what KPCA returns are the examples already projected onto the respective components.

Kernel functions and the kernel trick
q The most commonly used kernels are as follows:
• The polynomial kernel:

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \left(\mathbf{x}^{(i)T}\mathbf{x}^{(j)} + \theta\right)^p, \qquad \theta: \text{threshold}, \; p: \text{power}$$

• The hyperbolic tangent (sigmoid) kernel:

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \tanh\!\left(\eta\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)} + \theta\right)$$

• The radial basis function (RBF) or Gaussian kernel:

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \exp\!\left(-\frac{\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|^2}{2\sigma^2}\right)
= \exp\!\left(-\gamma\,\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|^2\right), \qquad \gamma = \frac{1}{2\sigma^2}$$

Kernel functions and the kernel trick
q We define three steps to implement an RBF KPCA:
1. Compute the kernel (similarity) matrix K, where we need to calculate the following for every pair of examples:

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \exp\!\left(-\gamma\,\|\mathbf{x}^{(i)} - \mathbf{x}^{(j)}\|^2\right)$$

$$K = \begin{bmatrix}
\kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) & \cdots & \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(n)}) \\
\kappa(\mathbf{x}^{(2)}, \mathbf{x}^{(1)}) & \kappa(\mathbf{x}^{(2)}, \mathbf{x}^{(2)}) & \cdots & \kappa(\mathbf{x}^{(2)}, \mathbf{x}^{(n)}) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(\mathbf{x}^{(n)}, \mathbf{x}^{(1)}) & \kappa(\mathbf{x}^{(n)}, \mathbf{x}^{(2)}) & \cdots & \kappa(\mathbf{x}^{(n)}, \mathbf{x}^{(n)})
\end{bmatrix}$$

q For example, if the dataset contains 100 training examples, the symmetric kernel matrix of the pairwise similarities would be 100×100-dimensional.

Kernel functions and the kernel trick
q We define three steps to implement an RBF KPCA:
2. Center the kernel matrix K using the following equation:

$$K' = K - \mathbf{1}_nK - K\mathbf{1}_n + \mathbf{1}_nK\mathbf{1}_n$$

where 1_n is an n×n-dimensional matrix in which all values are equal to 1/n.

3. Collect the top k eigenvectors of the centered kernel matrix based on their corresponding eigenvalues, which are ranked by decreasing magnitude.

q The centering of the kernel matrix in the second step is necessary because we don't compute the new feature space explicitly, so we cannot guarantee that the new feature space is centered at zero.
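A minimal NumPy sketch of the three steps above, following the structure of the rbf_kernel_pca implementation in [2]; scikit-learn's KernelPCA(kernel='rbf') offers the same functionality:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_moons

def rbf_kernel_pca(X, gamma, n_components):
    """Project X onto the top components of a centered RBF kernel matrix."""
    # Step 1: pairwise squared Euclidean distances -> RBF kernel matrix
    sq_dists = squareform(pdist(X, metric='sqeuclidean'))
    K = np.exp(-gamma * sq_dists)

    # Step 2: center the kernel matrix
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Step 3: eigendecomposition (eigh returns eigenvalues in ascending order)
    eigvals, eigvecs = np.linalg.eigh(K)
    # The projected examples are the leading eigenvectors of the centered kernel matrix
    return eigvecs[:, ::-1][:, :n_components]

# Usage on a toy nonlinear dataset
X, y = make_moons(n_samples=100, random_state=123)
X_kpca = rbf_kernel_pca(X, gamma=15.0, n_components=2)
print(X_kpca.shape)   # (100, 2)
```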

KPCA example
q Example 1 – separating half-moon shapes:

Using standard PCA → the projected classes are not linearly separable.

KPCA example
q Example 1 – separating half-moon shapes:
Ø Use the kernel PCA function (KPCA):

Using KPCA → the projected classes become linearly separable.

KPCA example
q Example 2 – separating concentric circles:

Using standard PCA → the projected classes are not linearly separable.


KPCA example
q Example 2 – separating concentric circles:
Ø Use the kernel PCA function (KPCA):

Using KPCA → the projected classes become linearly separable.

REFERENCE
[1] https://builtin.com/data-science/step-step-explanation-principal-component-analysis
[2] https://setosa.io/ev/principal-component-analysis/
[3] https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643
[4] https://vitalflux.com/pca-explained-variance-concept-python-example/
[5] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
[6] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, New York, 2001.
[7] S. K. Bhattacharyya and K. Rahul, "Face Recognition by Linear Discriminant Analysis," International Journal of Communication Network Security, 2013.
[8] https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-lda%E5%88%86%E9%A1%9E%E6%BC%94%E7%AE%97%E6%B3%95-14622f29e4dc
[9] https://iq.opengenus.org/kernal-principal-component-analysis/
[10] B. Schölkopf, A. Smola, and K.-R. Müller, "Kernel principal component analysis," pp. 583-588, 1997.

Financial Data Structure

Department of Electrical Engineering


National Tsing-Hua University, HsinChu, Taiwan
Outline
q Essential Types Of Financial Data
q Bars

Outline
q Essential Types Of Financial Data
Ø Fundamental Data
Ø Market Data
Ø Analytics
Ø Alternative Data
q Bars

Fundamental Data
q Fundamental data
Ø Mostly accounting data, reported quarterly or monthly.
Ø For example:
• Assets
• Liabilities
• Sales
• Costs/earnings

Market Data
q Market Data
Ø Market data includes all trading activity.
Ø For example:
• Price / yield / implied volatility
• Volume
• Dividends/coupons
• Aggressor side

Analytics
q Analytics
Ø Derivative data, based on an original source, which could be fundamental, market, alternative, or even a collection of other analytics.
Ø For example:
• Analyst recommendations
• Credit ratings
• Earnings expectations

Alternative Data
q Alternative Data
Ø The data used to obtain insight into the investment process.
Ø For example:
• Satellite/CCTV images
• Google searches
• Twitter/chats
• Changes in TSMC's water consumption
• Changes in how crowded Walmart parking lots are

Alternative Data

Outline
q Essential Types Of Financial Data
q Bars
Ø Standard Bars
Ø Information Driven Bars

Bars
q Candlestick charts:
Ø Based on four prices formed during a day (or another fixed period):
• Open price
• Highest price
• Lowest price
• Close price
(A candlestick is bearish when the close is below the open, and bullish when the close is above the open.)

Bars
q Volume-weighted Average Price (VWAP)

$$\text{VWAP} = \frac{\sum(\text{Volume} \times \text{Price})}{\sum \text{Volume}}$$
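A minimal pandas sketch, assuming hypothetical tick data with price and volume columns:

```python
import pandas as pd

# Hypothetical tick data: one row per trade
ticks = pd.DataFrame({'price':  [100.0, 101.0, 100.5, 102.0],
                      'volume': [200,   100,   300,   150]})

vwap = (ticks['price'] * ticks['volume']).sum() / ticks['volume'].sum()
print(round(vwap, 4))
```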

Standard Bars
q Standard bars:
Ø Transform a series of observations that arrive at an irregular frequency into a homogeneous series derived from regular sampling.
Ø For example:
• Time bars
• Tick bars
• Volume bars
• Dollar bars

Standard Bars
q Time bars:
Ø Sampling information at fixed time intervals.
Ø For example: sampling every 15 minutes

Standard Bars
q Example of time bar:

Standard Bars
q Number of ticks when grouped by time:

Standard Bars
q Order book:
Ø The list of orders that a trading venue uses to record the interest of buyers and sellers in a particular financial instrument.
q The primary source of market data is the order book.

• tickDirection: the price direction of each trade
• minusTick: the price moved down
• zeroMinusTick: the price is unchanged, but the previous tick moved down
• plusTick: the price moved up
• zeroPlusTick: the price is unchanged, but the previous tick moved up

Standard Bars
q Tick:
Ø A single transaction record in a financial product.

Standard Bars
q From tick to bar:

Standard Bars: Tick Bars
q Tick bars:
Ø The sampled variables listed earlier are extracted each time a pre-defined number of transactions takes place.
Ø For example:
• Ticks per bar = total ticks / number of bars
• 6 / 3 = 2

Standard Bars
q Example of tick bar:

Standard Bars
q Number of ticks when grouped by tick:

Standard Bars: Volume Bars
q Volume bars:
Ø Sample every time a pre-defined amount of the security's units has been traded.
Ø For example:
• Volume per bar = total volume / number of bars
• 1050 / 3 = 350

Standard Bars
q Example of volume bar:

Standard Bars
q Number of ticks when grouped by volume:

Standard Bars: Dollar Bars
q Dollar bars:
Ø Sample an observation every time a pre-defined market value is exchanged (a sketch covering tick, volume, and dollar bars follows below).
Ø For example:
• Dollars per bar = total amount / number of bars
• 13500 / 3 = 4500
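A minimal pandas sketch of grouping raw ticks into tick, volume, or dollar bars; the column names and threshold logic are illustrative assumptions, not a library API:

```python
import numpy as np
import pandas as pd

def standard_bars(ticks, kind='tick', threshold=100):
    """Group raw ticks into tick / volume / dollar bars (hypothetical column names)."""
    if kind == 'tick':
        metric = pd.Series(1, index=ticks.index)
    elif kind == 'volume':
        metric = ticks['volume']
    else:  # 'dollar'
        metric = ticks['price'] * ticks['volume']

    # Start a new bar every time the cumulative metric crosses a multiple of the threshold
    bar_id = (metric.cumsum() // threshold).astype(int)
    ohlcv = ticks.groupby(bar_id).agg(open=('price', 'first'),
                                      high=('price', 'max'),
                                      low=('price', 'min'),
                                      close=('price', 'last'),
                                      volume=('volume', 'sum'))
    return ohlcv

# Usage with random hypothetical tick data
rng = np.random.default_rng(0)
ticks = pd.DataFrame({'price': 100 + rng.normal(0, 0.1, 1000).cumsum(),
                      'volume': rng.integers(1, 50, 1000)})
print(standard_bars(ticks, kind='dollar', threshold=250_000).head())
```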

Standard Bars
q Example of dollar bar:

Standard Bars
q Number of ticks when grouped by dollar:

Information-Driven Bars
q Information-driven bars:
Ø The purpose of information-driven bars is to sample more frequently when new information arrives to the market.
Ø They detect the imbalance of trading in the market.
q Tick imbalance bars (TIB):
Ø Sample a bar whenever the tick imbalance exceeds our expectations.

$$b_t = \begin{cases} b_{t-1} & \text{if } \Delta p_t = 0 \\ \dfrac{\Delta p_t}{|\Delta p_t|} & \text{if } \Delta p_t \neq 0 \end{cases}
\qquad p_t: \text{the price associated with tick } t$$

Information-Driven Bars

Time t   Price   ΔPrice   b    θ
0        100     –        1    –
1        110     10       1    1
2        100     -10      -1   0
3        100     0        -1   -1
4        110     10       1    0
5        120     10       1    1

Ø Define the tick imbalance at time T as:
$$\theta_T = \sum_{t=1}^{T} b_t$$
Ø The expected value of θ_T at the beginning of the bar is:
$$E_0[\theta_T] = E_0[T]\,(P[b_t = 1] - P[b_t = -1]) = E_0[T]\,(2P[b_t = 1] - 1)$$
$$T^* = \arg\min_T \{\, |\theta_T| \geq E_0[T]\,|2P[b_t = 1] - 1| \,\}$$
Ø When θ_T is more imbalanced than expected, a low T will satisfy these conditions.

q Example of tick imbalance bar:

Information-Driven Bars

Information-Driven Bars
q Volume/Dollar Imbalance Bars
Ø Define the imbalance of volumes or dollars as:

$$\theta_T = \sum_{t=1}^{T} b_t v_t$$

where v_t represents either the number of securities traded (volume imbalance bars) or the dollar amount exchanged (dollar imbalance bars).

$$E_0[\theta_T] = E_0[T]\left(P[b_t = 1]\,E_0[v_t \mid b_t = 1] - P[b_t = -1]\,E_0[v_t \mid b_t = -1]\right)$$

$$T^* = \arg\min_T \{\, |\theta_T| \geq E_0[T]\left|2P[b_t = 1]\,E_0[v_t \mid b_t = 1] - E_0[v_t]\right| \,\}$$

Information-Driven Bars

Information-Driven Bars
q Example of volume imbalance bar:

Information-Driven Bars
q Example of dollar imbalance bar:

Information-Driven Bars
q Tick Runs Bars
Ø To monitor the sequence of buys in the overall volume, and take samples when that sequence diverges from our expectations.

Ø Define the length of the current run as:

$$\theta_T = \max\left\{\sum_{t \mid b_t = 1} b_t,\; -\sum_{t \mid b_t = -1} b_t\right\}$$

$$E_0[\theta_T] = E_0[T]\,\max\{P[b_t = 1],\, 1 - P[b_t = 1]\}$$
$$T^* = \arg\min_T \{\, \theta_T \geq E_0[T]\,\max\{P[b_t = 1],\, 1 - P[b_t = 1]\} \,\}$$

Ø P[b_t = 1] is estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars.

Information-Driven Bars
q Example of tick run bar:

Information-Driven Bars
q Volume/Dollar Runs Bars
Ø Define the length of the current run as:

$$\theta_T = \max\left\{\sum_{t \mid b_t = 1} b_t v_t,\; -\sum_{t \mid b_t = -1} b_t v_t\right\}$$

$$E_0[\theta_T] = E_0[T]\,\max\{P[b_t = 1]\,E_0[v_t \mid b_t = 1],\; (1 - P[b_t = 1])\,E_0[v_t \mid b_t = -1]\}$$

$$T^* = \arg\min_T \{\, \theta_T \geq E_0[T]\,\max\{P[b_t = 1]\,E_0[v_t \mid b_t = 1],\; (1 - P[b_t = 1])\,E_0[v_t \mid b_t = -1]\} \,\}$$

Information-Driven Bars
q Example of volume run bar:

Information-Driven Bars
q Example of dollar run bar:


Labeling

Department of Electrical Engineering


National Tsing-Hua University, HsinChu, Taiwan
Outline
q The Fixed-time Horizon Method
q The Triple-barrier Method
q Meta-labeling

Outline
q The Fixed-time Horizon Method
q The Triple-barrier Method
q Meta-labeling

The Fixed-time Horizon Method
q In order to train a supervised machine learning model, labeling is necessary.
q Consider a features matrix X with I rows, {X_i}_{i=1,…,I}, drawn from some bars with index t = 1, …, T, where I ≤ T.
q An observation X_i is assigned a label y_i ∈ {−1, 0, 1}:

$$y_i = \begin{cases} -1 & \text{if } r_{t_{i,0},\,t_{i,0}+h} < -\tau \\ 0 & \text{if } |r_{t_{i,0},\,t_{i,0}+h}| \leq \tau \\ 1 & \text{if } r_{t_{i,0},\,t_{i,0}+h} > \tau \end{cases}$$

where τ is a pre-defined constant threshold, t_{i,0} is the index of the bar immediately after X_i takes place, t_{i,0} + h is the index of the h-th bar after t_{i,0}, and r_{t_{i,0}, t_{i,0}+h} is the price return over a bar horizon h:

$$r_{t_{i,0},\,t_{i,0}+h} = \frac{p_{t_{i,0}+h}}{p_{t_{i,0}}} - 1$$
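A minimal pandas sketch of fixed-time horizon labeling, assuming a close-price series sampled at a fixed bar frequency; the function and parameter names are illustrative:

```python
import numpy as np
import pandas as pd

def fixed_horizon_labels(close, h=10, tau=0.02):
    """Label each bar by its forward return over h bars against a constant threshold tau."""
    ret = close.shift(-h) / close - 1           # r_{t, t+h}
    labels = pd.Series(0, index=close.index)
    labels[ret > tau] = 1
    labels[ret < -tau] = -1
    return labels.iloc[:-h]                     # the last h bars have no forward return

# Usage on a hypothetical price series
rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, 500).cumsum()))
print(fixed_horizon_labels(close, h=10, tau=0.02).value_counts())
```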

q For example: a price path whose return falls below the −τ threshold over the horizon is labeled −1.

q Disadvantages of the fixed-time horizon method:
Ø Time bars usually do not exhibit good statistical properties.
Ø The threshold τ is constant, even though volatility changes over time.

Outline
q The Fixed-time Horizon Method
q The Triple-barrier Method
q Meta-labeling

The Triple-barrier Method
q There are two horizontal barriers and one vertical barrier in the triple-barrier method.
q The two horizontal barriers are defined by profit-taking and stop-loss limits.
q The vertical barrier is the time limit.
q The horizontal barriers are dynamic and are set from the volatility:
Ø Estimated with an exponentially weighted moving average (a labeling sketch follows below).
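A simplified sketch of triple-barrier labeling, assuming a close-price series; it omits the side, concurrency, and sample-weighting machinery of the full implementation in [1], and all names are illustrative:

```python
import numpy as np
import pandas as pd

def triple_barrier_labels(close, events, pt=1.0, sl=1.0, horizon=20, span=50):
    """Simplified sketch: label each event bar by which barrier its path touches first."""
    # Dynamic barrier width from an EWMA estimate of return volatility
    vol = close.pct_change().ewm(span=span).std()

    labels = {}
    for t in events:
        sigma = vol.loc[t]
        path = close.loc[t:].iloc[1:horizon + 1] / close.loc[t] - 1.0   # returns after t
        hit_up = path[path >= pt * sigma].index.min()                   # profit-taking barrier
        hit_dn = path[path <= -sl * sigma].index.min()                  # stop-loss barrier
        if pd.isna(hit_up) and pd.isna(hit_dn):
            labels[t] = 0                          # vertical (time) barrier touched first
        elif pd.isna(hit_dn) or (not pd.isna(hit_up) and hit_up < hit_dn):
            labels[t] = 1
        else:
            labels[t] = -1
    return pd.Series(labels)

# Usage on a hypothetical price series, labeling every 20th bar
rng = np.random.default_rng(2)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, 1000).cumsum()))
print(triple_barrier_labels(close, events=close.index[::20][:-2]).value_counts())
```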

The Triple-barrier Method
q For example: a path that hits the lower (stop-loss) barrier first is labeled y_i = −1, while a path that reaches the vertical (time) barrier without touching either horizontal barrier is labeled y_i = 0.

Outline
q The Fixed-time Horizon Method
q The Triple-barrier Method
q Meta-labeling

Meta-labeling
q Confusion matrix metrics:

$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Meta-labeling
q Meta-labeling: separate the problem into side and size.
Ø Side: the predicted direction of the bet.
Ø Size: how much money we should risk in the bet.

q Use two machine learning models to get better performance.
Ø First model, for the side:
• A binary classification model that decides the direction.
• Tune it so that recall is as high as possible.
Ø Second model, for the size:
• A multi-class classification model that decides the appropriate size we should invest in the bet.
• Tune it so that precision is as high as possible.

