Comp2712 l05 ML Feature
Dr Trent Lewis
College of Science and Engineering
Feature Normalisation
• Z-score: zero mean and unit standard deviation
$$z = \frac{x - \mu}{\sigma}$$
• Min-max: rescales each feature to the range [0, 1]
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
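The two normalisation formulas can be tried out on a small array. The following is a minimal sketch using NumPy; the toy data and the variable names (`X`, `X_z`, `X_norm`) are assumptions made for illustration, not part of the lecture material.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are two features
# on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Z-score: subtract the per-feature mean, divide by the per-feature
# standard deviation (population standard deviation, ddof=0).
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_z = (X - mu) / sigma

# Min-max: rescale each feature to the range [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_z.mean(axis=0), X_z.std(axis=0))
print(X_norm)
```

Both formulas divide by a quantity that is zero for a constant feature, so a real pipeline would need to handle that case separately.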
Feature Reduction
• Curse of dimensionality leads to overfitting
• Correlations between features lead to misleading results
• Principal Component Analysis (PCA) to the rescue
• Reduces the original data features to uncorrelated principal
components
• Each component represents a different set of correlated
features and accounts for a different amount of the variation.
• “Retain the components that account for 90% of the variation”
• How many components that is depends on the data
• Could reduce from 100s to 10s of features
https://colab.research.google.com/drive/1rOL7B6PGb-bovZ7z26K0daqTCzErZJpX?usp=sharing
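The “retain 90% of the variation” rule can be sketched in a few lines. This is a minimal illustration that computes PCA with a NumPy SVD rather than a library PCA class; it is not the code from the linked notebook, and the synthetic data and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 10 correlated features that are driven
# by only 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Centre the data; PCA works on deviations from the per-feature mean.
X_centred = X - X.mean(axis=0)

# SVD of the centred data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# The variance explained by each component is proportional to its
# squared singular value.
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)

# Project the data onto the first k principal components.
X_reduced = X_centred @ Vt[:k].T

print(k, X_reduced.shape)
```

The projected columns are uncorrelated by construction, which is the property the bullets above rely on. The 0.90 threshold is the figure quoted on the slide and would normally be tuned per data set.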
Output Transformation
• Categorical variables are often called nominal
• Some examples include:
• A “pet” variable with the values: “dog” and “cat”.
• A “color” variable with the values: “red”, “green” and “blue”.
• A “place” variable with the values: “first”, “second” and “third”.
• A “passing grade” variable with the values: “fail” and “pass”.
• An “iris variety” variable with the values: “Iris-setosa”, “Iris-versicolor” and “Iris-virginica”.
• Each value represents a different category
• Classifiers (like MLP) need numbers!
Output Transformation
• Two Solutions
• Integer Encoding
• each unique category value is assigned an integer value.
• for example, “red” is 1, “green” is 2, and “blue” is 3.
• okay for ordinal (order matters), but not for true nominal
• One-Hot Encoding
• a new binary variable is added for each unique integer value.
• In the “color” variable example, there are 3 categories and therefore 3
binary variables are needed. A “1” value is placed in the binary variable
for that color and “0” values for the other colors.
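import pandas as pd
import tensorflow as tf
# one hot encoding with pandas (df is a DataFrame with a 'class' column)
y_oh = pd.get_dummies(df['class']).values
# one hot encoding - much easier using keras! (train_labels are integer class indices)
train_labels_oh = tf.keras.utils.to_categorical(train_labels)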
Output Transformation
Example: four instances of the “color” variable

Instance  Class
1         “Red”
2         “Green”
3         “Blue”
4         “Green”

Integer Encoding

Instance  Class
1         1
2         2
3         3
4         2

One-Hot Encoding

Instance  Red  Green  Blue
1         1    0      0
2         0    1      0
3         0    0      1
4         0    1      0
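The four-instance colour example can be reproduced directly. The sketch below uses plain NumPy and assumes a fixed category order (Red, Green, Blue) so that the integer codes match the slide; the names `classes`, `categories`, `y_int` and `y_oh` are chosen here for illustration.

```python
import numpy as np

# The four instances from the example.
classes = np.array(["Red", "Green", "Blue", "Green"])

# Fix the category order explicitly so that Red=1, Green=2, Blue=3.
categories = ["Red", "Green", "Blue"]
to_int = {name: i + 1 for i, name in enumerate(categories)}

# Integer encoding: one integer per category.
y_int = np.array([to_int[c] for c in classes])

# One-hot encoding: one binary column per category, a single 1 per row.
# Row i of the identity matrix is the one-hot vector for category i.
y_oh = np.eye(len(categories), dtype=int)[y_int - 1]

print(y_int)  # integer codes, one per instance
print(y_oh)   # columns are Red, Green, Blue
```

Fixing the category order by hand keeps the columns in the order shown on the slide; library helpers may choose their own ordering, for example alphabetical.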
Output Transformation
• Integer Encoding
[Figure: network diagram with input, hidden and output layers]