You are on page 1of 34

EE2211 Introduction to

Machine Learning
Lecture 2

Wang Xinchao
xinchao@nus.edu.sg

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Data

4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Levels/Scales of Measurement
Highest

Ratio
NOIR Named + Ordered
+ Equal Interval +
Interval Has “True” Zero

Named + Ordered
+ Equal Interval
Ordinal

Named + Ordered
Nominal

Named
Lowest

6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Nominal Data
• Lowest Level of Measurement
• Discrete Categories
• NO natural order
• Estimating a mean, median, or standard deviation, would be
meaningless.
• Possible Measure: mode, frequency distribution
601
401
• Example:
A
B
Gender Occupation
T
mode

1:man 2:woman Doctor Police Teacher

7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ordinal Data discretedata

• Ordered Categories
• Relative Ranking
• Unknown “distance” between categories: orders matter but not the
difference between values
• Possible Measure: mode, frequency distribution + median

• Example:
– Evaluate the difficulty level of an exam
• 1: Very Easy, 2: Easy, 3: About Right, 4: Difficulty, 5: Very Difficulty
– Rate a movie in Google Review
controversialbecause 4.8
f

8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Interval Data continuousdata

• Ordered Categories
• Well-defined “unit” measurement:
– Distances between points on the scale are measurable
– Can measure differences!
• Equal Interval
• Zero is arbitrary (not absolute), in many cases human-defined
• Ratio is meaningless can't 20 C istwiceashotas100C
say
• Possible Measure: mode, frequency distribution + median + mean,
standard deviation, addition/subtraction
T
unit is
• Example: welldefined
– Temperature measured in Celsius
• 10 degrees C, 20 degrees C, 30 degrees C
– Year of someone’s birth
• 1990, 2000, 2010, 2020

9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ratio Data
• Most precise and highest level of measurement
• Ordered
• Equal Intervals
• Natural Zeros:
– When the variable equals zero, it means there is none of that variable
– Not arbitrary
• Possible Measure: mode, frequency distribution + median + mean, standard
deviation, addition/subtraction + multiplication and division (ratio)

• Example:
– Weights
• 10 KG, 20 KG, 30 KG

– Time
• 10 Seconds, 1 Hour, 1 Day

10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR

We can estimate Nominal Ordinal Interval Ratio

Frequency
Distribution
Yes Yes Yes Yes

Median No Yes Yes Yes

Add or subtract No No Yes Yes

Mean, standard
deviation
No No Yes Yes

Ratios No No No Yes

11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR
Yes.
Ratio

Yes.
Zero means
No.
none?
Interval
Yes.
Equally
split? No.
Ordinal

No.
Nominal

Ordered?

12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Quick Quiz
Yes.
Ratio
• Which level of measurement? Yes.

Nominal, Ordinal, Interval, Ratio Zero means


none?
No.
Interval

Yes.
Equally split? No.

1. Favorite Restaurant Ordinal

• Mcdonald’s, Burger King, Subway, KFC, Nominal


… No.
Nominal
2. Weight of luggage
Ordered?
• 10 KG, 20 KG, 25 KG, … pay
3. SAT Scores with ranges [400, 1600] Interval
4. Size of Packed Eggs in supermarkets
• Small, Medium, Large, Extra Large, … Ordinal
5. Military rank
• General, Major, Captain, … Ording
6. Number of people in a household
• 1, 2, 3, 4, 5, … Ratio
7. Credit Score in United States [300, 850] Interval

13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Measures can (often) be controversial
• Measure of Intelligence (IQ Test):
– Ordinal or Interval?

• Money in the back account:


– Ratio (0 means no money) or Interval (one can own money to a
bank)

14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Numerical or Categorical
Variable

Categorical Numerical
(qualitative) (quantitative)

Discrete Continuous
Nominal Ordinal (Whole numerical (Can take any value
(Unordered (Ordered values) within a rage)
categories) categories) Example: outcome of Example:
tossing a die temperature in day

Quantitativedataaremeasuresofvaluesorcountsandare Numerical
expressed asnumbers OR (quantitative)
Qualitativedataaredataaboutcategoricalvariables

Interval Ratio
(may compute (may compute
difference but no difference, real
absolute zero) zero exists)

16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data

• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative

• Other aspects
– Available or Missing Data

17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Missing Data
• Missing data: data that are missing and you do not know
the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
• Data wrangling
– The process of transforming and mapping data from one "raw" data
form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes
such as analytics.
– In short, transforms data to gain insight
– General process

Credit:https://en.wikipedia.org/wiki/Data_wrangling
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
Example:
customer data
stored in SQL;
Determine Make the most social network Remove invalid Ensure data Use data
your goal of dataset stored as graphs data correctness (training/test)

Unifying format Remove noisy


samples Checking that Use data to
to .png all images train your face
Women
have labels. detector!

Men

Collect Human Face Images for Face Detector


Credit:https://understandingdata.com/what-is-data-wrangling/
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Scaling to a range Data Cleaning
When the bounds or range of each independent dimension of
data•isThe process
known, a commonofnormalization
detecting technique
and correcting
removing)(or
is min-max.
corrupt
Feature or inaccurate records from a record set, table, or
clipping
database.
• Example:
– Clipping outliers

removethispart
G
https://developers.google.com/machine-learning/data-prep/transform/normalization
– Handling missing features
EE2211 Introduction to Machine Learning 17
S. All Rights Reserved.

22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features

1. Removing the examples with missing features from the


dataset
– Can be done if the dataset is big enough so we can sacrifice some
training examples

2. Using a learning algorithm that can deal with missing


feature values
– Example: random forest classifier

3. Using a data imputation technique

23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features: Imputation

• Replace the missing value of a feature by an average


value of this feature in the dataset:

• Highlight the missing value


– Replace the missing value with a value outside the normal range of
values. to indicatemissingvalue
– For example, if the normal range is [0, 1], then you can set the
missing value to 2 or −1.
– Enforce the learning algorithm will learn what is best to do when
the feature has a value significantly different from regular values.

24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Formatting Data

• Binary Coding to convert categories into binary form


– One-hot encoding: unify several entities within one vector
• Example: the color of a pixel can be red, yellow, or green
• Very common in classification tasks!

salary Co IBillion To L

• Normalization
– Linear Scaling:
scale each variable to [0 1]

mean
– Z-score standardization:
each independent dimension
of data is normally distributed

25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization

26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity
• Data integrity is the maintenance and the assurance of data accuracy
and consistency;
– A critical aspect to the design, implementation, and usage of any system that stores,
processes, or retrieves data.
– Very broad concept!
• Example:
• In a dataset, numeric columns/cells should not accept alphabetic data.
• A binary entry should only allow binary inputs

27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity programto
makesurethatdatainputisvalid

28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Visualization

Graphical
Representation
of data!

29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example: Showing Distribution
Visualization: Distribution Visualization: Bars

Distribution
Probability Mass Function

(a) a probability mass function, (b) a probability density function

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.

Probability Density Function

nction, (b) a probability density function


30
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Visualization: Boxplots

Maximum (100th percentile)

Third Quartile (75th percentile)

Medium (50th percentile)

First Quartile (25th percentile)

Minimum (0th percentile)

31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Why Visualization is Necessary

Four datasets with


identical means,
variances and
regression lines.

32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• Types of data
– NOIR falsedata
• Data wrangling and cleaning missingdata
I random
initiation
forest

mmmm
BinaryCoding

• Data integrity and visualization


– Integrity: Design
– Visualization: Graphical Representation

33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"

You might also like