Professional Documents
Culture Documents
Machine Learning
Lecture 2
Wang Xinchao
xinchao@nus.edu.sg
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Data
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data
• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative
• Other aspects
– Available or Missing Data
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Levels/Scales of Measurement
Highest
Ratio
NOIR Named + Ordered
+ Equal Interval +
Interval Has “True” Zero
Named + Ordered
+ Equal Interval
Ordinal
Named + Ordered
Nominal
Named
Lowest
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Nominal Data
• Lowest Level of Measurement
• Discrete Categories
• NO natural order
• Estimating a mean, median, or standard deviation, would be
meaningless.
• Possible Measure: mode, frequency distribution
601
401
• Example:
A
B
Gender Occupation
T
mode
7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ordinal Data discretedata
• Ordered Categories
• Relative Ranking
• Unknown “distance” between categories: orders matter but not the
difference between values
• Possible Measure: mode, frequency distribution + median
• Example:
– Evaluate the difficulty level of an exam
• 1: Very Easy, 2: Easy, 3: About Right, 4: Difficulty, 5: Very Difficulty
– Rate a movie in Google Review
controversialbecause 4.8
f
8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Interval Data continuousdata
• Ordered Categories
• Well-defined “unit” measurement:
– Distances between points on the scale are measurable
– Can measure differences!
• Equal Interval
• Zero is arbitrary (not absolute), in many cases human-defined
• Ratio is meaningless can't 20 C istwiceashotas100C
say
• Possible Measure: mode, frequency distribution + median + mean,
standard deviation, addition/subtraction
T
unit is
• Example: welldefined
– Temperature measured in Celsius
• 10 degrees C, 20 degrees C, 30 degrees C
– Year of someone’s birth
• 1990, 2000, 2010, 2020
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ratio Data
• Most precise and highest level of measurement
• Ordered
• Equal Intervals
• Natural Zeros:
– When the variable equals zero, it means there is none of that variable
– Not arbitrary
• Possible Measure: mode, frequency distribution + median + mean, standard
deviation, addition/subtraction + multiplication and division (ratio)
• Example:
– Weights
• 10 KG, 20 KG, 30 KG
– Time
• 10 Seconds, 1 Hour, 1 Day
10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR
Frequency
Distribution
Yes Yes Yes Yes
Mean, standard
deviation
No No Yes Yes
Ratios No No No Yes
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR
Yes.
Ratio
Yes.
Zero means
No.
none?
Interval
Yes.
Equally
split? No.
Ordinal
No.
Nominal
Ordered?
12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Quick Quiz
Yes.
Ratio
• Which level of measurement? Yes.
Yes.
Equally split? No.
13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Measures can (often) be controversial
• Measure of Intelligence (IQ Test):
– Ordinal or Interval?
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data
• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative
• Other aspects
– Available or Missing Data
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Numerical or Categorical
Variable
Categorical Numerical
(qualitative) (quantitative)
Discrete Continuous
Nominal Ordinal (Whole numerical (Can take any value
(Unordered (Ordered values) within a rage)
categories) categories) Example: outcome of Example:
tossing a die temperature in day
Quantitativedataaremeasuresofvaluesorcountsandare Numerical
expressed asnumbers OR (quantitative)
Qualitativedataaredataaboutcategoricalvariables
Interval Ratio
(may compute (may compute
difference but no difference, real
absolute zero) zero exists)
16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement:
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data
• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative
• Other aspects
– Available or Missing Data
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Missing Data
• Missing data: data that are missing and you do not know
the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.
18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
• Data wrangling
– The process of transforming and mapping data from one "raw" data
form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes
such as analytics.
– In short, transforms data to gain insight
– General process
Credit:https://en.wikipedia.org/wiki/Data_wrangling
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
Example:
customer data
stored in SQL;
Determine Make the most social network Remove invalid Ensure data Use data
your goal of dataset stored as graphs data correctness (training/test)
Men
removethispart
G
https://developers.google.com/machine-learning/data-prep/transform/normalization
– Handling missing features
EE2211 Introduction to Machine Learning 17
S. All Rights Reserved.
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features
23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features: Imputation
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Formatting Data
salary Co IBillion To L
• Normalization
– Linear Scaling:
scale each variable to [0 1]
mean
– Z-score standardization:
each independent dimension
of data is normally distributed
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity
• Data integrity is the maintenance and the assurance of data accuracy
and consistency;
– A critical aspect to the design, implementation, and usage of any system that stores,
processes, or retrieves data.
– Very broad concept!
• Example:
• In a dataset, numeric columns/cells should not accept alphabetic data.
• A binary entry should only allow binary inputs
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity programto
makesurethatdatainputisvalid
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Visualization
Graphical
Representation
of data!
29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example: Showing Distribution
Visualization: Distribution Visualization: Bars
Distribution
Probability Mass Function
31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Why Visualization is Necessary
32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• Types of data
– NOIR falsedata
• Data wrangling and cleaning missingdata
I random
initiation
forest
mmmm
BinaryCoding
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"