
5. Machine Learning Overview

COMPUTER VISION – TK44019
The course material for this chapter follows the HCIA - Artificial Intelligence module (the material is copyrighted by Huawei).
Foreword

• Machine learning is a core research field of AI and prerequisite knowledge for deep learning. Therefore, this chapter mainly introduces the main concepts of machine learning, the classification of machine learning, the overall machine learning process, and common machine learning algorithms.
Objectives

Upon completion of this course, you will be able to:
• Master the definition of learning algorithms and the machine learning process.
• Know common machine learning algorithms.
• Understand concepts such as hyperparameters, gradient descent, and cross validation.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithms (1)

• Machine learning (including deep learning) is a study of learning algorithms. A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃 if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.

  Data (Experience E) → Learning algorithms (Task T) → Basic understanding (Measure P)
Machine Learning Algorithms (2)

• Human reasoning: Experience → (Induction) → Regularity; applying the regularity to new problems predicts future attributes.
• Machine learning: Historical data → (Training) → Model; applying the model to new data predicts future data.
Differences Between Machine Learning Algorithms and Traditional Rule-Based Algorithms

Rule-based algorithms:
• Explicit programming is used to solve problems.
• Rules can be manually specified.

Machine learning (training data → machine learning → model; new data → model → prediction):
• Samples are used for training.
• The decision-making rules are complex or difficult to describe.
• Rules are automatically learned by machines.
Application Scenarios of Machine Learning (1)

• Machine learning suits problems whose solution is complex, or problems that involve a large amount of data without a clear data distribution function.
• Machine learning can be used in the following scenarios:
  - Rules are complex or cannot be described, such as facial recognition and voice recognition.
  - Task rules change over time. For example, in the part-of-speech tagging task, new words or meanings are generated at any time.
  - Data distribution changes over time, requiring constant readaptation of programs, such as predicting the trend of commodity sales.
Application Scenarios of Machine Learning (2)

[Figure: rule complexity (simple → complex) vs. scale of the problem (small → large). Quadrant labels: simple problems; manual rules (complex rules, small scale); rule-based algorithms (simple rules, large scale); machine learning algorithms (complex rules, large scale).]
Rational Understanding of Machine Learning Algorithms

  Ideal:  target equation 𝑓: 𝑋 → 𝑌
  Actual: training data 𝐷: {(𝑥₁, 𝑦₁), ⋯, (𝑥ₙ, 𝑦ₙ)} → learning algorithms → hypothesis function 𝑔 ≈ 𝑓

• The target function 𝑓 is unknown. Learning algorithms cannot obtain a perfect function 𝑓.
• The hypothesis function 𝑔 approximates function 𝑓 but may differ from it.
Main Problems Solved by Machine Learning

• Machine learning can deal with many types of tasks. The following describes the most typical and common types:
  - Classification: A computer program needs to specify which of the k categories some input belongs to. To accomplish this task, learning algorithms usually output a function 𝑓: ℝⁿ → {1, 2, …, 𝑘}. For example, the image classification algorithm in computer vision is developed to handle classification tasks.
  - Regression: For this type of task, a computer program predicts a continuous output for the given input. Learning algorithms typically output a function 𝑓: ℝⁿ → ℝ. An example of this task type is to predict the claim amount of an insured person (to set the insurance premium) or predict the security price.
  - Clustering: A large amount of data from an unlabeled dataset is divided into multiple categories according to internal similarity of the data. Data in the same category is more similar than data in different categories. This feature can be used in scenarios such as image retrieval and user profile management.
• Classification and regression are the two main types of prediction, accounting for 80% to 90% of prediction tasks. The output of classification is discrete category values, and the output of regression is continuous numbers.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Classification

• Supervised learning: Obtain an optimal model with the required performance through training and learning based on samples of known categories. Then, use the model to map all inputs to outputs and check the outputs for the purpose of classifying unknown data.
• Unsupervised learning: For unlabeled samples, the learning algorithms directly model the input datasets. Clustering is a common form of unsupervised learning. We only need to put highly similar samples together, calculate the similarity between new samples and existing ones, and classify them by similarity.
• Semi-supervised learning: In one task, a machine learning model automatically uses a large amount of unlabeled data to assist the learning of a small amount of labeled data.
• Reinforcement learning: It is an area of machine learning concerned with how agents ought to take actions in an environment to maximize some notion of cumulative reward. The difference between reinforcement learning and supervised learning is the teacher signal. The reinforcement signal provided by the environment in reinforcement learning is used to evaluate the action (scalar signal) rather than telling the learning system how to perform correct actions.
Supervised Learning

  Labeled data (features + label) → supervised learning algorithm → model

  Weather | Temperature | Wind Speed | Enjoy Sports
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | No
  Sunny   | Cold        | Weak       | Yes
Supervised Learning - Regression Questions

• Regression: discovers the dependency between attribute values of samples in a sample dataset by expressing the relationship of the sample mapping through functions.
  - How much will I benefit from the stock next week?
  - What's the temperature on Tuesday?
Supervised Learning - Classification Questions

• Classification: maps samples in a sample dataset to a specified category by using a classification model.
  - Will there be a traffic jam on XX road during the morning rush hour tomorrow?
  - Which method is more attractive to customers: 5 yuan voucher or 25% off?
Unsupervised Learning

  Unlabeled data (features only) → unsupervised learning algorithm → internal similarity

  Monthly Consumption | Commodity        | Consumption Time | Category
  1000–2000           | Badminton racket | 6:00–12:00       | Cluster 1
  500–1000            | Basketball       | 18:00–24:00      | Cluster 2
  1000–2000           | Game console     | 00:00–6:00       | Cluster 2
Unsupervised Learning - Clustering Questions

• Clustering: classifies samples in a sample dataset into several categories based on the clustering model. The similarity of samples belonging to the same category is high.
  - Which audiences like to watch movies of the same subject?
  - Which of these components are damaged in a similar way?
Semi-Supervised Learning

  Partially labeled data → semi-supervised learning algorithms → model

  Weather | Temperature | Wind Speed | Enjoy Sports
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | /
  Sunny   | Cold        | Weak       | /
Reinforcement Learning

• The model perceives the environment, takes actions, and makes adjustments and choices based on the status and the reward or punishment.

  Model ← (status 𝑠ₜ, reward or punishment 𝑟ₜ) ← Environment
  Model → (action 𝑎ₜ) → Environment, which returns 𝑠ₜ₊₁ and 𝑟ₜ₊₁
Reinforcement Learning - Best Behavior

• Reinforcement learning always looks for the best behaviors. It is typically targeted at machines or robots.
  - Autopilot: Should it brake or accelerate when the yellow light starts to flash?
  - Cleaning robot: Should it keep working or go back for charging?
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Process

  Data collection → Data cleansing → Feature extraction and selection → Model training → Model evaluation → Model deployment and integration
  (with feedback and iteration across the whole pipeline)
Basic Machine Learning Concept — Dataset

• Dataset: a collection of data used in machine learning tasks. Each data record is called a sample. Events or attributes that reflect the performance or nature of a sample in a particular aspect are called features.
• Training set: a dataset used in the training process, where each sample is referred to as a training sample. The process of creating a model from data is called learning (training).
• Test set: Testing refers to the process of using the model obtained after learning for prediction. The dataset used is called a test set, and each sample is called a test sample.
Checking Data Overview

• Typical dataset form (Features 1–3 plus label):

               No. | Area | School Districts | Direction | House Price
  Training set  1  | 100  | 8                | South     | 1000
                2  | 120  | 9                | Southwest | 1300
                3  | 60   | 6                | North     | 700
                4  | 80   | 9                | Southeast | 1100
  Test set      5  | 95   | 3                | South     | 850
Importance of Data Processing

• Data is crucial to models. It is the ceiling of model capabilities. Without good data, there is no good model.

Data preprocessing includes:
  - Data cleansing: fill in missing values, and detect and eliminate causes of dataset exceptions.
  - Data normalization: normalize data to reduce noise and improve model accuracy.
  - Data dimension reduction: simplify data attributes to avoid dimension explosion.
Workload of Data Cleansing

• Statistics on data scientists' work in machine learning (CrowdFlower Data Science Report 2016):
  - 60% Cleansing and sorting data
  - 19% Collecting datasets
  - 9% Mining patterns from data
  - 5% Others
  - 4% Optimizing models
  - 3% Remodeling training datasets
Data Cleansing

• Most machine learning models process features, which are usually numeric representations of input variables that can be used in the model.
• In most cases, the collected data can be used by algorithms only after being preprocessed. The preprocessing operations include the following:
  - Data filtering
  - Processing of missing data
  - Processing of possible exceptions, errors, or abnormal values
  - Combination of data from multiple data sources
  - Data consolidation
Dirty Data (1)

• Generally, real data may have some quality problems:
  - Incompleteness: contains missing values or data that lacks attributes.
  - Noise: contains incorrect records or exceptions.
  - Inconsistency: contains inconsistent records.
Dirty Data (2)

  #  | Id     | Name   | Birthday   | Gender | IsTeacher | #Students | Country     | City
  1  | 111    | John   | 31/12/1990 | M      | 0         | 0         | Ireland     | Dublin
  2  | 222    | Mery   | 15/10/1978 | F      | 1         | 15        | Iceland     |            ← missing value
  3  | 333    | Alice  | 19/04/2000 | F      | 0         | 0         | Spain       | Madrid
  4  | 444    | Mark   | 01/11/1997 | M      | 0         | 0         | France      | Paris
  5  | 555    | Alex   | 15/03/2000 | A      | 1         | 23        | Germany     | Berlin     ← invalid value (Gender "A")
  6  | 555    | Peter  | 1983-12-01 | M      | 1         | 10        | Italy       | Rome       ← invalid duplicate Id; incorrect date format
  7  | 777    | Calvin | 05/05/1995 | M      | 0         | 0         | Italy       | Italy      ← value that should be in another column
  8  | 888    | Roxane | 03/08/1948 | F      | 0         | 0         | Portugal    | Lisbon
  9  | 999    | Anne   | 05/09/1992 | F      | 0         | 5         | Switzerland | Geneva     ← attribute dependency violated (IsTeacher = 0 but #Students = 5)
  10 | 101010 | Paul   | 14/11/1992 | M      | 1         | 26        | Ytali       | Rome       ← misspelling ("Ytali")
Data Conversion

• After being preprocessed, the data needs to be converted into a representation form suitable for the machine learning model. Common data conversion forms include the following:
  - With respect to classification, category data is encoded into a corresponding numerical representation.
  - Value data is converted to category data to reduce the number of variable values (for example, age segmentation).
  - Other data:
    · In text, words are converted into word vectors through word embedding (generally using the word2vec model, BERT model, etc.).
    · Image data is processed (color space, grayscale, geometric change, Haar features, and image enhancement).
  - Feature engineering:
    · Normalize features to ensure the same value ranges for input variables of the same model.
    · Feature expansion: combine or convert existing variables to generate new features, such as the average.
Necessity of Feature Selection

• Generally, a dataset has many features, some of which may be redundant or irrelevant to the value to be predicted.
• Feature selection is necessary in the following aspects:
  - Simplify models to make them easy for users to interpret.
  - Reduce the training time.
  - Avoid dimension explosion.
  - Improve model generalization and avoid overfitting.
Feature Selection Methods - Filter

• Filter methods are independent of the model during feature selection. By evaluating the correlation between each feature and the target attribute, these methods use a statistical measure to assign a value to each feature. Features are then sorted by score, which is helpful for preserving or eliminating specific features.

Procedure of a filter method:
  Traverse all features → Select the optimal feature subset → Train models → Evaluate the performance

Common methods:
  - Pearson correlation coefficient
  - Chi-square coefficient
  - Mutual information

Limitations:
  - The filter method tends to select redundant variables because the relationships between features are not considered.
Feature Selection Methods - Wrapper

• Wrapper methods use a prediction model to score feature subsets. They consider feature selection as a search issue for which different combinations are evaluated and compared. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

Procedure of a wrapper method:
  Traverse all features → Generate a feature subset → Train models → Evaluate models → Select the optimal feature subset

Common methods:
  - Recursive feature elimination (RFE)

Limitations:
  - Wrapper methods train a new model for each subset, resulting in a huge number of computations.
  - A feature set with the best performance is usually provided for a specific type of model.
Feature Selection Methods - Embedded

• Embedded methods consider feature selection as a part of model construction. The most common type of embedded feature selection method is the regularization method. Regularization methods are also called penalization methods. They introduce additional constraints into the optimization of a predictive algorithm that bias the model toward lower complexity and reduce the number of features.

Procedure of an embedded method:
  Traverse all features → Generate a feature subset → Train models and evaluate the effect → Select the optimal feature subset

Common methods:
  - Lasso regression
  - Ridge regression
Overall Procedure of Building a Model

1. Data splitting: Divide data into training sets, test sets, and validation sets.
2. Model training: Use data that has been cleansed and feature-engineered to train a model.
3. Model verification: Use validation sets to validate the model validity.
4. Model test: Use test data to evaluate the generalization capability of the model in a real environment.
5. Model deployment: Deploy the model in an actual production scenario.
6. Model fine-tuning: Continuously tune the model based on the actual data of a service scenario.
Examples of Supervised Learning - Learning Phase

• Task: Use a classification model to predict whether a person is a basketball player based on specific features.

  Name     | City     | Age | Label
  Mike     | Miami    | 42  | yes     Training set (cleansed features and labels):
  Jerry    | New York | 32  | no      the model searches for the relationship
  Bryan    | Orlando  | 18  | no      between features and targets.
  Patricia | Miami    | 45  | yes
  Elodie   | Phoenix  | 35  | no      Test set: use new data to verify
  Remy     | Chicago  | 72  | yes     the model validity.
  John     | New York | 48  | yes

• Each feature or a combination of several features can provide a basis for a model to make a judgment.
Examples of Supervised Learning - Prediction Phase

• Unknown data (recent data; it is not known whether these people are basketball players):

  Name     | City    | Age | Label
  Marine   | Miami   | 45  | ?
  Julien   | Miami   | 52  | ?
  Fred     | Orlando | 20  | ?
  Michelle | Boston  | 34  | ?
  Nicolas  | Phoenix | 90  | ?

• Application model (rules learned in the training phase):
  IF city = Miami   → Probability = +0.7
  IF city = Orlando → Probability = +0.2
  IF age > 42       → Probability = +0.05*age + 0.06
  IF age ≤ 42       → Probability = +0.01*age + 0.02

• Possibility prediction: apply the model to the new data to predict whether each person is a basketball player.

  Name     | City    | Age | Prediction
  Marine   | Miami   | 45  | 0.3
  Julien   | Miami   | 52  | 0.9
  Fred     | Orlando | 20  | 0.6
  Michelle | Boston  | 34  | 0.5
  Nicolas  | Phoenix | 90  | 0.4
What Is a Good Model?

• Generalization capability: Can it accurately predict the actual service data?
• Interpretability: Is the prediction result easy to interpret?
• Prediction speed: How long does it take to predict each piece of data?
• Practicability: Is the prediction rate still acceptable when the service volume increases with a huge data volume?
Model Validity (1)

• Generalization capability: The goal of machine learning is that the model obtained after learning should perform well on new samples, not just on samples used for training. The capability of applying a model to new samples is called generalization or robustness.
• Error: difference between the sample result predicted by the model obtained after learning and the actual sample result.
  - Training error: error that you get when you run the model on the training data.
  - Generalization error: error that you get when you run the model on new samples. Obviously, we prefer a model with a smaller generalization error.
• Underfitting: occurs when the model or the algorithm does not fit the data well enough.
• Overfitting: occurs when the training error of the model obtained after learning is small but the generalization error is large (poor generalization capability).
Model Validity (2)

• Model capacity: the model's capability of fitting functions, which is also called model complexity.
  - When the capacity suits the task complexity and the amount of training data provided, the algorithm effect is usually optimal.
  - Models with insufficient capacity cannot solve complex tasks, and underfitting may occur.
  - A high-capacity model can solve complex tasks, but overfitting may occur if the capacity is higher than that required by a task.

[Figure: underfitting (not all features are learned), good fitting, and overfitting (noises are learned).]
Overfitting Cause — Error

• Total error of final prediction = Bias² + Variance + Irreducible error
• Generally, the prediction error can be divided into two types:
  - Error caused by "bias"
  - Error caused by "variance"
• Variance:
  - Offset of the prediction result from the average value.
  - Error caused by the model's sensitivity to small fluctuations in the training set.
• Bias:
  - Difference between the expected (or average) prediction value and the correct value we are trying to predict.
Variance and Bias

• Combinations of variance and bias are as follows:
  - Low bias & low variance → good model
  - Low bias & high variance
  - High bias & low variance
  - High bias & high variance → poor model
• Ideally, we want a model that can accurately capture the rules in the training data and also generalize to unseen (new) data. However, it is usually impossible for the model to complete both tasks at the same time.
Model Complexity and Error

• As the model complexity increases, the training error decreases.
• As the model complexity increases, the test error decreases to a certain point and then increases in the reverse direction, forming a convex curve.

[Figure: error vs. model complexity; the training error falls monotonically while the testing error is U-shaped, moving from the high-bias & low-variance regime to the low-bias & high-variance regime.]
Machine Learning Performance Evaluation - Regression

• The closer the Mean Absolute Error (MAE) is to 0, the better the model can fit the training data.

  $MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$

• Mean Square Error (MSE):

  $MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$

• The value range of R² is (−∞, 1]. A larger value indicates that the model can better fit the training data. TSS indicates the difference between samples, and RSS indicates the difference between the predicted value and the sample value.

  $R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$
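• Illustration (not from the original slides): a minimal NumPy sketch computing MAE, MSE, and R² as defined above; the sample arrays are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, and R^2 as defined above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - rss / tss
    return mae, mse, r2

# Toy example (values are illustrative only)
mae, mse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.5, 6.0])
print(f"MAE={mae:.3f}, MSE={mse:.3f}, R2={r2:.3f}")
```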
Machine Learning Performance Evaluation - Classification (1)

• Terms and definitions:
  - P: positive, indicating the number of real positive cases in the data.
  - N: negative, indicating the number of real negative cases in the data.
  - TP: true positive, indicating the number of positive cases that are correctly classified by the classifier.
  - TN: true negative, indicating the number of negative cases that are correctly classified by the classifier.
  - FP: false positive, indicating the number of negative cases that are incorrectly classified as positive by the classifier.
  - FN: false negative, indicating the number of positive cases that are incorrectly classified as negative by the classifier.

  Confusion matrix:

              Estimated yes | Estimated no | Total
  Actual yes  TP            | FN           | P
  Actual no   FP            | TN           | N
  Total       P′            | N′           | P + N

• Confusion matrix: at least an 𝑚 × 𝑚 table. 𝐶𝑀ᵢ,ⱼ of the first 𝑚 rows and 𝑚 columns indicates the number of cases that actually belong to class 𝑖 but are classified into class 𝑗 by the classifier.
  - Ideally, for a high-accuracy classifier, most prediction values should be located on the diagonal from 𝐶𝑀₁,₁ to 𝐶𝑀ₘ,ₘ of the table, while values outside the diagonal are 0 or close to 0. That is, FP and FN are close to 0.
Machine Learning Performance Evaluation - Classification (2)

  Measurement                                      | Ratio
  Accuracy and recognition rate                    | (TP + TN) / (P + N)
  Error rate and misclassification rate            | (FP + FN) / (P + N)
  Sensitivity, true positive rate, and recall      | TP / P
  Specificity and true negative rate               | TN / N
  Precision                                        | TP / (TP + FP)
  F1, harmonic mean of recall and precision        | (2 × precision × recall) / (precision + recall)
  Fβ, where β is a non-negative real number        | ((1 + β²) × precision × recall) / (β² × precision + recall)
Example of Machine Learning Performance Evaluation

• We have trained a machine learning model to identify whether the object in an image is a cat. Now we use 200 pictures to verify the model performance. Among the 200 images, objects in 170 images are cats, while others are not. The identification result of the model is that objects in 160 images are cats, while others are not.

  Confusion matrix:
              Estimated yes | Estimated no | Total
  Actual yes  140           | 30           | 170
  Actual no   20            | 10           | 30
  Total       160           | 40           | 200

  Precision: P = TP / (TP + FP) = 140 / (140 + 20) = 87.5%
  Recall:    R = TP / P = 140 / 170 ≈ 82.4%
  Accuracy:  ACC = (TP + TN) / (P + N) = (140 + 10) / (170 + 30) = 75%
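• Illustration (not from the original slides): reproducing the cat-classifier metrics from the confusion-matrix counts above.

```python
# Confusion-matrix counts from the cat example above.
TP, FN = 140, 30   # actual cats: 170
FP, TN = 20, 10    # actual non-cats: 30

precision = TP / (TP + FP)                   # 140/160 = 0.875
recall = TP / (TP + FN)                      # 140/170 ≈ 0.824
accuracy = (TP + TN) / (TP + FN + FP + TN)   # 150/200 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} F1={f1:.3f}")
```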
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Training Method - Gradient Descent (1)

• The gradient descent method uses the negative gradient direction of the current position as the search direction, which is the steepest descent direction. The formula is as follows:

  $w_{k+1} = w_k - \eta \nabla f_{w_k}(x^{i})$

• In the formula, 𝜂 indicates the learning rate and 𝑖 indicates the 𝑖-th data record. The weight parameter 𝑤 is updated in each iteration.
• Convergence: The value of the objective function changes very little, or the maximum number of iterations is reached.

[Figure: descent path on a cost surface.]
Machine Learning Training Method - Gradient Descent (2)

• Batch Gradient Descent (BGD) uses all samples (m in total) in the dataset to update the weight parameter based on the gradient value at the current point:

  $w_{k+1} = w_k - \frac{\eta}{m} \sum_{i=1}^{m} \nabla f_{w_k}(x^{i})$

• Stochastic Gradient Descent (SGD) randomly selects one sample in the dataset to update the weight parameter based on the gradient value at the current point:

  $w_{k+1} = w_k - \eta \nabla f_{w_k}(x^{i})$

• Mini-Batch Gradient Descent (MBGD) combines the features of BGD and SGD and uses the gradients of n samples in the dataset to update the weight parameter:

  $w_{k+1} = w_k - \frac{\eta}{n} \sum_{i=t}^{t+n-1} \nabla f_{w_k}(x^{i})$
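• Illustration (not from the original slides): a minimal NumPy sketch of the update rules above for linear least squares; the learning rate, batch size, and synthetic data are arbitrary choices. The MBGD update is shown; BGD and SGD follow by changing the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    """Gradient of the mean squared error loss on a batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w, eta, n = np.zeros(3), 0.05, 10                   # weights, learning rate, batch size
for k in range(300):
    # BGD would use the full (X, y); SGD a single random row;
    # MBGD (shown here) a random mini-batch of n samples.
    idx = rng.choice(len(y), size=n, replace=False)
    w -= eta * grad(w, X[idx], y[idx])

print("learned weights:", np.round(w, 2))           # close to [1.0, -2.0, 0.5]
```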
Machine Learning Training Method - Gradient Descent (3)

• Comparison of the three gradient descent methods:
  - In SGD, the samples selected for each training iteration are stochastic. Such instability causes the loss function to be unstable or even causes reverse displacement when the loss function decreases to the lowest point.
  - BGD has the highest stability but consumes too many computing resources. MBGD is a method that balances SGD and BGD.

  BGD:  uses all training samples for each update.
  SGD:  uses one training sample for each update.
  MBGD: uses a certain number of training samples for each update.
Parameters and Hyperparameters in Models

• A model contains not only parameters but also hyperparameters. The purpose of hyperparameters is to enable the model to learn the optimal parameters.
  - Parameters are automatically learned by models ("distilled" from data).
  - Hyperparameters are manually set and are used to control the training.
Hyperparameters of a Model

• Model hyperparameters are external configurations of models. They are:
  - Often used in model parameter estimation processes.
  - Often specified by the practitioner.
  - Often set using heuristics.
  - Often tuned for a given predictive modeling problem.

• Common model hyperparameters:
  - λ in Lasso/Ridge regression
  - Learning rate, number of iterations, batch size, activation function, and number of neurons for training a neural network
  - 𝐶 and 𝜎 in support vector machines (SVM)
  - K in k-nearest neighbors (KNN)
  - Number of trees in a random forest
Hyperparameter Search Procedure and Method

Procedure for searching hyperparameters:
1. Divide the dataset into a training set, validation set, and test set.
2. Optimize the model parameters using the training set based on the model performance indicators.
3. Search for the model hyperparameters using the validation set based on the model performance indicators.
4. Perform step 2 and step 3 alternately. Finally, determine the model parameters and hyperparameters, and assess the model using the test set.

Search algorithms (step 3):
  - Grid search
  - Random search
  - Heuristic intelligent search
  - Bayesian search
Hyperparameter Searching Method - Grid Search

• Grid search attempts to exhaustively search all possible hyperparameter combinations, which form a grid of hyperparameter values.
• In practice, the range of hyperparameter values to search is specified manually.
• Grid search is an expensive and time-consuming method.
  - This method works well when the number of hyperparameters is relatively small. Therefore, it is generally applicable to machine learning algorithms but inapplicable to neural networks (see the deep learning part).
Hyperparameter Searching Method - Random Search

• When the hyperparameter search space is large, random search is better than grid search.
• In random search, each setting is sampled from the distribution of possible parameter values, in an attempt to find the best subset of hyperparameters.
• Note:
  - Search is performed within a coarse range, which then will be narrowed based on where the best result appears.
  - Some hyperparameters are more important than others, and the search deviation will be affected during random search.
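• Illustration (not from the original slides; assumes scikit-learn and SciPy are available): grid search versus random search over the SVM hyperparameters C and gamma.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively tries every combination on the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_)

# Random search: samples 10 settings from continuous distributions.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_)
```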
Cross Validation (1)

• Cross validation: a statistical analysis method used to validate the performance of a classifier. The basic idea is to divide the original dataset into two parts, a training set and a validation set: train the classifier using the training set and test the model using the validation set to check the classifier performance.
• k-fold cross validation (𝐾-𝐶𝑉):
  - Divide the raw data into 𝑘 groups (generally, evenly divided).
  - Use each subset in turn as a validation set, and use the other 𝑘 − 1 subsets as the training set. A total of 𝑘 models can be obtained.
  - Use the mean classification accuracy of the 𝑘 models on their validation sets as the performance indicator of the 𝐾-𝐶𝑉 classifier.
Cross Validation (2)

  Entire dataset: [ Training set                  | Test set ]
                  [ Training set | Validation set | Test set ]

• Note: The K value in K-fold cross validation is also a hyperparameter.
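• Illustration (not from the original slides; assumes scikit-learn): k-fold cross validation that averages validation accuracy over the k folds, as described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on held-out fold

print("mean validation accuracy:", np.mean(scores))
```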
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithm Overview

Supervised learning:
  - Classification: logistic regression, SVM, neural network, decision tree, random forest, GBDT, KNN, naive Bayes
  - Regression: linear regression, SVM, neural network, decision tree, random forest, GBDT

Unsupervised learning:
  - Clustering: K-means, hierarchical clustering, density-based clustering
  - Others: correlation rule, principal component analysis (PCA), Gaussian mixture model (GMM)
Linear Regression (1)

• Linear regression: a statistical analysis method to determine the quantitative relationships between two or more variables through regression analysis in mathematical statistics.
• Linear regression is a type of supervised learning.

[Figure: unary linear regression (a fitted line in 2D) vs. multi-dimensional linear regression (a fitted plane in 3D).]
Linear Regression (2)

• The model function of linear regression is as follows, where 𝑤 indicates the weight parameter, 𝑏 indicates the bias, and 𝑥 indicates the sample attribute:

  $h_w(x) = w^T x + b$

• The relationship between the value predicted by the model and the actual value is as follows, where 𝑦 indicates the actual value and 𝜀 indicates the error:

  $y = w^T x + b + \varepsilon$

• The error 𝜀 is influenced by many factors independently. According to the central limit theorem, the error 𝜀 follows a normal distribution. According to the normal distribution function and maximum likelihood estimation, the loss function of linear regression is as follows:

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2$

• To make the predicted value close to the actual value, we need to minimize the loss value. We can use the gradient descent method to calculate the weight parameter 𝑤 that minimizes the loss function and thereby complete model building.
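• Illustration (not from the original slides): fitting the linear model above by gradient descent on the loss J(w); the data is synthetic with true w = 3 and b = 4.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 4.0 + rng.normal(scale=1.0, size=50)  # true w=3, b=4 plus noise

w, b, eta = 0.0, 0.0, 0.01
for _ in range(2000):
    pred = w * x + b
    # Gradients of J(w, b) = (1/2m) * sum((pred - y)^2)
    dw = np.mean((pred - y) * x)
    db = np.mean(pred - y)
    w -= eta * dw
    b -= eta * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # should approach 3 and 4
```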
Linear Regression Extension - Polynomial Regression

• Polynomial regression is an extension of linear regression. Generally, the complexity of a dataset exceeds what can be fitted by a straight line; that is, obvious underfitting occurs if the original linear regression model is used. The solution is to use polynomial regression:

  $h_w(x) = w_1 x + w_2 x^2 + \cdots + w_n x^n + b$

• Here, the nth power is the polynomial regression dimension (degree).
• Polynomial regression belongs to linear regression: the relationship between its weight parameters 𝑤 is still linear, while its nonlinearity is reflected in the feature dimension.

[Figure: comparison between linear regression and polynomial regression fits.]
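• Illustration (not from the original slides; assumes scikit-learn): polynomial regression as a degree-3 feature expansion followed by ordinary linear regression, which is exactly why the model stays linear in w.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=60)).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=60)  # cubic data

# Degree-3 feature expansion, then a linear model over the new features.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", round(model.score(x, y), 3))
```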
Linear Regression and Overfitting Prevention

• Regularization terms can be used to reduce overfitting. The value of 𝑤 cannot be too large or too small in the sample space. You can add a squared-sum penalty to the target function:

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2 + \lambda \lVert w \rVert_2^2$

• Regularization terms (norm): the regularization term here is called the L2-norm. Linear regression that uses this loss function is also called Ridge regression.

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2 + \lambda \lVert w \rVert_1$

• Linear regression with the absolute-value (L1-norm) penalty is called Lasso regression.
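• Illustration (not from the original slides; assumes scikit-learn): Ridge and Lasso on the same data; Lasso's L1 penalty drives useless coefficients exactly to zero, while Ridge's L2 penalty only shrinks them. The alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=80)  # 3 useless features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))  # useless features ≈ 0 exactly
```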
Logistic Regression (1)

• Logistic regression: The logistic regression model is used to solve classification problems. The model is defined as follows:

  $P(Y = 1 \mid x) = \frac{e^{wx+b}}{1 + e^{wx+b}}$

  $P(Y = 0 \mid x) = \frac{1}{1 + e^{wx+b}}$

  where 𝑤 indicates the weight, 𝑏 indicates the bias, and 𝑤𝑥 + 𝑏 is regarded as the linear function of 𝑥. Compare the preceding two probability values. The class with a higher probability value is the class of 𝑥.
Logistic Regression (2)

• Both the logistic regression model and linear regression model are generalized linear models. Logistic regression introduces a nonlinear factor (the sigmoid function) based on linear regression and sets a threshold, so it can deal with binary classification problems.
• According to the model function of logistic regression, the loss function of logistic regression can be estimated as follows by using maximum likelihood estimation:

  $J(w) = -\frac{1}{m} \sum \left( y \ln h_w(x) + (1 - y) \ln(1 - h_w(x)) \right)$

• Here, 𝑤 indicates the weight parameter, 𝑚 indicates the number of samples, 𝑥 indicates the sample, and 𝑦 indicates the real value. The values of all the weight parameters 𝑤 can also be obtained through the gradient descent algorithm.
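• Illustration (not from the original slides): the sigmoid model trained by gradient descent on the cross-entropy loss J(w) above, using synthetic linearly separable data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)  # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)                # P(Y = 1 | x)
    # Gradient of the cross-entropy loss J(w) above
    w -= eta * X.T @ (p - y) / len(y)
    b -= eta * np.mean(p - y)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```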
Logistic Regression Extension - Softmax Function (1)

• Logistic regression applies only to binary classification problems (for example, male or female). For multi-class classification problems (for example, grape, orange, apple, or banana), use the Softmax function.
Logistic Regression Extension - Softmax Function (2)

• Softmax regression is a generalization of logistic regression that we can use for K-class classification.
• The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-dimensional vector of real values, where each vector element is in the interval (0, 1).
• The regression probability function of Softmax is as follows:

  $p(y = k \mid x; w) = \frac{e^{w_k^T x}}{\sum_{l=1}^{K} e^{w_l^T x}}, \quad k = 1, 2, \ldots, K$
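• Illustration (not from the original slides): the Softmax mapping itself; subtracting the maximum is a standard numerical-stability trick.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional score vector to probabilities that sum to 1."""
    z = z - np.max(z)            # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.1, -1.0])   # arbitrary class scores
probs = softmax(scores)
print(np.round(probs, 3), "sum =", probs.sum())
```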
Logistic Regression Extension - Softmax Function (3)

• Softmax assigns a probability to each class in a multi-class problem. These probabilities must add up to 1.
  - Softmax may produce the probability of belonging to a particular class. Example:

  Category | Probability
  Grape?   | 0.09
  Orange?  | 0.22
  Apple?   | 0.68
  Banana?  | 0.01

  Sum of all probabilities: 0.09 + 0.22 + 0.68 + 0.01 = 1. Most probably, this picture is an apple.
Decision Tree

• A decision tree is a tree structure (a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute. Each branch represents the output of a feature attribute in a certain value range, and each leaf node stores a category. To use the decision tree, start from the root node, test the feature attributes of the items to be classified, select the output branches, and use the category stored on the leaf node as the final result.

[Figure: an animal-classification decision tree. The root splits on height (short/tall). Short animals split on whether they can squeak (might be a rat) or cannot squeak (might be a squirrel). Tall animals split on neck length (long neck: might be a giraffe), then on nose length (long nose: might be an elephant), then on habitat (on land: might be a rhinoceros; in water: might be a hippo).]
Decision Tree Structure

  Root node
  ├─ Internal node
  │  ├─ Leaf node
  │  └─ Leaf node
  └─ Internal node
     ├─ Leaf node
     └─ Internal node
        ├─ Leaf node
        └─ Leaf node
Key Points of Decision Tree Construction

• To create a decision tree, we need to select attributes and determine the tree structure between feature attributes. The key step of constructing a decision tree is to divide the data on each feature attribute, compare the resulting sets in terms of 'purity', and select the attribute with the highest 'purity' as the data point for dataset division.
• The metrics to quantify 'purity' include the information entropy and the Gini index:

  $H(X) = -\sum_{k=1}^{K} p_k \log_2 p_k$

  $\text{Gini} = 1 - \sum_{k=1}^{K} p_k^2$

• Here, 𝑝ₖ indicates the probability that the sample belongs to class 𝑘 (there are 𝐾 classes in total). A greater difference between purity before segmentation and that after segmentation indicates a better decision tree.
• Common decision tree algorithms include ID3, C4.5, and CART.
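• Illustration (not from the original slides): computing the two purity metrics from the class counts at a node.

```python
import numpy as np

def entropy(counts):
    """Information entropy H(X) from class counts at a node."""
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    if p.size <= 1:
        return 0.0                  # pure node
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini index from class counts at a node."""
    p = np.asarray(counts, float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([5, 5]), gini([5, 5]))    # impure node: 1.0, 0.5
print(entropy([10, 0]), gini([10, 0]))  # pure node:   0.0, 0.0
```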
Decision Tree Construction Process

• Feature selection: Select a feature from the features of the training data as the split standard of the current node. (Different standards generate different decision tree algorithms.)
• Decision tree generation: Generate internal nodes from top to bottom based on the selected features and stop when the dataset can no longer be split.
• Pruning: The decision tree may easily become overfitting unless necessary pruning (including pre-pruning and post-pruning) is performed to reduce the tree size and optimize its node structure.
Decision Tree Example

• The following shows a classification when a decision tree is used. The classification result is impacted by three attributes: Refund, Marital Status, and Taxable Income.

  Tid | Refund | Marital Status | Taxable Income | Cheat
  1   | Yes    | Single         | 125,000        | No
  2   | No     | Married        | 100,000        | No
  3   | No     | Single         | 70,000         | No
  4   | Yes    | Married        | 120,000        | No
  5   | No     | Divorced       | 95,000         | Yes
  6   | No     | Married        | 60,000         | No
  7   | Yes    | Divorced       | 220,000        | No
  8   | No     | Single         | 85,000         | Yes
  9   | No     | Married        | 75,000         | No
  10  | No     | Single         | 90,000         | Yes

[Figure: the learned tree first tests Refund (Yes → No), then Marital Status, then Taxable Income.]
SVM

• SVM is a binary classification model whose basic model is a linear classifier defined in the eigenspace with the largest interval (margin). SVMs also include kernel tricks that make them nonlinear classifiers. The SVM learning algorithm is the optimal solution to a convex quadratic programming problem.

[Figure: a projection maps data that is complex to segment in low-dimensional space into a high-dimensional space where segmentation is easy.]
Linear SVM (1)

• How do we split the red and blue datasets by a straight line?

[Figure: a two-dimensional dataset with binary classification; both the left and right splitting lines can be used to divide the datasets. Which of them is correct?]
Linear SVM (2)

• Straight lines are used to divide data into different classes. Actually, we can use multiple straight lines to divide data. The core idea of the SVM is to find a straight line that keeps the points closest to it as far away from it as possible. This gives the model strong generalization capability. These closest points are called support vectors.
• In two-dimensional space, we use straight lines for segmentation. In high-dimensional space, we use hyperplanes for segmentation.

[Figure: the distance between support vectors on either side of the separating line is as large as possible.]
Nonlinear SVM (1)

• How do we classify a nonlinearly separable dataset?

[Figure: a linear SVM works well for linearly separable datasets, but nonlinear datasets cannot be split with straight lines.]
Nonlinear SVM (2)

• Kernel functions are used to construct nonlinear SVMs.
• Kernel functions allow algorithms to fit the largest-margin hyperplane in a transformed high-dimensional feature space.

Common kernel functions:
  - Linear kernel function
  - Polynomial kernel function
  - Gaussian kernel function
  - Sigmoid kernel function

[Figure: the input space is mapped to a high-dimensional feature space where the classes become linearly separable.]
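• Illustration (not from the original slides; assumes scikit-learn): a linear kernel versus a Gaussian (RBF) kernel on data that is not linearly separable.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to split with a straight line.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))
# The RBF kernel separates the circles almost perfectly; the linear kernel cannot.
```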
KNN Algorithm (1)

• The KNN classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. According to this method, if the majority of the k samples most similar to one sample (its nearest neighbors in the eigenspace) belong to a specific category, this sample also belongs to that category.

[Figure: the target category of a query point "?" varies with the number of the most adjacent nodes considered.]
KNN Algorithm (2)

• As the prediction result is determined based on the number and weights of neighbors in the training set, the KNN algorithm has a simple logic.
• KNN is a non-parametric method, which is usually used in datasets with irregular decision boundaries.
  - The KNN algorithm generally adopts the majority voting method for classification prediction and the average value method for regression prediction.
• KNN requires a huge number of computations.
KNN Algorithm (3)

• Generally, a larger k value reduces the impact of noise on classification, but obfuscates the boundary between classes.
  - A larger k value means a higher probability of underfitting because the segmentation is too rough. A smaller k value means a higher probability of overfitting because the segmentation is too refined.
• The boundary becomes smoother as the value of k increases. As the k value increases to infinity, all data points will eventually become all blue or all red.
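• Illustration (not from the original slides; assumes scikit-learn): the effect of k; a very small k fits the training data almost perfectly (overfitting risk), while a large k smooths the boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}: train={knn.score(X_tr, y_tr):.2f} "
          f"test={knn.score(X_te, y_te):.2f}")
```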
Naive Bayes (1)

• Naive Bayes algorithm: a simple multi-class classification algorithm based on the Bayes theorem. It assumes that features are independent of each other. For a given sample feature 𝑋, the probability that the sample belongs to a category 𝐶ₖ is:

  $P(C_k \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid C_k) \, P(C_k)}{P(X_1, \ldots, X_n)}$

  - 𝑋₁, …, 𝑋ₙ are data features, which are usually described by measurement values of m attribute sets.
    · For example, the color feature may have three attributes: red, yellow, and blue.
  - 𝐶ₖ indicates that the data belongs to a specific category 𝐶.
  - 𝑃(𝐶ₖ | 𝑋₁, …, 𝑋ₙ) is the posterior probability of category 𝐶ₖ given the features.
  - 𝑃(𝐶ₖ) is a prior probability that is independent of 𝑋₁, …, 𝑋ₙ.
  - 𝑃(𝑋₁, …, 𝑋ₙ) is the prior probability of 𝑋 (the evidence).
Naive Bayes (2)

• Independence assumption of features:
  - For example, if a fruit is red, round, and about 10 cm (3.94 in.) in diameter, it can be considered an apple.
  - A Naive Bayes classifier considers that each feature independently contributes to the probability that the fruit is an apple, regardless of any possible correlation between the color, roundness, and diameter.
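• Illustration (not from the original slides; assumes scikit-learn): a Gaussian Naive Bayes classifier, which applies the Bayes formula above with per-class Gaussian estimates for each feature.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)   # estimates P(C_k) and P(X_i | C_k)
print("test accuracy:", round(nb.score(X_te, y_te), 3))
print("class posteriors for one sample:", nb.predict_proba(X_te[:1]).round(3))
```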
Ensemble Learning

• Ensemble learning is a machine learning paradigm in which multiple learners are trained and combined to solve the same problem. When multiple learners are used, the integrated generalization capability can be much stronger than that of a single learner.
• If you ask a complex question to thousands of people at random and then summarize their answers, the summarized answer is better than an expert's answer in most cases. This is the wisdom of the masses.

  Training set → Dataset 1, Dataset 2, …, Dataset m → Model 1, Model 2, …, Model m → model synthesis → large model
Classification of Ensemble Learning

• Bagging (random forest):
  - Independently builds several basic learners and then averages their predictions.
  - On average, a composite learner is usually better than a single-base learner because of a smaller variance.
• Boosting (AdaBoost, GBDT, and XGBoost):
  - Constructs basic learners in sequence to gradually reduce the bias of a composite learner.
  - The composite learner can fit data well, which may also cause overfitting.
Ensemble Methods in Machine Learning (1)

• Random forest = Bagging + CART decision tree
• Random forests build multiple decision trees and merge them together to make predictions more accurate and stable.
  - Random forests can be used for classification and regression problems.

  Bootstrap sampling → decision tree building → aggregation of prediction results:
  All training data → data subset 1, 2, …, n → prediction 1, 2, …, n → final prediction
  (category: majority voting; regression: average value)
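• Illustration (not from the original slides; assumes scikit-learn): a single CART tree versus a bagged ensemble of trees (a random forest) on the same data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test accuracy:", round(tree.score(X_te, y_te), 3))
print("random forest test accuracy:", round(forest.score(X_te, y_te), 3))
# The averaged ensemble usually generalizes better (lower variance).
```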
Ensemble Methods in Machine Learning (2)

• GBDT is a type of boosting algorithm.
• For an aggregative model, the sum of the results of all the basic learners equals the predicted value. In essence, the next basic learner fits the residual of the error function with respect to the current predicted value. (The residual is the error between the predicted value and the actual value.)
• During model training, GBDT requires that the sample loss for model prediction be as small as possible.

  Example (predicting an age of 30 years):
  Learner 1 predicts 20 years old → residual 10 years.
  Learner 2 predicts 9 of the residual 10 years → residual 1 year.
  Learner 3 predicts the remaining 1 year → total prediction 20 + 9 + 1 = 30 years old.
Unsupervised Learning - K-means

• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• For the k-means algorithm, specify the final number of clusters (k). Then, divide the n data objects into k clusters. The clusters obtained meet the following conditions: (1) objects in the same cluster are highly similar; (2) the similarity of objects in different clusters is small.

[Figure: untagged two-dimensional data before and after k-means clustering; the algorithm automatically classifies the dataset.]
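• Illustration (not from the original slides; assumes scikit-learn): clustering unlabeled 2-D points into k = 3 groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: three blobs in 2-D space (ground truth is ignored).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_.round(2))
print("first 10 assignments:", km.labels_[:10])
```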
Unsupervised Learning - Hierarchical Clustering

• Hierarchical clustering divides a dataset at different layers and forms a tree-like clustering structure. The dataset division may use a "bottom-up" aggregation policy (agglomerative hierarchical clustering) or a "top-down" splitting policy (divisive hierarchical clustering). The hierarchy of clustering is represented in a tree graph. The root is the unique cluster of all samples, and the leaves are clusters of single samples.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Comprehensive Case

• Assume that there is a dataset containing the house areas and prices of 21,613 housing units sold in a city. Based on this data, we can predict the prices of other houses in the city.

  House Area | Price
  1,180      | 221,900
  2,570      | 538,000
  770        | 180,000
  1,960      | 604,000
  1,680      | 510,000
  5,420      | 1,225,000
  1,715      | 257,500
  1,060      | 291,850
  1,160      | 468,000
  1,430      | 310,000
  1,370      | 400,000
  1,810      | 530,000
  …          | …
Problem Analysis

• This case contains a large amount of data, including input x (house area) and output y (price), which is a continuous value. We can use regression of supervised learning. Draw a scatter chart based on the data and use linear regression.
• Our goal is to build a model function h(x) that infinitely approximates the function that expresses the true distribution of the dataset.
• Then, use the model to predict unknown price data.

  Input x (feature: house area) → dataset → learning algorithm → h(x) → output y (label: price)

  Unary linear regression function: $h(x) = w_0 + w_1 x$
Goal of Linear Regression

• Linear regression aims to find a straight line that best fits the dataset.
• Linear regression is a parameter-based model. Here, we need to learn the parameters 𝑤₀ and 𝑤₁. When these two parameters are found, the best model appears.

[Figure: several candidate lines h(x) = w₀ + w₁x through the price vs. house area scatter plot. Which line fits best?]
Loss Function of Linear Regression

• To find the optimal parameters, construct a loss function and find the parameter values that minimize it.

  Loss function of linear regression:

  $J(w) = \frac{1}{2m} \sum \left( h(x) - y \right)^2$

  Goal:

  $\arg\min_{w} J(w) = \frac{1}{2m} \sum \left( h(x) - y \right)^2$

• Here, m indicates the number of samples, h(x) indicates the predicted value, and y indicates the actual value.

[Figure: the vertical gaps (errors) between each scatter point and the fitted line.]
Gradient Descent Method

• The gradient descent algorithm finds the minimum value of a function through iteration.
• It starts from a random initial point on the loss function and then moves in the negative gradient direction to find the global minimum value of the loss function. The parameter values there are the optimal parameter values.
  - Point A: the position of 𝑤₀ and 𝑤₁ after random initialization. 𝑤₀ and 𝑤₁ are the required parameters.
  - A-B connection line: a track formed by descents in the negative gradient direction. Upon each descent, the values of 𝑤₀ and 𝑤₁ change, and the regression line also changes.
  - Point B: the global minimum value of the loss function, where the final values of 𝑤₀ and 𝑤₁ are found.
Iteration Example

• The following is an example of a gradient descent iteration. As the red points on the loss function surface gradually approach the lowest point, the fit of the linear regression line to the data becomes better and better. At this time, we get the best parameters.
Model Debugging and Application

• After the model is trained, test it with the test set to ensure the generalization capability. The final model result in this case is:

  h(x) = 280.62x − 43581

• If overfitting occurs, use Lasso regression or Ridge regression with regularization terms and tune the hyperparameters.
• If underfitting occurs, use a more complex regression model, such as GBDT.
• Note:
  - For real data, pay attention to data cleansing and feature engineering.
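• Illustration (not from the original slides): an end-to-end sketch of this case using the sample rows from the dataset table above (the full 21,613-row dataset is not reproduced); it fits h(x) = w₀ + w₁x with least squares and predicts the price of a hypothetical house.

```python
import numpy as np

# A few (area, price) pairs from the dataset table above.
area = np.array([1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1160, 1430])
price = np.array([221900, 538000, 180000, 604000, 510000, 1225000,
                  257500, 291850, 468000, 310000])

# Least-squares fit of h(x) = w0 + w1*x via the normal equation.
X = np.column_stack([np.ones_like(area), area])
w0, w1 = np.linalg.lstsq(X, price.astype(float), rcond=None)[0]
print(f"h(x) = {w1:.2f}x + {w0:.2f}")

# Predict the price of a hypothetical 2,000 sq ft house.
print("predicted price for area 2000:", round(w0 + w1 * 2000))
```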
Summary

• First, this course describes the definition and classification of machine learning, as well as the problems machine learning solves. Then, it introduces the key knowledge points of machine learning, including the overall procedure (data collection, data cleansing, feature extraction, model training, model evaluation, and model deployment), common algorithms (linear regression, logistic regression, decision tree, SVM, naive Bayes, KNN, ensemble learning, K-means, etc.), the gradient descent algorithm, and parameters and hyperparameters.
• Finally, a complete machine learning process is presented by a case of using linear regression to predict house prices.
Quiz

1. (True or false) Gradient descent iteration is the only method of machine learning algorithms. ( )
   A. True
   B. False

2. (Single-answer question) Which of the following algorithms is not supervised learning? ( )
   A. Linear regression
   B. Decision tree
   C. KNN
   D. K-means
Recommendations

• Online learning website
  - https://e.huawei.com/en/talent/#/
• Huawei Knowledge Base
  - https://support.huawei.com/enterprise/en/knowledge?lang=en
Thank you.

Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.