
5. Machine Learning Overview

COMPUTER VISION – TK44019
The course material for this chapter follows the HCIA - Artificial Intelligence module (the material is copyrighted by Huawei).
Foreword

• Machine learning is a core research field of AI and prerequisite knowledge for deep learning. Therefore, this chapter mainly introduces the main concepts of machine learning, the classification of machine learning, the overall machine learning process, and common machine learning algorithms.
Objectives

Upon completion of this course, you will be able to:
• Master the definition of learning algorithms and the machine learning process.
• Know common machine learning algorithms.
• Understand concepts such as hyperparameters, gradient descent, and cross validation.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithms (1)

• Machine learning (including deep learning) is a study of learning algorithms. A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃 if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.

  Data (Experience E) → Learning algorithms (Task T) → Basic understanding (Measure P)
Machine Learning Algorithms (2)

• Human reasoning: Experience → (Induction) → Regularity; applying the regularity to new problems predicts future attributes.
• Machine learning: Historical data → (Training) → Model; applying the model to new data predicts future data.
Differences Between Machine Learning Algorithms and Traditional Rule-Based Algorithms

Rule-based algorithms:
• Explicit programming is used to solve problems.
• Rules can be manually specified.

Machine learning (training data → machine learning → model; new data → model → prediction):
• Samples are used for training.
• The decision-making rules are complex or difficult to describe.
• Rules are automatically learned by machines.
Application Scenarios of Machine Learning (1)

• Machine learning suits problems whose solution is complex, or problems that involve a large amount of data without a clear data distribution function.
• Machine learning can be used in the following scenarios:
  - Rules are complex or cannot be described, such as facial recognition and voice recognition.
  - Task rules change over time. For example, in the part-of-speech tagging task, new words or meanings are generated at any time.
  - Data distribution changes over time, requiring constant readaptation of programs, such as predicting the trend of commodity sales.
Application Scenarios of Machine Learning (2)

[Figure: rule complexity (simple → complex) vs. scale of the problem (small → large). Quadrant labels: simple problems; manual rules (complex rules, small scale); rule-based algorithms (simple rules, large scale); machine learning algorithms (complex rules, large scale).]
Rational Understanding of Machine Learning Algorithms

  Ideal:  target equation 𝑓: 𝑋 → 𝑌
  Actual: training data 𝐷: {(𝑥₁, 𝑦₁), ⋯, (𝑥ₙ, 𝑦ₙ)} → learning algorithms → hypothesis function 𝑔 ≈ 𝑓

• The target function 𝑓 is unknown. Learning algorithms cannot obtain a perfect function 𝑓.
• The hypothesis function 𝑔 approximates function 𝑓 but may differ from it.
Main Problems Solved by Machine Learning

• Machine learning can deal with many types of tasks. The following describes the most typical and common types:
  - Classification: A computer program needs to specify which of the k categories some input belongs to. To accomplish this task, learning algorithms usually output a function 𝑓: ℝⁿ → {1, 2, …, 𝑘}. For example, the image classification algorithm in computer vision is developed to handle classification tasks.
  - Regression: For this type of task, a computer program predicts a continuous output for the given input. Learning algorithms typically output a function 𝑓: ℝⁿ → ℝ. An example of this task type is to predict the claim amount of an insured person (to set the insurance premium) or predict the security price.
  - Clustering: A large amount of data from an unlabeled dataset is divided into multiple categories according to internal similarity of the data. Data in the same category is more similar than data in different categories. This feature can be used in scenarios such as image retrieval and user profile management.
• Classification and regression are the two main types of prediction, accounting for 80% to 90% of prediction tasks. The output of classification is discrete category values, and the output of regression is continuous numbers.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Classification

• Supervised learning: Obtain an optimal model with the required performance through training and learning based on samples of known categories. Then, use the model to map all inputs to outputs and check the outputs for the purpose of classifying unknown data.
• Unsupervised learning: For unlabeled samples, the learning algorithms directly model the input datasets. Clustering is a common form of unsupervised learning. We only need to put highly similar samples together, calculate the similarity between new samples and existing ones, and classify them by similarity.
• Semi-supervised learning: In one task, a machine learning model automatically uses a large amount of unlabeled data to assist the learning of a small amount of labeled data.
• Reinforcement learning: It is an area of machine learning concerned with how agents ought to take actions in an environment to maximize some notion of cumulative reward. The difference between reinforcement learning and supervised learning is the teacher signal. The reinforcement signal provided by the environment in reinforcement learning is used to evaluate the action (scalar signal) rather than telling the learning system how to perform correct actions.
Supervised Learning

  Labeled data (features + label) → supervised learning algorithm → model

  Weather | Temperature | Wind Speed | Enjoy Sports
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | No
  Sunny   | Cold        | Weak       | Yes
Supervised Learning - Regression Questions

• Regression: discovers the dependency between attribute values of samples in a sample dataset by expressing the relationship of the sample mapping through functions.
  - How much will I benefit from the stock next week?
  - What's the temperature on Tuesday?
Supervised Learning - Classification Questions

• Classification: maps samples in a sample dataset to a specified category by using a classification model.
  - Will there be a traffic jam on XX road during the morning rush hour tomorrow?
  - Which method is more attractive to customers: 5 yuan voucher or 25% off?
Unsupervised Learning

  Unlabeled data (features only) → unsupervised learning algorithm → internal similarity

  Monthly Consumption | Commodity        | Consumption Time | Category
  1000–2000           | Badminton racket | 6:00–12:00       | Cluster 1
  500–1000            | Basketball       | 18:00–24:00      | Cluster 2
  1000–2000           | Game console     | 00:00–6:00       | Cluster 2
Unsupervised Learning - Clustering Questions

• Clustering: classifies samples in a sample dataset into several categories based on the clustering model. The similarity of samples belonging to the same category is high.
  - Which audiences like to watch movies of the same subject?
  - Which of these components are damaged in a similar way?
Semi-Supervised Learning

  Partially labeled data → semi-supervised learning algorithms → model

  Weather | Temperature | Wind Speed | Enjoy Sports
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | /
  Sunny   | Cold        | Weak       | /
Reinforcement Learning

• The model perceives the environment, takes actions, and makes adjustments and choices based on the status and the reward or punishment.

  Model ← (status 𝑠ₜ, reward or punishment 𝑟ₜ) ← Environment
  Model → (action 𝑎ₜ) → Environment, which returns 𝑠ₜ₊₁ and 𝑟ₜ₊₁
Reinforcement Learning - Best Behavior

• Reinforcement learning always looks for the best behaviors. It is typically targeted at machines or robots.
  - Autopilot: Should it brake or accelerate when the yellow light starts to flash?
  - Cleaning robot: Should it keep working or go back for charging?
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Process

  Data collection → Data cleansing → Feature extraction and selection → Model training → Model evaluation → Model deployment and integration
  (with feedback and iteration across the whole pipeline)
Basic Machine Learning Concept — Dataset

• Dataset: a collection of data used in machine learning tasks. Each data record is called a sample. Events or attributes that reflect the performance or nature of a sample in a particular aspect are called features.
• Training set: a dataset used in the training process, where each sample is referred to as a training sample. The process of creating a model from data is called learning (training).
• Test set: Testing refers to the process of using the model obtained after learning for prediction. The dataset used is called a test set, and each sample is called a test sample.
Checking Data Overview

• Typical dataset form (Features 1–3 plus label):

               No. | Area | School Districts | Direction | House Price
  Training set  1  | 100  | 8                | South     | 1000
                2  | 120  | 9                | Southwest | 1300
                3  | 60   | 6                | North     | 700
                4  | 80   | 9                | Southeast | 1100
  Test set      5  | 95   | 3                | South     | 850
Importance of Data Processing

• Data is crucial to models. It is the ceiling of model capabilities. Without good data, there is no good model.

Data preprocessing includes:
  - Data cleansing: fill in missing values, and detect and eliminate causes of dataset exceptions.
  - Data normalization: normalize data to reduce noise and improve model accuracy.
  - Data dimension reduction: simplify data attributes to avoid dimension explosion.
Workload of Data Cleansing

• Statistics on data scientists' work in machine learning (CrowdFlower Data Science Report 2016):
  - 60% Cleansing and sorting data
  - 19% Collecting datasets
  - 9% Mining patterns from data
  - 5% Others
  - 4% Optimizing models
  - 3% Remodeling training datasets
Data Cleansing

• Most machine learning models process features, which are usually numeric representations of input variables that can be used in the model.
• In most cases, the collected data can be used by algorithms only after being preprocessed. The preprocessing operations include the following:
  - Data filtering
  - Processing of missing data
  - Processing of possible exceptions, errors, or abnormal values
  - Combination of data from multiple data sources
  - Data consolidation
Dirty Data (1)

• Generally, real data may have some quality problems:
  - Incompleteness: contains missing values or data that lacks attributes.
  - Noise: contains incorrect records or exceptions.
  - Inconsistency: contains inconsistent records.
Dirty Data (2)

  #  | Id     | Name   | Birthday   | Gender | IsTeacher | #Students | Country     | City
  1  | 111    | John   | 31/12/1990 | M      | 0         | 0         | Ireland     | Dublin
  2  | 222    | Mery   | 15/10/1978 | F      | 1         | 15        | Iceland     |            ← missing value
  3  | 333    | Alice  | 19/04/2000 | F      | 0         | 0         | Spain       | Madrid
  4  | 444    | Mark   | 01/11/1997 | M      | 0         | 0         | France      | Paris
  5  | 555    | Alex   | 15/03/2000 | A      | 1         | 23        | Germany     | Berlin     ← invalid value (Gender "A")
  6  | 555    | Peter  | 1983-12-01 | M      | 1         | 10        | Italy       | Rome       ← invalid duplicate Id; incorrect date format
  7  | 777    | Calvin | 05/05/1995 | M      | 0         | 0         | Italy       | Italy      ← value that should be in another column
  8  | 888    | Roxane | 03/08/1948 | F      | 0         | 0         | Portugal    | Lisbon
  9  | 999    | Anne   | 05/09/1992 | F      | 0         | 5         | Switzerland | Geneva     ← attribute dependency violated (IsTeacher = 0 but #Students = 5)
  10 | 101010 | Paul   | 14/11/1992 | M      | 1         | 26        | Ytali       | Rome       ← misspelling ("Ytali")
Data Conversion

• After being preprocessed, the data needs to be converted into a representation form suitable for the machine learning model. Common data conversion forms include the following:
  - With respect to classification, category data is encoded into a corresponding numerical representation.
  - Value data is converted to category data to reduce the number of variable values (for example, age segmentation).
  - Other data:
    · In text, words are converted into word vectors through word embedding (generally using the word2vec model, BERT model, etc.).
    · Image data is processed (color space, grayscale, geometric change, Haar features, and image enhancement).
  - Feature engineering:
    · Normalize features to ensure the same value ranges for input variables of the same model.
    · Feature expansion: combine or convert existing variables to generate new features, such as the average.
Necessity of Feature Selection

• Generally, a dataset has many features, some of which may be redundant or irrelevant to the value to be predicted.
• Feature selection is necessary in the following aspects:
  - Simplify models to make them easy for users to interpret.
  - Reduce the training time.
  - Avoid dimension explosion.
  - Improve model generalization and avoid overfitting.
Feature Selection Methods - Filter

• Filter methods are independent of the model during feature selection. By evaluating the correlation between each feature and the target attribute, these methods use a statistical measure to assign a value to each feature. Features are then sorted by score, which is helpful for preserving or eliminating specific features.

Procedure of a filter method:
  Traverse all features → Select the optimal feature subset → Train models → Evaluate the performance

Common methods:
  - Pearson correlation coefficient
  - Chi-square coefficient
  - Mutual information

Limitations:
  - The filter method tends to select redundant variables because the relationships between features are not considered.
Feature Selection Methods - Wrapper

• Wrapper methods use a prediction model to score feature subsets. They consider feature selection as a search issue for which different combinations are evaluated and compared. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

Procedure of a wrapper method:
  Traverse all features → Generate a feature subset → Train models → Evaluate models → Select the optimal feature subset

Common methods:
  - Recursive feature elimination (RFE)

Limitations:
  - Wrapper methods train a new model for each subset, resulting in a huge number of computations.
  - A feature set with the best performance is usually provided for a specific type of model.
Feature Selection Methods - Embedded

• Embedded methods consider feature selection as a part of model construction. The most common type of embedded feature selection method is the regularization method. Regularization methods are also called penalization methods. They introduce additional constraints into the optimization of a predictive algorithm that bias the model toward lower complexity and reduce the number of features.

Procedure of an embedded method:
  Traverse all features → Generate a feature subset → Train models and evaluate the effect → Select the optimal feature subset

Common methods:
  - Lasso regression
  - Ridge regression
Overall Procedure of Building a Model

1. Data splitting: Divide data into training sets, test sets, and validation sets.
2. Model training: Use data that has been cleansed and feature-engineered to train a model.
3. Model verification: Use validation sets to validate the model validity.
4. Model test: Use test data to evaluate the generalization capability of the model in a real environment.
5. Model deployment: Deploy the model in an actual production scenario.
6. Model fine-tuning: Continuously tune the model based on the actual data of a service scenario.
Examples of Supervised Learning - Learning Phase

• Task: Use a classification model to predict whether a person is a basketball player based on specific features.

  Name     | City     | Age | Label
  Mike     | Miami    | 42  | yes     Training set (cleansed features and labels):
  Jerry    | New York | 32  | no      the model searches for the relationship
  Bryan    | Orlando  | 18  | no      between features and targets.
  Patricia | Miami    | 45  | yes
  Elodie   | Phoenix  | 35  | no      Test set: use new data to verify
  Remy     | Chicago  | 72  | yes     the model validity.
  John     | New York | 48  | yes

• Each feature or a combination of several features can provide a basis for a model to make a judgment.
Examples of Supervised Learning - Prediction Phase

• Unknown data (recent data; it is not known whether these people are basketball players):

  Name     | City    | Age | Label
  Marine   | Miami   | 45  | ?
  Julien   | Miami   | 52  | ?
  Fred     | Orlando | 20  | ?
  Michelle | Boston  | 34  | ?
  Nicolas  | Phoenix | 90  | ?

• Application model (rules learned in the training phase):
  IF city = Miami   → Probability = +0.7
  IF city = Orlando → Probability = +0.2
  IF age > 42       → Probability = +0.05*age + 0.06
  IF age ≤ 42       → Probability = +0.01*age + 0.02

• Possibility prediction: apply the model to the new data to predict whether each person is a basketball player.

  Name     | City    | Age | Prediction
  Marine   | Miami   | 45  | 0.3
  Julien   | Miami   | 52  | 0.9
  Fred     | Orlando | 20  | 0.6
  Michelle | Boston  | 34  | 0.5
  Nicolas  | Phoenix | 90  | 0.4
What Is a Good Model?

• Generalization capability: Can it accurately predict the actual service data?
• Interpretability: Is the prediction result easy to interpret?
• Prediction speed: How long does it take to predict each piece of data?
• Practicability: Is the prediction rate still acceptable when the service volume increases with a huge data volume?
Model Validity (1)

• Generalization capability: The goal of machine learning is that the model obtained after learning should perform well on new samples, not just on samples used for training. The capability of applying a model to new samples is called generalization or robustness.
• Error: difference between the sample result predicted by the model obtained after learning and the actual sample result.
  - Training error: error that you get when you run the model on the training data.
  - Generalization error: error that you get when you run the model on new samples. Obviously, we prefer a model with a smaller generalization error.
• Underfitting: occurs when the model or the algorithm does not fit the data well enough.
• Overfitting: occurs when the training error of the model obtained after learning is small but the generalization error is large (poor generalization capability).
Model Validity (2)

• Model capacity: the model's capability of fitting functions, which is also called model complexity.
  - When the capacity suits the task complexity and the amount of training data provided, the algorithm effect is usually optimal.
  - Models with insufficient capacity cannot solve complex tasks, and underfitting may occur.
  - A high-capacity model can solve complex tasks, but overfitting may occur if the capacity is higher than that required by a task.

[Figure: underfitting (not all features are learned), good fitting, and overfitting (noises are learned).]
Overfitting Cause — Error

• Total error of final prediction = Bias² + Variance + Irreducible error
• Generally, the prediction error can be divided into two types:
  - Error caused by "bias"
  - Error caused by "variance"
• Variance:
  - Offset of the prediction result from the average value.
  - Error caused by the model's sensitivity to small fluctuations in the training set.
• Bias:
  - Difference between the expected (or average) prediction value and the correct value we are trying to predict.
Variance and Bias

• Combinations of variance and bias are as follows:
  - Low bias & low variance → good model
  - Low bias & high variance
  - High bias & low variance
  - High bias & high variance → poor model
• Ideally, we want a model that can accurately capture the rules in the training data and also generalize to unseen (new) data. However, it is usually impossible for the model to complete both tasks at the same time.
Model Complexity and Error

• As the model complexity increases, the training error decreases.
• As the model complexity increases, the test error decreases to a certain point and then increases in the reverse direction, forming a convex curve.

[Figure: error vs. model complexity; the training error falls monotonically while the testing error is U-shaped, moving from the high-bias & low-variance regime to the low-bias & high-variance regime.]
Machine Learning Performance Evaluation - Regression

• The closer the Mean Absolute Error (MAE) is to 0, the better the model can fit the training data.

  $MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$

• Mean Square Error (MSE):

  $MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$

• The value range of R² is (−∞, 1]. A larger value indicates that the model can better fit the training data. TSS indicates the difference between samples, and RSS indicates the difference between the predicted value and the sample value.

  $R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$
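• Illustration (not from the original slides): a minimal NumPy sketch computing MAE, MSE, and R² as defined above; the sample arrays are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, and R^2 as defined above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - rss / tss
    return mae, mse, r2

# Toy example (values are illustrative only)
mae, mse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.5, 6.0])
print(f"MAE={mae:.3f}, MSE={mse:.3f}, R2={r2:.3f}")
```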
Machine Learning Performance Evaluation - Classification (1)

• Terms and definitions:
  - P: positive, indicating the number of real positive cases in the data.
  - N: negative, indicating the number of real negative cases in the data.
  - TP: true positive, indicating the number of positive cases that are correctly classified by the classifier.
  - TN: true negative, indicating the number of negative cases that are correctly classified by the classifier.
  - FP: false positive, indicating the number of negative cases that are incorrectly classified as positive by the classifier.
  - FN: false negative, indicating the number of positive cases that are incorrectly classified as negative by the classifier.

  Confusion matrix:

              Estimated yes | Estimated no | Total
  Actual yes  TP            | FN           | P
  Actual no   FP            | TN           | N
  Total       P′            | N′           | P + N

• Confusion matrix: at least an 𝑚 × 𝑚 table. 𝐶𝑀ᵢ,ⱼ of the first 𝑚 rows and 𝑚 columns indicates the number of cases that actually belong to class 𝑖 but are classified into class 𝑗 by the classifier.
  - Ideally, for a high-accuracy classifier, most prediction values should be located on the diagonal from 𝐶𝑀₁,₁ to 𝐶𝑀ₘ,ₘ of the table, while values outside the diagonal are 0 or close to 0. That is, FP and FN are close to 0.
Machine Learning Performance Evaluation - Classification (2)

  Measurement                                      | Ratio
  Accuracy and recognition rate                    | (TP + TN) / (P + N)
  Error rate and misclassification rate            | (FP + FN) / (P + N)
  Sensitivity, true positive rate, and recall      | TP / P
  Specificity and true negative rate               | TN / N
  Precision                                        | TP / (TP + FP)
  F1, harmonic mean of recall and precision        | (2 × precision × recall) / (precision + recall)
  Fβ, where β is a non-negative real number        | ((1 + β²) × precision × recall) / (β² × precision + recall)
Example of Machine Learning Performance Evaluation

• We have trained a machine learning model to identify whether the object in an image is a cat. Now we use 200 pictures to verify the model performance. Among the 200 images, objects in 170 images are cats, while others are not. The identification result of the model is that objects in 160 images are cats, while others are not.

  Confusion matrix:
              Estimated yes | Estimated no | Total
  Actual yes  140           | 30           | 170
  Actual no   20            | 10           | 30
  Total       160           | 40           | 200

  Precision: P = TP / (TP + FP) = 140 / (140 + 20) = 87.5%
  Recall:    R = TP / P = 140 / 170 ≈ 82.4%
  Accuracy:  ACC = (TP + TN) / (P + N) = (140 + 10) / (170 + 30) = 75%
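• Illustration (not from the original slides): reproducing the cat-classifier metrics from the confusion-matrix counts above.

```python
# Confusion-matrix counts from the cat example above.
TP, FN = 140, 30   # actual cats: 170
FP, TN = 20, 10    # actual non-cats: 30

precision = TP / (TP + FP)                   # 140/160 = 0.875
recall = TP / (TP + FN)                      # 140/170 ≈ 0.824
accuracy = (TP + TN) / (TP + FN + FP + TN)   # 150/200 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} F1={f1:.3f}")
```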
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Training Method - Gradient Descent (1)

• The gradient descent method uses the negative gradient direction of the current position as the search direction, which is the steepest descent direction. The formula is as follows:

  $w_{k+1} = w_k - \eta \nabla f_{w_k}(x^{i})$

• In the formula, 𝜂 indicates the learning rate and 𝑖 indicates the 𝑖-th data record. The weight parameter 𝑤 is updated in each iteration.
• Convergence: The value of the objective function changes very little, or the maximum number of iterations is reached.

[Figure: descent path on a cost surface.]
Machine Learning Training Method - Gradient Descent (2)

• Batch Gradient Descent (BGD) uses all samples (m in total) in the dataset to update the weight parameter based on the gradient value at the current point:

  $w_{k+1} = w_k - \frac{\eta}{m} \sum_{i=1}^{m} \nabla f_{w_k}(x^{i})$

• Stochastic Gradient Descent (SGD) randomly selects one sample in the dataset to update the weight parameter based on the gradient value at the current point:

  $w_{k+1} = w_k - \eta \nabla f_{w_k}(x^{i})$

• Mini-Batch Gradient Descent (MBGD) combines the features of BGD and SGD and uses the gradients of n samples in the dataset to update the weight parameter:

  $w_{k+1} = w_k - \frac{\eta}{n} \sum_{i=t}^{t+n-1} \nabla f_{w_k}(x^{i})$
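• Illustration (not from the original slides): a minimal NumPy sketch of the update rules above for linear least squares; the learning rate, batch size, and synthetic data are arbitrary choices. The MBGD update is shown; BGD and SGD follow by changing the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    """Gradient of the mean squared error loss on a batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w, eta, n = np.zeros(3), 0.05, 10                   # weights, learning rate, batch size
for k in range(300):
    # BGD would use the full (X, y); SGD a single random row;
    # MBGD (shown here) a random mini-batch of n samples.
    idx = rng.choice(len(y), size=n, replace=False)
    w -= eta * grad(w, X[idx], y[idx])

print("learned weights:", np.round(w, 2))           # close to [1.0, -2.0, 0.5]
```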
Machine Learning Training Method - Gradient Descent (3)

• Comparison of the three gradient descent methods:
  - In SGD, the samples selected for each training iteration are stochastic. Such instability causes the loss function to be unstable or even causes reverse displacement when the loss function decreases to the lowest point.
  - BGD has the highest stability but consumes too many computing resources. MBGD is a method that balances SGD and BGD.

  BGD:  uses all training samples for each update.
  SGD:  uses one training sample for each update.
  MBGD: uses a certain number of training samples for each update.
Parameters and Hyperparameters in Models

• A model contains not only parameters but also hyperparameters. The purpose of hyperparameters is to enable the model to learn the optimal parameters.
  - Parameters are automatically learned by models ("distilled" from data).
  - Hyperparameters are manually set and are used to control the training.
Hyperparameters of a Model

• Model hyperparameters are external configurations of models. They are:
  - Often used in model parameter estimation processes.
  - Often specified by the practitioner.
  - Often set using heuristics.
  - Often tuned for a given predictive modeling problem.

• Common model hyperparameters:
  - λ in Lasso/Ridge regression
  - Learning rate, number of iterations, batch size, activation function, and number of neurons for training a neural network
  - 𝐶 and 𝜎 in support vector machines (SVM)
  - K in k-nearest neighbors (KNN)
  - Number of trees in a random forest
Hyperparameter Search Procedure and Method

Procedure for searching hyperparameters:
1. Divide the dataset into a training set, validation set, and test set.
2. Optimize the model parameters using the training set based on the model performance indicators.
3. Search for the model hyperparameters using the validation set based on the model performance indicators.
4. Perform step 2 and step 3 alternately. Finally, determine the model parameters and hyperparameters, and assess the model using the test set.

Search algorithms (step 3):
  - Grid search
  - Random search
  - Heuristic intelligent search
  - Bayesian search
Hyperparameter Searching Method - Grid Search

• Grid search attempts to exhaustively search all possible hyperparameter combinations, which form a grid of hyperparameter values.
• In practice, the range of hyperparameter values to search is specified manually.
• Grid search is an expensive and time-consuming method.
  - This method works well when the number of hyperparameters is relatively small. Therefore, it is generally applicable to machine learning algorithms but inapplicable to neural networks (see the deep learning part).
Hyperparameter Searching Method - Random Search

• When the hyperparameter search space is large, random search is better than grid search.
• In random search, each setting is sampled from the distribution of possible parameter values, in an attempt to find the best subset of hyperparameters.
• Note:
  - Search is performed within a coarse range, which then will be narrowed based on where the best result appears.
  - Some hyperparameters are more important than others, and the search deviation will be affected during random search.
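• Illustration (not from the original slides; assumes scikit-learn and SciPy are available): grid search versus random search over the SVM hyperparameters C and gamma.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively tries every combination on the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_)

# Random search: samples 10 settings from continuous distributions.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_)
```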
Cross Validation (1)

• Cross validation: a statistical analysis method used to validate the performance of a classifier. The basic idea is to divide the original dataset into two parts, a training set and a validation set: train the classifier using the training set and test the model using the validation set to check the classifier performance.
• k-fold cross validation (𝐾-𝐶𝑉):
  - Divide the raw data into 𝑘 groups (generally, evenly divided).
  - Use each subset in turn as a validation set, and use the other 𝑘 − 1 subsets as the training set. A total of 𝑘 models can be obtained.
  - Use the mean classification accuracy of the 𝑘 models on their validation sets as the performance indicator of the 𝐾-𝐶𝑉 classifier.
Cross Validation (2)

  Entire dataset: [ Training set                  | Test set ]
                  [ Training set | Validation set | Test set ]

• Note: The K value in K-fold cross validation is also a hyperparameter.
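• Illustration (not from the original slides; assumes scikit-learn): k-fold cross validation that averages validation accuracy over the k folds, as described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on held-out fold

print("mean validation accuracy:", np.mean(scores))
```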
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithm Overview

Supervised learning:
  - Classification: logistic regression, SVM, neural network, decision tree, random forest, GBDT, KNN, naive Bayes
  - Regression: linear regression, SVM, neural network, decision tree, random forest, GBDT

Unsupervised learning:
  - Clustering: K-means, hierarchical clustering, density-based clustering
  - Others: correlation rule, principal component analysis (PCA), Gaussian mixture model (GMM)
Linear Regression (1)

• Linear regression: a statistical analysis method to determine the quantitative relationships between two or more variables through regression analysis in mathematical statistics.
• Linear regression is a type of supervised learning.

[Figure: unary linear regression (a fitted line in 2D) vs. multi-dimensional linear regression (a fitted plane in 3D).]
Linear Regression (2)

• The model function of linear regression is as follows, where 𝑤 indicates the weight parameter, 𝑏 indicates the bias, and 𝑥 indicates the sample attribute:

  $h_w(x) = w^T x + b$

• The relationship between the value predicted by the model and the actual value is as follows, where 𝑦 indicates the actual value and 𝜀 indicates the error:

  $y = w^T x + b + \varepsilon$

• The error 𝜀 is influenced by many factors independently. According to the central limit theorem, the error 𝜀 follows a normal distribution. According to the normal distribution function and maximum likelihood estimation, the loss function of linear regression is as follows:

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2$

• To make the predicted value close to the actual value, we need to minimize the loss value. We can use the gradient descent method to calculate the weight parameter 𝑤 that minimizes the loss function and thereby complete model building.
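• Illustration (not from the original slides): fitting the linear model above by gradient descent on the loss J(w); the data is synthetic with true w = 3 and b = 4.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 4.0 + rng.normal(scale=1.0, size=50)  # true w=3, b=4 plus noise

w, b, eta = 0.0, 0.0, 0.01
for _ in range(2000):
    pred = w * x + b
    # Gradients of J(w, b) = (1/2m) * sum((pred - y)^2)
    dw = np.mean((pred - y) * x)
    db = np.mean(pred - y)
    w -= eta * dw
    b -= eta * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # should approach 3 and 4
```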
Linear Regression Extension - Polynomial Regression

• Polynomial regression is an extension of linear regression. Generally, the complexity of a dataset exceeds what can be fitted by a straight line; that is, obvious underfitting occurs if the original linear regression model is used. The solution is to use polynomial regression:

  $h_w(x) = w_1 x + w_2 x^2 + \cdots + w_n x^n + b$

• Here, the nth power is the polynomial regression dimension (degree).
• Polynomial regression belongs to linear regression: the relationship between its weight parameters 𝑤 is still linear, while its nonlinearity is reflected in the feature dimension.

[Figure: comparison between linear regression and polynomial regression fits.]
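• Illustration (not from the original slides; assumes scikit-learn): polynomial regression as a degree-3 feature expansion followed by ordinary linear regression, which is exactly why the model stays linear in w.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=60)).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=60)  # cubic data

# Degree-3 feature expansion, then a linear model over the new features.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", round(model.score(x, y), 3))
```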
Linear Regression and Overfitting Prevention

• Regularization terms can be used to reduce overfitting. The value of 𝑤 cannot be too large or too small in the sample space. You can add a squared-sum penalty to the target function:

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2 + \lambda \lVert w \rVert_2^2$

• Regularization terms (norm): the regularization term here is called the L2-norm. Linear regression that uses this loss function is also called Ridge regression.

  $J(w) = \frac{1}{2m} \sum \left( h_w(x) - y \right)^2 + \lambda \lVert w \rVert_1$

• Linear regression with the absolute-value (L1-norm) penalty is called Lasso regression.
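• Illustration (not from the original slides; assumes scikit-learn): Ridge and Lasso on the same data; Lasso's L1 penalty drives useless coefficients exactly to zero, while Ridge's L2 penalty only shrinks them. The alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=80)  # 3 useless features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sparse coefficients

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))  # useless features ≈ 0 exactly
```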
Logistic Regression (1)

• Logistic regression: The logistic regression model is used to solve classification problems. The model is defined as follows:

  $P(Y = 1 \mid x) = \frac{e^{wx+b}}{1 + e^{wx+b}}$

  $P(Y = 0 \mid x) = \frac{1}{1 + e^{wx+b}}$

  where 𝑤 indicates the weight, 𝑏 indicates the bias, and 𝑤𝑥 + 𝑏 is regarded as the linear function of 𝑥. Compare the preceding two probability values. The class with a higher probability value is the class of 𝑥.
Logistic Regression (2)

• Both the logistic regression model and linear regression model are generalized linear models. Logistic regression introduces a nonlinear factor (the sigmoid function) based on linear regression and sets a threshold, so it can deal with binary classification problems.
• According to the model function of logistic regression, the loss function of logistic regression can be estimated as follows by using maximum likelihood estimation:

  $J(w) = -\frac{1}{m} \sum \left( y \ln h_w(x) + (1 - y) \ln(1 - h_w(x)) \right)$

• Here, 𝑤 indicates the weight parameter, 𝑚 indicates the number of samples, 𝑥 indicates the sample, and 𝑦 indicates the real value. The values of all the weight parameters 𝑤 can also be obtained through the gradient descent algorithm.
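• Illustration (not from the original slides): the sigmoid model trained by gradient descent on the cross-entropy loss J(w) above, using synthetic linearly separable data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)  # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)                # P(Y = 1 | x)
    # Gradient of the cross-entropy loss J(w) above
    w -= eta * X.T @ (p - y) / len(y)
    b -= eta * np.mean(p - y)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```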
Logistic Regression Extension - Softmax Function (1)

• Logistic regression applies only to binary classification problems (for example, male or female). For multi-class classification problems (for example, grape, orange, apple, or banana), use the Softmax function.
Logistic Regression Extension - Softmax Function (2)

• Softmax regression is a generalization of logistic regression that we can use for K-class classification.
• The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-dimensional vector of real values, where each vector element is in the interval (0, 1).
• The regression probability function of Softmax is as follows:

  $p(y = k \mid x; w) = \frac{e^{w_k^T x}}{\sum_{l=1}^{K} e^{w_l^T x}}, \quad k = 1, 2, \ldots, K$
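• Illustration (not from the original slides): the Softmax mapping itself; subtracting the maximum is a standard numerical-stability trick.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional score vector to probabilities that sum to 1."""
    z = z - np.max(z)            # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.1, -1.0])   # arbitrary class scores
probs = softmax(scores)
print(np.round(probs, 3), "sum =", probs.sum())
```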
Logistic Regression Extension - Softmax Function (3)

• Softmax assigns a probability to each class in a multi-class problem. These probabilities must add up to 1.
  - Softmax may produce the probability of belonging to a particular class. Example:

  Category | Probability
  Grape?   | 0.09
  Orange?  | 0.22
  Apple?   | 0.68
  Banana?  | 0.01

  Sum of all probabilities: 0.09 + 0.22 + 0.68 + 0.01 = 1. Most probably, this picture is an apple.
Decision Tree

• A decision tree is a tree structure (a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute. Each branch represents the output of a feature attribute in a certain value range, and each leaf node stores a category. To use the decision tree, start from the root node, test the feature attributes of the items to be classified, select the output branches, and use the category stored on the leaf node as the final result.

[Figure: an animal-classification decision tree. The root splits on height (short/tall). Short animals split on whether they can squeak (might be a rat) or cannot squeak (might be a squirrel). Tall animals split on neck length (long neck: might be a giraffe), then on nose length (long nose: might be an elephant), then on habitat (on land: might be a rhinoceros; in water: might be a hippo).]
Decision Tree Structure

  Root node
  ├─ Internal node
  │  ├─ Leaf node
  │  └─ Leaf node
  └─ Internal node
     ├─ Leaf node
     └─ Internal node
        ├─ Leaf node
        └─ Leaf node
Key Points of Decision Tree Construction

• To create a decision tree, we need to select attributes and determine the tree structure between feature attributes. The key step of constructing a decision tree is to divide the data on each feature attribute, compare the resulting sets in terms of 'purity', and select the attribute with the highest 'purity' as the data point for dataset division.
• The metrics to quantify 'purity' include the information entropy and the Gini index:

  $H(X) = -\sum_{k=1}^{K} p_k \log_2 p_k$

  $\text{Gini} = 1 - \sum_{k=1}^{K} p_k^2$

• Here, 𝑝ₖ indicates the probability that the sample belongs to class 𝑘 (there are 𝐾 classes in total). A greater difference between purity before segmentation and that after segmentation indicates a better decision tree.
• Common decision tree algorithms include ID3, C4.5, and CART.
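• Illustration (not from the original slides): computing the two purity metrics from the class counts at a node.

```python
import numpy as np

def entropy(counts):
    """Information entropy H(X) from class counts at a node."""
    p = np.asarray(counts, float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    if p.size <= 1:
        return 0.0                  # pure node
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini index from class counts at a node."""
    p = np.asarray(counts, float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([5, 5]), gini([5, 5]))    # impure node: 1.0, 0.5
print(entropy([10, 0]), gini([10, 0]))  # pure node:   0.0, 0.0
```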
Decision Tree Construction Process

• Feature selection: Select a feature from the features of the training data as the split standard of the current node. (Different standards generate different decision tree algorithms.)
• Decision tree generation: Generate internal nodes from top to bottom based on the selected features and stop when the dataset can no longer be split.
• Pruning: The decision tree may easily become overfitting unless necessary pruning (including pre-pruning and post-pruning) is performed to reduce the tree size and optimize its node structure.
Decision Tree Example

• The following shows a classification when a decision tree is used. The classification result is impacted by three attributes: Refund, Marital Status, and Taxable Income.

  Tid | Refund | Marital Status | Taxable Income | Cheat
  1   | Yes    | Single         | 125,000        | No
  2   | No     | Married        | 100,000        | No
  3   | No     | Single         | 70,000         | No
  4   | Yes    | Married        | 120,000        | No
  5   | No     | Divorced       | 95,000         | Yes
  6   | No     | Married        | 60,000         | No
  7   | Yes    | Divorced       | 220,000        | No
  8   | No     | Single         | 85,000         | Yes
  9   | No     | Married        | 75,000         | No
  10  | No     | Single         | 90,000         | Yes

[Figure: the learned tree first tests Refund (Yes → No), then Marital Status, then Taxable Income.]
SVM

• SVM is a binary classification model whose basic model is a linear classifier defined in the eigenspace with the largest interval (margin). SVMs also include kernel tricks that make them nonlinear classifiers. The SVM learning algorithm is the optimal solution to a convex quadratic programming problem.

[Figure: a projection maps data that is complex to segment in low-dimensional space into a high-dimensional space where segmentation is easy.]
Linear SVM (1)

• How do we split the red and blue datasets by a straight line?

[Figure: a two-dimensional dataset with binary classification; both the left and right splitting lines can be used to divide the datasets. Which of them is correct?]
Linear SVM (2)

• Straight lines are used to divide data into different classes. Actually, we can use multiple straight lines to divide data. The core idea of the SVM is to find a straight line that keeps the points closest to it as far away from it as possible. This gives the model strong generalization capability. These closest points are called support vectors.
• In two-dimensional space, we use straight lines for segmentation. In high-dimensional space, we use hyperplanes for segmentation.

[Figure: the distance between support vectors on either side of the separating line is as large as possible.]
Nonlinear SVM (1)

• How do we classify a nonlinearly separable dataset?

[Figure: a linear SVM works well for linearly separable datasets, but nonlinear datasets cannot be split with straight lines.]
Nonlinear SVM (2)

• Kernel functions are used to construct nonlinear SVMs.
• Kernel functions allow algorithms to fit the largest-margin hyperplane in a transformed high-dimensional feature space.

Common kernel functions:
  - Linear kernel function
  - Polynomial kernel function
  - Gaussian kernel function
  - Sigmoid kernel function

[Figure: the input space is mapped to a high-dimensional feature space where the classes become linearly separable.]
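• Illustration (not from the original slides; assumes scikit-learn): a linear kernel versus a Gaussian (RBF) kernel on data that is not linearly separable.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to split with a straight line.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))
# The RBF kernel separates the circles almost perfectly; the linear kernel cannot.
```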
KNN Algorithm (1)

• The KNN classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. According to this method, if the majority of the k samples most similar to one sample (its nearest neighbors in the eigenspace) belong to a specific category, this sample also belongs to that category.

[Figure: the target category of a query point "?" varies with the number of the most adjacent nodes considered.]
KNN Algorithm (2)

• As the prediction result is determined based on the number and weights of neighbors in the training set, the KNN algorithm has a simple logic.
• KNN is a non-parametric method, which is usually used in datasets with irregular decision boundaries.
  - The KNN algorithm generally adopts the majority voting method for classification prediction and the average value method for regression prediction.
• KNN requires a huge number of computations.
KNN Algorithm (3)

• Generally, a larger k value reduces the impact of noise on classification, but obfuscates the boundary between classes.
  - A larger k value means a higher probability of underfitting because the segmentation is too rough. A smaller k value means a higher probability of overfitting because the segmentation is too refined.
• The boundary becomes smoother as the value of k increases. As the k value increases to infinity, all data points will eventually become all blue or all red.
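• Illustration (not from the original slides; assumes scikit-learn): the effect of k; a very small k fits the training data almost perfectly (overfitting risk), while a large k smooths the boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}: train={knn.score(X_tr, y_tr):.2f} "
          f"test={knn.score(X_te, y_te):.2f}")
```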
Naive Bayes (1)

• Naive Bayes algorithm: a simple multi-class classification algorithm based on the Bayes theorem. It assumes that features are independent of each other. For a given sample feature 𝑋, the probability that the sample belongs to a category 𝐶ₖ is:

  $P(C_k \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid C_k) \, P(C_k)}{P(X_1, \ldots, X_n)}$

  - 𝑋₁, …, 𝑋ₙ are data features, which are usually described by measurement values of m attribute sets.
    · For example, the color feature may have three attributes: red, yellow, and blue.
  - 𝐶ₖ indicates that the data belongs to a specific category 𝐶.
  - 𝑃(𝐶ₖ | 𝑋₁, …, 𝑋ₙ) is the posterior probability of category 𝐶ₖ given the features.
  - 𝑃(𝐶ₖ) is a prior probability that is independent of 𝑋₁, …, 𝑋ₙ.
  - 𝑃(𝑋₁, …, 𝑋ₙ) is the prior probability of 𝑋 (the evidence).
Naive Bayes (2)

• Independence assumption of features:
  - For example, if a fruit is red, round, and about 10 cm (3.94 in.) in diameter, it can be considered an apple.
  - A Naive Bayes classifier considers that each feature independently contributes to the probability that the fruit is an apple, regardless of any possible correlation between the color, roundness, and diameter.
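• Illustration (not from the original slides; assumes scikit-learn): a Gaussian Naive Bayes classifier, which applies the Bayes formula above with per-class Gaussian estimates for each feature.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)   # estimates P(C_k) and P(X_i | C_k)
print("test accuracy:", round(nb.score(X_te, y_te), 3))
print("class posteriors for one sample:", nb.predict_proba(X_te[:1]).round(3))
```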
Ensemble Learning

• Ensemble learning is a machine learning paradigm in which multiple learners are trained and combined to solve the same problem. When multiple learners are used, the integrated generalization capability can be much stronger than that of a single learner.
• If you ask a complex question to thousands of people at random and then summarize their answers, the summarized answer is better than an expert's answer in most cases. This is the wisdom of the masses.

  Training set → Dataset 1, Dataset 2, …, Dataset m → Model 1, Model 2, …, Model m → model synthesis → large model
Classification of Ensemble Learning

• Bagging (random forest):
  - Independently builds several basic learners and then averages their predictions.
  - On average, a composite learner is usually better than a single-base learner because of a smaller variance.
• Boosting (AdaBoost, GBDT, and XGBoost):
  - Constructs basic learners in sequence to gradually reduce the bias of a composite learner.
  - The composite learner can fit data well, which may also cause overfitting.
Ensemble Methods in Machine Learning (1)

• Random forest = Bagging + CART decision tree
• Random forests build multiple decision trees and merge them together to make predictions more accurate and stable.
  - Random forests can be used for classification and regression problems.

  Bootstrap sampling → decision tree building → aggregation of prediction results:
  All training data → data subset 1, 2, …, n → prediction 1, 2, …, n → final prediction
  (category: majority voting; regression: average value)
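• Illustration (not from the original slides; assumes scikit-learn): a single CART tree versus a bagged ensemble of trees (a random forest) on the same data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test accuracy:", round(tree.score(X_te, y_te), 3))
print("random forest test accuracy:", round(forest.score(X_te, y_te), 3))
# The averaged ensemble usually generalizes better (lower variance).
```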
Ensemble Methods in Machine Learning (2)

• GBDT is a type of boosting algorithm.
• For an aggregative model, the sum of the results of all the basic learners equals the predicted value. In essence, the next basic learner fits the residual of the error function with respect to the current predicted value. (The residual is the error between the predicted value and the actual value.)
• During model training, GBDT requires that the sample loss for model prediction be as small as possible.

  Example (predicting an age of 30 years):
  Learner 1 predicts 20 years old → residual 10 years.
  Learner 2 predicts 9 of the residual 10 years → residual 1 year.
  Learner 3 predicts the remaining 1 year → total prediction 20 + 9 + 1 = 30 years old.
Unsupervised Learning - K-means

• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• For the k-means algorithm, specify the final number of clusters (k). Then, divide the n data objects into k clusters. The clusters obtained meet the following conditions: (1) objects in the same cluster are highly similar; (2) the similarity of objects in different clusters is small.

[Figure: untagged two-dimensional data before and after k-means clustering; the algorithm automatically classifies the dataset.]
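• Illustration (not from the original slides; assumes scikit-learn): clustering unlabeled 2-D points into k = 3 groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: three blobs in 2-D space (ground truth is ignored).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_.round(2))
print("first 10 assignments:", km.labels_[:10])
```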
Unsupervised Learning - Hierarchical Clustering

• Hierarchical clustering divides a dataset at different layers and forms a tree-like clustering structure. The dataset division may use a "bottom-up" aggregation policy (agglomerative hierarchical clustering) or a "top-down" splitting policy (divisive hierarchical clustering). The hierarchy of clustering is represented in a tree graph. The root is the unique cluster of all samples, and the leaves are clusters of single samples.
Contents

1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Comprehensive Case

• Assume that there is a dataset containing the house areas and prices of 21,613 housing units sold in a city. Based on this data, we can predict the prices of other houses in the city.

  House Area | Price
  1,180      | 221,900
  2,570      | 538,000
  770        | 180,000
  1,960      | 604,000
  1,680      | 510,000
  5,420      | 1,225,000
  1,715      | 257,500
  1,060      | 291,850
  1,160      | 468,000
  1,430      | 310,000
  1,370      | 400,000
  1,810      | 530,000
  …          | …
Problem Analysis

• This case contains a large amount of data, including input x (house area) and output y (price), which is a continuous value. We can use regression of supervised learning. Draw a scatter chart based on the data and use linear regression.
• Our goal is to build a model function h(x) that infinitely approximates the function that expresses the true distribution of the dataset.
• Then, use the model to predict unknown price data.

  Input x (feature: house area) → dataset → learning algorithm → h(x) → output y (label: price)

  Unary linear regression function: $h(x) = w_0 + w_1 x$
Goal of Linear Regression

• Linear regression aims to find a straight line that best fits the dataset.
• Linear regression is a parameter-based model. Here, we need to learn the parameters 𝑤₀ and 𝑤₁. When these two parameters are found, the best model appears.

[Figure: several candidate lines h(x) = w₀ + w₁x through the price vs. house area scatter plot. Which line fits best?]
Loss Function of Linear Regression

• To find the optimal parameters, construct a loss function and find the parameter values that minimize it.

  Loss function of linear regression:

  $J(w) = \frac{1}{2m} \sum \left( h(x) - y \right)^2$

  Goal:

  $\arg\min_{w} J(w) = \frac{1}{2m} \sum \left( h(x) - y \right)^2$

• Here, m indicates the number of samples, h(x) indicates the predicted value, and y indicates the actual value.

[Figure: the vertical gaps (errors) between each scatter point and the fitted line.]
Gradient Descent Method

• The gradient descent algorithm finds the minimum value of a function through iteration.
• It starts from a random initial point on the loss function and then moves in the negative gradient direction to find the global minimum value of the loss function. The parameter values there are the optimal parameter values.
  - Point A: the position of 𝑤₀ and 𝑤₁ after random initialization. 𝑤₀ and 𝑤₁ are the required parameters.
  - A-B connection line: a track formed by descents in the negative gradient direction. Upon each descent, the values of 𝑤₀ and 𝑤₁ change, and the regression line also changes.
  - Point B: the global minimum value of the loss function, where the final values of 𝑤₀ and 𝑤₁ are found.
Iteration Example

• The following is an example of a gradient descent iteration. As the red points on the loss function surface gradually approach the lowest point, the fit of the linear regression line to the data becomes better and better. At this time, we get the best parameters.
Model Debugging and Application

• After the model is trained, test it with the test set to ensure the generalization capability. The final model result in this case is:

  h(x) = 280.62x − 43581

• If overfitting occurs, use Lasso regression or Ridge regression with regularization terms and tune the hyperparameters.
• If underfitting occurs, use a more complex regression model, such as GBDT.
• Note:
  - For real data, pay attention to data cleansing and feature engineering.
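• Illustration (not from the original slides): an end-to-end sketch of this case using the sample rows from the dataset table above (the full 21,613-row dataset is not reproduced); it fits h(x) = w₀ + w₁x with least squares and predicts the price of a hypothetical house.

```python
import numpy as np

# A few (area, price) pairs from the dataset table above.
area = np.array([1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1160, 1430])
price = np.array([221900, 538000, 180000, 604000, 510000, 1225000,
                  257500, 291850, 468000, 310000])

# Least-squares fit of h(x) = w0 + w1*x via the normal equation.
X = np.column_stack([np.ones_like(area), area])
w0, w1 = np.linalg.lstsq(X, price.astype(float), rcond=None)[0]
print(f"h(x) = {w1:.2f}x + {w0:.2f}")

# Predict the price of a hypothetical 2,000 sq ft house.
print("predicted price for area 2000:", round(w0 + w1 * 2000))
```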
Summary

• First, this course describes the definition and classification of machine learning, as well as the problems machine learning solves. Then, it introduces the key knowledge points of machine learning, including the overall procedure (data collection, data cleansing, feature extraction, model training, model evaluation, and model deployment), common algorithms (linear regression, logistic regression, decision tree, SVM, naive Bayes, KNN, ensemble learning, K-means, etc.), the gradient descent algorithm, and parameters and hyperparameters.
• Finally, a complete machine learning process is presented by a case of using linear regression to predict house prices.
Quiz

1. (True or false) Gradient descent iteration is the only method of machine learning algorithms. ( )
   A. True
   B. False

2. (Single-answer question) Which of the following algorithms is not supervised learning? ( )
   A. Linear regression
   B. Decision tree
   C. KNN
   D. K-means
Recommendations

• Online learning website
  - https://e.huawei.com/en/talent/#/
• Huawei Knowledge Base
  - https://support.huawei.com/enterprise/en/knowledge?lang=en
Thank you.

Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.