Huawei Confidential
Objectives
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithms (1)
- Machine learning (including deep learning) is a study of learning algorithms. A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃 if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.
Machine Learning Algorithms (2)
Differences Between Machine Learning Algorithms and Traditional Rule-Based Algorithms
[Figure: a rule-based algorithm applies hand-written rules, while machine learning derives its rules from training data]
Application Scenarios of Machine Learning (1)
- The solution to a problem is complex, or the problem may involve a large amount of data without a clear data distribution function.
Application Scenarios of Machine Learning (2)
[Figure: rule complexity (simple to complex) vs. scale of the problem (small to large) — small-scale problems with simple rules suit manual rules, small-scale problems with complex rules suit rule-based algorithms, and large-scale problems suit machine learning]
Rational Understanding of Machine Learning Algorithms
[Figure: the ideal target equation 𝑓: 𝑋 → 𝑌 is approximated by the actual hypothesis function 𝑔 ≈ 𝑓, which learning algorithms produce from training data 𝐷: {(𝑥1, 𝑦1), ⋯ , (𝑥𝑛, 𝑦𝑛)}]
Main Problems Solved by Machine Learning
- Machine learning can deal with many types of tasks. The following describes the most typical and common types of tasks.
  - Classification: A computer program needs to specify which of the 𝑘 categories some input belongs to. To accomplish this task, learning algorithms usually output a function 𝑓: ℝⁿ → {1, 2, …, 𝑘}. For example, the image classification algorithm in computer vision is developed to handle classification tasks.
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Classification
- Supervised learning: Obtain an optimal model with required performance through training and learning based on the samples of known categories. Then, use the model to map all inputs to outputs and check the outputs for the purpose of classifying unknown data.
Supervised Learning
[Figure: a supervised learning algorithm maps data features (Feature 1 … Feature n) to a labeled goal]

  Weather | Temperature | Wind Speed | Enjoy Sports
  --------|-------------|------------|-------------
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | No
  Sunny   | Cold        | Weak       | Yes
Supervised Learning - Regression Questions
- Regression: models the attribute values of samples in a sample dataset. The dependency between attribute values is discovered by expressing the sample mapping relationship through continuous functions.
  - How much will I benefit from the stock next week?
  - What's the temperature on Tuesday?
Supervised Learning - Classification Questions
- Classification: maps samples in a sample dataset to a specified category by using a classification model.
  - Will there be a traffic jam on XX road during the morning rush hour tomorrow?
  - Which method is more attractive to customers: a 5 yuan voucher or 25% off?
Unsupervised Learning
[Figure: an unsupervised learning algorithm groups data features (Feature 1 … Feature n) by internal similarity]

  Monthly Consumption | Commodity Category | Consumption Time | Cluster
  --------------------|--------------------|------------------|----------
  1000–2000           | Badminton racket   | 6:00–12:00       | Cluster 1
  500–1000            | Basketball         | 18:00–24:00      | Cluster 2
  1000–2000           | Game console       | 00:00–6:00       |
Unsupervised Learning - Clustering Questions
- Clustering: classifies samples in a sample dataset into several categories based on the clustering model. The similarity of samples belonging to the same category is high.
  - Which audiences like to watch movies of the same subject?
  - Which of these components are damaged in a similar way?
Semi-Supervised Learning
[Figure: semi-supervised learning algorithms are trained on data features (Feature 1 … Feature n) whose labels are partly unknown]

  Weather | Temperature | Wind Speed | Enjoy Sports
  --------|-------------|------------|-------------
  Sunny   | Warm        | Strong     | Yes
  Rainy   | Cold        | Fair       | /
  Sunny   | Cold        | Weak       | /
Reinforcement Learning
- The model perceives the environment, takes actions, and makes adjustments and choices based on the status and the reward or punishment.
[Figure: the model receives status 𝑠𝑡 and reward or punishment 𝑟𝑡 from the environment, takes action 𝑎𝑡, and the environment returns 𝑠𝑡+1 and 𝑟𝑡+1]
Reinforcement Learning - Best Behavior
- Reinforcement learning always looks for the best behaviors. Reinforcement learning is targeted at machines or robots.
  - Autopilot: Should it brake or accelerate when the yellow light starts to flash?
  - Cleaning robot: Should it keep working or go back for charging?
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Process
[Figure: data collection → data cleansing → feature extraction and selection → model training → model evaluation → model deployment and integration, with feedback and iteration throughout]
Basic Machine Learning Concept — Dataset
- Dataset: a collection of data used in machine learning tasks. Each data record is called a sample. Events or attributes that reflect the performance or nature of a sample in a particular aspect are called features.
Checking Data Overview
- Typical dataset form
[Table fragment of a training set; the column headers were not recovered:
  2 | 120 | 9 | Southwest | 1300
  3 | 60  | 6 | North     | 700
  4 | 80  | 9 | Southeast | 1100]
Importance of Data Processing
- Data is crucial to models. It is the ceiling of model capabilities. Without good data, there is no good model.
[Figure: data preprocessing covers data cleansing, data normalization, and data dimension reduction, which simplifies data attributes to avoid dimension explosion]
Workload of Data Cleansing
- Statistics on data scientists' work in machine learning:
  - 19% Collecting datasets
  - 5% Others
  - 4% Optimizing models
  - 3% Remodeling training datasets
Data Cleansing
- Most machine learning models process features, which are usually numeric representations of input variables that can be used in the model.
Dirty Data (1)
- Generally, real data may have some quality problems.
  - Incompleteness: contains missing values or data that lacks attributes.
  - Noise: contains incorrect records or exceptions.
Dirty Data (2)
[Table fragment with columns #Id, Name, Birthday, Gender, IsTeacher, #Students, Country, City; annotated problems include incorrect formats and attribute dependency]
Data Conversion
- After being preprocessed, the data needs to be converted into a representation form suitable for the machine learning model. Common data conversion forms include the following:
  - With respect to classification, category data is encoded into a corresponding numerical representation.
  - Value data is converted to category data to reduce the number of distinct values of a variable (for example, age segmentation).
  - Other data
  - Feature engineering:
    - Normalize features to ensure the same value ranges for input variables of the same model.
    - Feature expansion: Combine or convert existing variables to generate new features, such as the average.
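The conversions above can be sketched in a few lines of plain Python (the feature values and cut-off ages here are invented for illustration):

```python
# One-hot encode a categorical feature into a numerical representation.
weather = ["Sunny", "Rainy", "Sunny"]
categories = sorted(set(weather))                      # ['Rainy', 'Sunny']
one_hot = [[1 if v == c else 0 for c in categories] for v in weather]

# Min-max normalize a numeric feature so every value falls in [0, 1].
temps = [10.0, 25.0, 40.0]
lo, hi = min(temps), max(temps)
normalized = [(t - lo) / (hi - lo) for t in temps]

# Bin a numeric feature into categories (age segmentation).
ages = [15, 34, 71]
segments = ["child" if a < 18 else "adult" if a < 65 else "senior" for a in ages]
```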
Necessity of Feature Selection
- Generally, a dataset has many features, some of which may be redundant or irrelevant to the value to be predicted.
- Benefits of feature selection:
  - Simplify models to make them easy for users to interpret.
  - Reduce the training time.
  - Avoid dimension explosion.
  - Improve model generalization and avoid overfitting.
Feature Selection Methods - Filter
- Filter methods are independent of the model during feature selection. By evaluating the correlation between each feature and the target attribute, these methods use a statistical measure to assign a score to each feature. Features are then sorted by score, which is helpful for preserving or eliminating specific features.
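As a minimal sketch of a filter method (the toy features and scoring choice, absolute Pearson correlation, are illustrative assumptions, not the slide's prescribed measure):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: feature_a tracks the target closely, feature_b is mostly noise.
target    = [1.0, 2.0, 3.0, 4.0]
feature_a = [2.1, 3.9, 6.0, 8.2]
feature_b = [5.0, 1.0, 4.0, 2.0]

# Score each feature by |correlation| with the target, then rank.
scores = {"a": abs(pearson(feature_a, target)), "b": abs(pearson(feature_b, target))}
ranked = sorted(scores, key=scores.get, reverse=True)
```

No model is trained at any point, which is what makes this a filter rather than a wrapper method.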
Feature Selection Methods - Wrapper
- Wrapper methods use a prediction model to score feature subsets.
Feature Selection Methods - Embedded
- Embedded methods consider feature selection as a part of model construction. The most common type of embedded feature selection method is the regularization method. Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm that bias the model toward lower complexity and reduce the number of features.
[Figure: procedure of an embedded method — traverse all features, generate a feature subset, train models and evaluate the effect, and select the optimal feature subset]
- Common methods:
  - Lasso regression
  - Ridge regression
Overall Procedure of Building a Model
[Figure: the model building procedure consists of six steps arranged in a loop]
Examples of Supervised Learning - Learning Phase
- Use the classification model to predict whether a person is a basketball player.
[Figure: training data with feature (attribute) columns and a target column]
Examples of Supervised Learning - Prediction Phase
- New data: recent data for which it is not known whether the people are basketball players.

  Name     | City    | Age | Label
  ---------|---------|-----|------
  Marine   | Miami   | 45  | ?
  Julien   | Miami   | 52  | ?
  Fred     | Orlando | 20  | ?
  Michelle | Boston  | 34  | ?
  Nicolas  | Phoenix | 90  | ?
What Is a Good Model?
- Interpretability: Is the prediction result easy to interpret?
- Prediction speed: How long does it take to predict each piece of data?
- Practicability: Is the prediction rate still acceptable when the service volume increases with a huge data volume?
Model Validity (1)
- Generalization capability: The goal of machine learning is that the model obtained after learning should perform well on new samples, not just on samples used for training. The capability of applying a model to new samples is called generalization or robustness.
Model Validity (2)
- Model capacity: the model's capability of fitting functions, which is also called model complexity.
  - When the capacity suits the task complexity and the amount of training data provided, the algorithm effect is usually optimal.
  - Models with insufficient capacity cannot solve complex tasks, and underfitting may occur.
  - A high-capacity model can solve complex tasks, but overfitting may occur if the capacity is higher than that required by a task.
- Variance: the variability of the model's prediction for a given data point.
- Bias: the difference between the expected (or average) prediction value and the correct value we are trying to predict.
Variance and Bias
- Combinations of variance and bias are as follows:
  - Low bias & low variance –> Good model
  - Low bias & high variance
Model Complexity and Error
- As the model complexity increases, the training error decreases.
- As the model complexity increases, the test error decreases to a certain point and then increases in the reverse direction, forming a convex curve.
[Figure: error vs. model complexity — the training error decreases monotonically while the testing error forms a U-shaped curve]
Machine Learning Performance Evaluation - Regression
- The closer the Mean Absolute Error (MAE) is to 0, the better the model can fit the training data.

  MAE = (1/𝑚) Σ_(i=1)^𝑚 |𝑦𝑖 − ŷ𝑖|

- The coefficient of determination 𝑅² compares the residual sum of squares (RSS) with the total sum of squares (TSS):

  𝑅² = 1 − RSS/TSS = 1 − [Σ_(i=1)^𝑚 (𝑦𝑖 − ŷ𝑖)²] / [Σ_(i=1)^𝑚 (𝑦𝑖 − ȳ)²]
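The two regression metrics above can be written directly from their formulas (the sample values are made up for illustration):

```python
def mae(y_true, y_pred):
    """Mean absolute error: the closer to 0, the better the fit."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - RSS/TSS."""
    mean = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - rss / tss

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
```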
Machine Learning Performance Evaluation - Classification (1)
- Terms and definitions:
  - 𝑃: positive, indicating the number of real positive cases in the data.
  - 𝑁: negative, indicating the number of real negative cases in the data.
  - 𝑇𝑃: true positive, indicating the number of positive cases that are correctly classified by the classifier.
  - 𝑇𝑁: true negative, indicating the number of negative cases that are correctly classified by the classifier.
  - 𝐹𝑃: false positive, indicating the number of negative cases that are incorrectly classified as positive by the classifier.
  - 𝐹𝑁: false negative, indicating the number of positive cases that are incorrectly classified as negative by the classifier.
  - Ideally, for a high-accuracy classifier, most prediction values should be located in the diagonal from 𝐶𝑀1,1 to 𝐶𝑀𝑚,𝑚 of the table while values outside the diagonal are 0 or close to 0. That is, 𝐹𝑃 and 𝐹𝑁 are close to 0.

  Confusion matrix:

             | Estimated yes | Estimated no | Total
  Actual yes | 𝑇𝑃            | 𝐹𝑁           | 𝑃
  Actual no  | 𝐹𝑃            | 𝑇𝑁           | 𝑁
  Total      | 𝑃′            | 𝑁′           | 𝑃+𝑁
Machine Learning Performance Evaluation - Classification (2)

  Measurement                                       | Ratio
  --------------------------------------------------|--------------------------------------------------
  Accuracy and recognition rate                     | (𝑇𝑃 + 𝑇𝑁) / (𝑃 + 𝑁)
  Error rate and misclassification rate             | (𝐹𝑃 + 𝐹𝑁) / (𝑃 + 𝑁)
  Sensitivity, true positive rate, and recall       | 𝑇𝑃 / 𝑃
  Specificity and true negative rate                | 𝑇𝑁 / 𝑁
  Precision                                         | 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑃)
  𝐹1, harmonic mean of the recall rate and precision | (2 × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑟𝑒𝑐𝑎𝑙𝑙) / (𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙)
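The table's ratios follow mechanically from the four confusion-matrix counts; a small helper makes that concrete (the counts passed in at the end are arbitrary illustration values):

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the table's ratios from confusion-matrix counts."""
    p, n = tp + fn, tn + fp              # real positives and real negatives
    precision = tp / (tp + fp)
    recall = tp / p                      # sensitivity / true positive rate
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "recall": recall,
        "specificity": tn / n,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```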
Example of Machine Learning Performance Evaluation
- We have trained a machine learning model to identify whether the object in an image is a cat. Now we use 200 pictures to verify the model performance. Among the 200 images, objects in 170 images are cats, while others are not. The identification result of the model is that objects in 160 images are cats, while others are not.
- Precision: 𝑃 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑃) = 140 / (140 + 20) = 87.5%
[Confusion matrix table: actual yes/no vs. estimated yes/no]
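The remaining confusion-matrix cells follow arithmetically from the counts stated in the example (𝑇𝑃 = 140 and 𝐹𝑃 = 20 are given; 𝐹𝑁 and 𝑇𝑁 are derived from the totals):

```python
# Counts from the example: 200 images, 170 actual cats, 160 predicted cats,
# of which 140 are correct (TP). The remaining cells follow from the totals.
total, actual_cats, predicted_cats, tp = 200, 170, 160, 140

fp = predicted_cats - tp        # 20 non-cats predicted as cats
fn = actual_cats - tp           # 30 cats the model missed
tn = total - tp - fp - fn       # 10 non-cats correctly rejected

precision = tp / (tp + fp)      # 140 / 160
recall = tp / actual_cats       # 140 / 170
accuracy = (tp + tn) / total
```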
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Training Method - Gradient Descent (1)
- The gradient descent method uses the negative gradient direction of the current position as the search direction, which is the steepest descent direction. The formula is as follows:

  w_(k+1) = w_k − η∇f_(w_k)(x^(i))

- In the formula, 𝜂 indicates the learning rate and 𝑖 indicates data record number 𝑖. The weight parameter 𝑤 is updated in each iteration.
[Figure: trajectory of gradient descent on a cost surface]
Machine Learning Training Method - Gradient Descent (2)
- Batch Gradient Descent (BGD) uses the samples (𝑚 in total) in all datasets to update the weight parameter based on the gradient value at the current point:

  w_(k+1) = w_k − η · (1/m) Σ_(i=1)^m ∇f_(w_k)(x^(i))

- Stochastic Gradient Descent (SGD) randomly selects a sample in a dataset to update the weight parameter based on the gradient value at the current point:

  w_(k+1) = w_k − η∇f_(w_k)(x^(i))

- Mini-Batch Gradient Descent (MBGD) uses a small batch of 𝑛 samples, starting from record 𝑡, to update the weight parameter:

  w_(k+1) = w_k − η · (1/n) Σ_(i=t)^(t+n−1) ∇f_(w_k)(x^(i))
Machine Learning Training Method - Gradient Descent (3)
- Comparison of the three gradient descent methods:
  - In SGD, samples selected for each training iteration are stochastic. Such instability causes the loss function to be unstable or even causes reverse displacement when the loss function decreases to the lowest point.
  - BGD has the highest stability but consumes too many computing resources. MBGD is a method that balances SGD and BGD.
- BGD: uses all training samples for training each time.
- SGD: uses one training sample for training each time.
- MBGD: uses a certain number of training samples for training each time.
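A minimal MBGD sketch on a toy one-weight least-squares problem (the data, learning rate, and batch size are illustrative choices; BGD and SGD are the batch-size = dataset-size and batch-size = 1 special cases of the same update rule):

```python
import random

# Fit y = w * x on toy data by minimizing the squared error with
# mini-batch gradient descent. The true weight is 2.0.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]

def gradient(w, batch):
    """d/dw of the mean squared error over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w, eta, batch_size = 0.0, 0.01, 4
for _ in range(200):
    batch = random.sample(data, batch_size)   # stochastic mini-batch
    w -= eta * gradient(w, batch)             # w_(k+1) = w_k - eta * grad
```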
Parameters and Hyperparameters in Models
- The model contains not only parameters but also hyperparameters. The purpose is to enable the model to learn the optimal parameters.
  - Parameters are automatically learned by models.
  - Hyperparameters are manually set.
[Figure: hyperparameters are used to control model training]
Hyperparameters of a Model
- Model hyperparameters are external configurations of models.
[Figure: common model hyperparameters]
Hyperparameter Search Procedure and Method
- Search algorithms (step 3):
  - Grid search
  - Random search
  - Heuristic intelligent search
  - Bayesian search
Hyperparameter Searching Method - Grid Search
- Grid search attempts to exhaustively search all possible hyperparameter combinations to form a hyperparameter value grid.
- In practice, the range of hyperparameter values to search is specified manually.
- Grid search is an expensive and time-consuming method.
  - This method works well when the number of hyperparameters is relatively small. Therefore, it is applicable to general machine learning algorithms but inapplicable to neural networks (see the deep learning part).
[Figure: grid search evaluates every point of a regular grid over hyperparameter 1 and hyperparameter 2]
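Exhaustive grid search is a few lines with `itertools.product`. The `cross_val_score` function below is a hypothetical stand-in that would normally train and validate a model; here it just scores a made-up objective so the search is self-contained:

```python
from itertools import product

def cross_val_score(params):
    """Stand-in for a real train/validate run; returns a validation score.
    This toy objective peaks at eta=0.1, depth=3."""
    return -(params["eta"] - 0.1) ** 2 - (params["depth"] - 3) ** 2

# The value grid is specified manually, as the slide notes.
grid = {"eta": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}

best_params, best_score = None, float("-inf")
for combo in product(*grid.values()):             # exhaustive: 3 x 3 = 9 runs
    params = dict(zip(grid.keys(), combo))
    score = cross_val_score(params)
    if score > best_score:
        best_params, best_score = params, score
```

The cost is the product of all grid sizes, which is why the method scales poorly as hyperparameters are added.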
Hyperparameter Searching Method - Random Search
- When the hyperparameter search space is large, random search is better than grid search.
- In random search, each setting is sampled from the distribution of possible parameter values, in an attempt to find the best subset of hyperparameters.
- Note:
  - Search is performed within a coarse range, which then will be narrowed based on where the best result appears.
  - Some hyperparameters are more important than others, and the search deviation will be affected during random search.
[Figure: random search samples random points over parameter 1 and parameter 2]
Cross Validation (1)
- Cross validation: a statistical analysis method used to validate the performance of a classifier. The basic idea is to divide the original dataset into two parts: a training set and a validation set. Train the classifier using the training set and test the model using the validation set to check the classifier performance.
- k-fold cross validation (𝑲-𝑪𝑽): divide the original data into 𝑘 groups, use each group as the validation set once while the remaining 𝑘−1 groups form the training set, and average the 𝑘 validation results.
Cross Validation (2)
[Figure: the entire dataset is split into 𝑘 folds; each fold serves as the validation set in turn]
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Machine Learning Algorithm Overview
[Figure: taxonomy of machine learning algorithms, including random forest, GBDT, KNN, and naive Bayes]
Linear Regression (1)
- Linear regression: a statistical analysis method to determine the quantitative relationships between two or more variables through regression analysis in mathematical statistics.
Linear Regression (2)
- The model function of linear regression is as follows, where 𝑤 indicates the weight parameter, 𝑏 indicates the bias, and 𝑥 indicates the sample attribute:

  h_w(x) = wᵀx + b

- The relationship between the value predicted by the model and the actual value is as follows, where 𝑦 indicates the actual value and 𝜀 indicates the error:

  y = wᵀx + b + ε

- The error 𝜀 is influenced by many factors independently. According to the central limit theorem, the error 𝜀 follows a normal distribution. According to the normal distribution function and maximum likelihood estimation, the loss function of linear regression is as follows:

  J(w) = (1/2m) Σ (h_w(x) − y)²

- To make the predicted value close to the actual value, we need to minimize the loss value. We can use the gradient descent method to calculate the weight parameter 𝑤 when the loss function reaches the minimum, and then complete model building.
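For a single feature, the minimum of J(w) can also be found in closed form, which gives the same answer gradient descent converges to. A sketch on invented toy data:

```python
# Fit h(x) = w*x + b to toy data with the 1-D least-squares closed form.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.1, 8.9]        # roughly y = 2x + 1 plus small errors

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - w * mx                  # bias from the means

# The loss J(w) = (1/2m) * sum((h(x) - y)^2) at the fitted parameters.
loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))
```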
Linear Regression Extension - Polynomial Regression
- Polynomial regression is an extension of linear regression. Generally, the complexity of a dataset exceeds the possibility of fitting by a straight line. That is, obvious underfitting occurs if the original linear regression model is used. The solution is to use polynomial regression:

  h_w(x) = w₁x + w₂x² + ⋯ + wₙxⁿ + b

- where the 𝑛th power is a polynomial regression dimension (degree).
Linear Regression and Overfitting Prevention
- Regularization terms can be used to reduce overfitting. The value of 𝑤 cannot be too large or too small in the sample space. You can add a square sum loss on the target function:

  J(w) = (1/2m) Σ (h_w(x) − y)² + λ Σ w²

- Regularization terms (norm): the regularization term here is called the L2-norm. Linear regression that uses this loss function is also called Ridge regression.
- Linear regression with an absolute-value (L1-norm) regularization term is called Lasso regression:

  J(w) = (1/2m) Σ (h_w(x) − y)² + λ ‖w‖₁
Logistic Regression (1)
- Logistic regression: The logistic regression model is used to solve classification problems. The model is defined as follows:

  P(Y = 1|x) = e^(wx+b) / (1 + e^(wx+b))
  P(Y = 0|x) = 1 / (1 + e^(wx+b))

  where 𝑤 indicates the weight, 𝑏 indicates the bias, and 𝑤𝑥 + 𝑏 is regarded as the linear function of 𝑥. Compare the preceding two probability values. The class with a higher probability value is the class of 𝑥.
Logistic Regression (2)
- Both the logistic regression model and linear regression model are generalized linear models. Logistic regression introduces nonlinear factors (the sigmoid function) based on linear regression and sets thresholds, so it can deal with binary classification problems. Its loss function is:

  J(w) = −(1/m) Σ (y ln h_w(x) + (1 − y) ln(1 − h_w(x)))

- where 𝑤 indicates the weight parameter, 𝑚 indicates the number of samples, 𝑥 indicates the sample, and 𝑦 indicates the real value. The values of all the weight parameters 𝑤 can also be obtained through the gradient descent algorithm.
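The sigmoid model and its cross-entropy loss translate directly into code (the sample points and the weights compared at the end are invented for illustration):

```python
import math

def sigmoid(z):
    """P(Y=1|x) for the linear score z = w*x + b."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, samples):
    """J(w) = -(1/m) * sum(y*ln(h) + (1-y)*ln(1-h))."""
    m = len(samples)
    total = 0.0
    for x, y in samples:
        h = sigmoid(w * x + b)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

# Toy 1-D data: negative x -> class 0, positive x -> class 1.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
```

A weight that separates the classes (e.g. w = 2) yields a lower loss than an uninformative weight (w = 0), which is exactly what gradient descent exploits.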
Logistic Regression Extension - Softmax Function (1)
- Logistic regression applies only to binary classification problems. For multi-class classification problems, use the Softmax function.
[Figure: binary questions (male? female?) vs. multi-class questions (grape? orange? apple? banana?)]
Logistic Regression Extension - Softmax Function (2)
- Softmax regression is a generalization of logistic regression that we can use for 𝐾-class classification. For class 𝑘:

  P(y = k|x) = e^(w_k·x) / Σ_(l=1)^K e^(w_l·x)
Logistic Regression Extension - Softmax Function (3)
- Softmax assigns a probability to each class in a multi-class problem. These probabilities must add up to 1.
  - Softmax may produce a form belonging to a particular class. Example:

  Category | Probability
  ---------|------------
  Grape?   | 0.09
  Orange?  | 0.22
  Apple?   | 0.68
  Banana?  | 0.01

  - Sum of all probabilities: 0.09 + 0.22 + 0.68 + 0.01 = 1. Most probably, this picture is an apple.
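The normalization that forces the probabilities to add up to 1 is the whole function (the raw class scores below are invented, not the ones behind the table above):

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that add up to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for grape, orange, apple, banana.
probs = softmax([1.0, 1.9, 3.0, -1.2])
best = max(range(len(probs)), key=probs.__getitem__)   # index 2 = apple
```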
Decision Tree
- A decision tree is a tree structure (a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute. Each branch represents the output of a feature attribute in a certain value range, and each leaf node stores a category. To use the decision tree, start from the root node, test the feature attributes of the items to be classified, select the output branches, and use the category stored on the leaf node as the final result.
[Figure: an animal-classification decision tree — tests such as short/tall, can/cannot squeak, short/long neck, short/long nose, and on land/in water lead to leaves such as rat, squirrel, giraffe, elephant, rhinoceros, and hippo]
Decision Tree Structure
[Figure: a root node branches into internal nodes, which branch into leaf nodes]
Key Points of Decision Tree Construction
- To create a decision tree, we need to select attributes and determine the tree structure between feature attributes. The key step of constructing a decision tree is to divide data of all feature attributes, compare the result sets in terms of 'purity', and select the attribute with the highest 'purity' as the data point for dataset division.
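One common way to quantify the 'purity' above is Gini impurity (other choices, such as information-gain/entropy, work the same way; the toy splits are invented):

```python
def gini(labels):
    """Gini impurity: 0 means a perfectly 'pure' label set."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return 1.0 - sum((cnt / n) ** 2 for cnt in counts.values())

def split_impurity(groups):
    """Size-weighted impurity of a candidate split; lower = better attribute."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

pure_split  = [["yes", "yes"], ["no", "no"]]   # attribute separates classes
mixed_split = [["yes", "no"], ["yes", "no"]]   # attribute tells us nothing
```

Construction greedily picks, at each node, the attribute whose split gives the lowest weighted impurity.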
Decision Tree Example
- The following figure shows a classification when a decision tree is used. The classification result is impacted by three attributes: Refund, Marital Status, and Taxable Income.

  Tid | Refund | Marital Status | Taxable Income | Cheat
  ----|--------|----------------|----------------|------
  1   | Yes    | Single         | 125,000        | No
  2   | No     | Married        | 100,000        | No
  3   | No     | Single         | 70,000         | No
  4   | Yes    | Married        | 120,000        | No
  5   | No     | Divorced       | 95,000         | Yes
  6   | No     | Married        | 60,000         | No
  7   | Yes    | Divorced       | 220,000        | No
  8   | No     | Single         | 85,000         | Yes
  9   | No     | Married        | 75,000         | No
  10  | No     | Single         | 90,000         | Yes

[Figure: the corresponding decision tree tests Refund first, then Marital Status, then Taxable Income]
SVM
- SVM is a binary classification model whose basic model is a linear classifier defined in the eigenspace with the largest interval. SVMs also include kernel tricks that make them nonlinear classifiers. The SVM learning algorithm is the optimal solution to convex quadratic programming.
[Figure: projection of data into a higher-dimensional space]
Linear SVM (1)
- How do we split the red and blue datasets by a straight line?
[Figure: two candidate separating lines for the same dataset]
Linear SVM (2)
- Straight lines are used to divide data into different classes. Actually, we can use multiple straight lines to divide data. The core idea of the SVM is to find a straight line and keep the points close to the straight line as far as possible from the straight line. This enables a strong generalization capability of the model. These points are called support vectors.
- In two-dimensional space, we use straight lines for segmentation. In high-dimensional space, we use hyperplanes for segmentation.
[Figure: the margin between support vectors is as large as possible]
Nonlinear SVM (1)
- How do we classify a nonlinear separable dataset?
Nonlinear SVM (2)
- Kernel functions are used to construct nonlinear SVMs.
- Kernel functions allow algorithms to fit the largest hyperplane in a transformed high-dimensional feature space.
- Common kernel functions: linear kernel function, polynomial kernel function, Gaussian kernel function, and sigmoid kernel function.
[Figure: a kernel maps the input space into a high-dimensional feature space]
KNN Algorithm (1)
- The KNN classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. According to this method, if the majority of the 𝑘 samples most similar to one sample (nearest neighbors in the eigenspace) belong to a specific category, this sample also belongs to this category.
[Figure: an unlabeled point '?' is classified by its nearest neighbors]
KNN Algorithm (2)
- As the prediction result is determined based on the number and weights of neighbors in the training set, the KNN algorithm has a simple logic.
- KNN is a non-parametric method which is usually used in datasets with irregular decision boundaries.
  - The KNN algorithm generally adopts the majority voting method for classification prediction and the average value method for regression prediction.
KNN Algorithm (3)
l Generally, a larger k value reduces the impact of noise on classification but blurs the boundary between classes.
p A larger k value means a higher probability of underfitting because the segmentation is too rough. A smaller k value means a higher probability of overfitting because the segmentation is too refined.
• The boundary becomes smoother as the value of k increases.
• As the k value increases to infinity, all data points will eventually become all blue or all red.
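The majority-voting rule described above can be sketched in a few lines of pure Python; the tiny two-class dataset is invented for illustration.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs; query: feature vector.
    # Sort training samples by squared Euclidean distance to the query.
    ranked = sorted(train,
                    key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)))
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [([1, 1], 'red'), ([1, 2], 'red'),
         ([5, 5], 'blue'), ([6, 5], 'blue'), ([5, 6], 'blue')]
print(knn_predict(train, [5.5, 5.5], k=3))  # blue
print(knn_predict(train, [1.0, 1.5], k=3))  # red
```

For regression, the final `Counter` vote would simply be replaced by the average of the k neighbors' target values, as the slide notes.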
Naive Bayes (1)
l Naive Bayes algorithm: a simple multi-class classification algorithm based on the Bayes theorem. It assumes that features are independent of each other. For a given sample feature 𝑋, the probability that the sample belongs to a category 𝐶𝑘 is:
P(𝐶𝑘 | 𝑋1, …, 𝑋𝑛) = P(𝑋1, …, 𝑋𝑛 | 𝐶𝑘) P(𝐶𝑘) / P(𝑋1, …, 𝑋𝑛)
Naive Bayes (2)
l Independence assumption of features.
p For example, if a fruit is red, round, and about 10 cm (3.94 in.) in diameter, it can be considered an apple.
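A minimal sketch of this rule for categorical features, assuming an invented four-sample fruit dataset: it multiplies the class prior by per-feature relative frequencies (no smoothing), relying directly on the independence assumption above.

```python
def naive_bayes_predict(train, features):
    # train: list of (feature_tuple, label) pairs.
    labels = [y for _, y in train]
    scores = {}
    for c in set(labels):
        prior = labels.count(c) / len(labels)          # P(C_k)
        rows = [x for x, y in train if y == c]
        likelihood = 1.0
        for i, f in enumerate(features):
            # P(X_i | C_k): relative frequency of value f within class c.
            likelihood *= sum(1 for r in rows if r[i] == f) / len(rows)
        scores[c] = prior * likelihood                  # proportional to P(C_k | X)
    return max(scores, key=scores.get)

# Invented toy data: (color, shape) -> fruit.
train = [(('red', 'round'), 'apple'),
         (('red', 'round'), 'apple'),
         (('yellow', 'long'), 'banana'),
         (('yellow', 'round'), 'apple')]
print(naive_bayes_predict(train, ('red', 'round')))    # apple
print(naive_bayes_predict(train, ('yellow', 'long')))  # banana
```

The denominator P(𝑋1, …, 𝑋𝑛) is the same for every class, so comparing the unnormalized scores is enough to pick the most probable category.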
Ensemble Learning
l Ensemble learning is a machine learning paradigm in which multiple learners are trained and combined to solve the same problem. When multiple learners are used, the integrated generalization capability can be much stronger than that of a single learner.
(Figure: models trained on the training set are synthesized into a large model.)
Classification of Ensemble Learning
l Bagging (Random Forest)
• Independently builds several basic learners and then averages their predictions.
• On average, a composite learner is usually better than a single base learner because of a smaller variance.
Ensemble Methods in Machine Learning (1)
l Random forest = Bagging + CART decision tree
(Figure: all training data is divided into data subsets, each producing a prediction; the final prediction uses majority voting for classification and the average value for regression.)
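A hedged sketch of the bagging mechanism above: each base learner here is a 1-nearest-neighbor classifier trained on a bootstrap sample, and the ensemble decides by majority voting. (A real random forest uses CART trees with random feature subsets; this simplification, with an invented dataset, only illustrates bootstrap-plus-voting.)

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

def one_nn(sample, query):
    # Trivial base learner: label of the nearest training point.
    return min(sample,
               key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)))[1]

def bagging_predict(train, query, n_learners=11):
    votes = []
    for _ in range(n_learners):
        # Bootstrap sample: draw len(train) points with replacement.
        boot = [random.choice(train) for _ in train]
        votes.append(one_nn(boot, query))
    # Majority voting over the base learners' predictions.
    return Counter(votes).most_common(1)[0][0]

train = [([0, 0], 'A'), ([0, 1], 'A'), ([1, 0], 'A'),
         ([5, 5], 'B'), ([6, 5], 'B'), ([5, 6], 'B')]
print(bagging_predict(train, [5.2, 5.1]))  # B
```

Because each learner sees a different bootstrap sample, their errors are partly independent, which is where the variance reduction the slide mentions comes from.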
Ensemble Methods in Machine Learning (2)
l GBDT is a type of boosting algorithm.
l For an additive model, the sum of the results of all the basic learners equals the predicted value. In essence, the next basic learner fits the residual of the error function with respect to the current predicted value. (The residual is the error between the predicted value and the actual value.)
(Figure: for a 10-year-old sample, the first learner predicts 9 years old; after residual calculation, the next learner predicts the remaining 1 year.)
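The additive mechanism above can be shown with the slide's age example: the ensemble prediction is the sum of the base learners' outputs, and each new learner is trained on the residual left by its predecessors. (Shrinkage/learning rate is omitted for clarity.)

```python
def boosting_predict(base_learner_outputs):
    # In an additive model the prediction is the sum of all learners' outputs.
    return sum(base_learner_outputs)

actual = 10.0
learner1 = 9.0                 # first base learner predicts 9 years old
residual = actual - learner1   # residual = 1 year
learner2 = residual            # second learner fits the residual exactly (for illustration)
print(boosting_predict([learner1, learner2]))  # 10.0
```

In a real GBDT each "learner" is a small decision tree fitted to the residuals of all samples at once, but the summation structure is exactly this.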
Unsupervised Learning - K-means
l K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
(Figure: the data is not tagged; K-means clustering can automatically classify the dataset, shown before and after on the x1-x2 plane.)
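A minimal 1-D sketch of the K-means loop, with invented points and initial centers: alternate between assigning each point to its nearest mean and recomputing the means.

```python
def kmeans_1d(points, centers, n_iters=10):
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1.0, 1.5, 0.5, 9.0, 9.5, 8.5], [0.0, 5.0]))  # [1.0, 9.0]
```

Real implementations run in multiple dimensions and stop when assignments no longer change; the fixed iteration count here is a simplification.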
Unsupervised Learning - Hierarchical Clustering
l Hierarchical clustering divides a dataset at different layers and forms a tree-like clustering structure. The dataset division may use a "bottom-up" aggregation policy or a "top-down" splitting policy. The hierarchy of clustering is represented in a tree graph. The root is the unique cluster of all samples, and each leaf is a cluster containing a single sample.
(Figure: agglomerative vs. splitting hierarchical clustering.)
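The "bottom-up" aggregation policy can be sketched as follows (single linkage, 1-D points, invented data): start with each sample as its own cluster and repeatedly merge the two closest clusters until k remain.

```python
def agglomerative(points, k):
    # Leaves of the tree: every sample starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.9], 3))  # [[1.0, 1.1], [5.0, 5.2], [9.9]]
```

Running the loop all the way to k = 1 would reproduce the root of the tree graph described above; stopping early cuts the tree at a chosen layer.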
Contents
1. Machine Learning Definition
2. Machine Learning Types
3. Machine Learning Process
4. Other Key Machine Learning Methods
5. Common Machine Learning Algorithms
6. Case Study
Comprehensive Case
l Assume that there is a dataset containing the house areas and prices of 21,613 housing units sold in a city. Based on this data, we can predict the prices of other houses in the city.
House Area    Price
1,180         221,900
2,570         538,000
770           180,000
1,960         604,000
1,680         510,000
5,420         1,225,000
1,715         257,500
1,060         291,850
1,160         468,000
1,430         310,000
1,370         400,000
1,810         530,000
…             …
(Dataset)
Problem Analysis
l This case contains a large amount of data, including input x (house area) and output y (price), which is a continuous value. We can use regression of supervised learning. Draw a scatter chart based on the data and use linear regression.
l Our goal is to build a model function h(x) that infinitely approximates the function expressing the true distribution of the dataset.
l Then, use the model to predict unknown price data.
Unary linear regression function: h(x) = w0 + w1x
(Figure: the dataset, with house area as the feature x and price as the label y, feeds a learning algorithm that outputs h(x); a scatter chart plots price against house area.)
Goal of Linear Regression
l Linear regression aims to find a straight line that best fits the dataset.
l Linear regression is a parameter-based model. Here, we need to learn the parameters 𝑤0 and 𝑤1. When these two parameters are found, the best model appears.
h(x) = w0 + w1x
(Figure: two candidate lines over the price vs. house area scatter plots.)
Loss Function of Linear Regression
l To find the optimal parameter, construct a loss function and find the parameter values when the loss function becomes the minimum.
l Loss function of linear regression: J(w) = 1/(2m) Σ (h(x) - y)²
l Goal: find w that minimizes J(w), that is, min_w 1/(2m) Σ (h(x) - y)²
• where m indicates the number of samples,
• h(x) indicates the predicted value, and y indicates the actual value.
(Figure: the errors are the vertical distances between the sample points and the fitted line on the price vs. house area plot.)
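The loss above can be computed directly; the toy data (y = 2x) is invented to show that a perfect fit gives zero loss while a worse line gives a positive one.

```python
def h(x, w0, w1):
    # Unary linear regression hypothesis: h(x) = w0 + w1 * x.
    return w0 + w1 * x

def loss(xs, ys, w0, w1):
    # J(w) = 1/(2m) * sum((h(x) - y)^2)
    m = len(xs)
    return sum((h(x, w0, w1) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data following y = 2x
print(loss(xs, ys, 0.0, 2.0))  # 0.0 — the line y = 2x fits exactly
print(loss(xs, ys, 0.0, 1.0))  # (1 + 4 + 9) / 6 ≈ 2.333
```

Minimizing this function over (w0, w1) is exactly the "Goal" stated above, and is what gradient descent does on the next slide.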
Gradient Descent Method
l The gradient descent algorithm finds the minimum value of a function through iteration.
l It randomizes an initial point on the loss function, and then searches along the negative gradient direction for the minimum value of the loss function (for the convex loss of linear regression, this is the global minimum). The parameter values at that point are the optimal parameter values.
Iteration Example
l The following is an example of a gradient descent iteration. We can see that as the red points on the loss function surface gradually approach the lowest point, the fitting of the linear regression red line to the data becomes better and better. At this time, we can get the best parameters.
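The iteration above can be sketched for the unary regression h(x) = w0 + w1x, minimizing J(w) = 1/(2m) Σ (h(x) - y)²; the learning rate, iteration count, and toy data (y = 2x + 1) are illustrative assumptions.

```python
def gradient_descent(xs, ys, lr=0.05, n_iters=2000):
    w0, w1, m = 0.0, 0.0, len(xs)
    for _ in range(n_iters):
        # Per-sample prediction errors at the current parameters.
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of J(w) with respect to w0 and w1.
        grad_w0 = sum(errors) / m
        grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Step in the negative gradient direction.
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1

w0, w1 = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(round(w0, 2), round(w1, 2))  # 1.0 2.0  (the data follow y = 2x + 1)
```

Each pass corresponds to one red point on the loss surface in the figure; as the iterations proceed, (w0, w1) converge to the line that best fits the data.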
Model Debugging and Application
l After the model is trained, test it with the test set to ensure the generalization capability.
l If overfitting occurs, use Lasso regression or Ridge regression with regularization terms, and tune the hyperparameters.
l If underfitting occurs, use a more complex model.
l Note:
p For real data, pay attention to the functions of data cleansing and feature engineering.
The final model result is as follows: h(x) = 280.62x - 43581
(Figure: the fitted line over the price vs. house area scatter plot.)
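As a sketch of the regularization idea mentioned above, Ridge regression adds an L2 penalty term to the linear-regression loss; `lam` (the regularization strength) is a hyperparameter to tune, and the toy data is invented.

```python
def ridge_loss(xs, ys, w0, w1, lam):
    # Ordinary linear-regression loss ...
    m = len(xs)
    mse = sum(((w0 + w1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    # ... plus an L2 penalty on the slope. The intercept is usually not
    # penalized; Lasso would use lam * |w1| instead.
    return mse + lam * (w1 ** 2)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ridge_loss(xs, ys, 0.0, 2.0, lam=0.0))  # 0.0 — no penalty
print(ridge_loss(xs, ys, 0.0, 2.0, lam=0.1))  # 0.4 — penalty 0.1 * 2^2
```

The penalty pushes the optimizer toward smaller weights, which is what reduces overfitting when the unregularized fit is too flexible.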
Summary
Quiz
1. (True or false) Gradient descent iteration is the only method of machine learning algorithms. ( )
A. True
B. False
2. (Single-answer question) Which of the following algorithms is not supervised learning? ( )
A. Linear regression
B. Decision tree
C. KNN
D. K-means
Recommendations
l Online learning website
p https://e.huawei.com/en/talent/#/
l Huawei Knowledge Base
p https://support.huawei.com/enterprise/en/knowledge?lang=en
Thank you.
Bring digital to every person, home, and
organization for a fully connected,
intelligent world.