
MACHINE LEARNING, DEEP LEARNING, AND MOTION ANALYSIS

University of Waterloo
Department of Electrical & Computer Engineering

Terry Taewoong Um (terry.t.um@gmail.com)


CAUTION

• I cannot explain everything


• You cannot get every detail

• Try to get a big picture


• Get some useful keywords
• Connect with your research

CONTENTS

1. What is Machine Learning? (Part 1 Q & A)
2. What is Deep Learning? (Part 2 Q & A)
3. Machine Learning in Motion Analysis (Part 3 Q & A)

CONTENTS

1. What is Machine Learning?

WHAT IS MACHINE LEARNING?
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)

Example: A program for soccer tactics

T : Win the game
P : Goals
E : (x) Players' movements, (y) Evaluation

WHAT IS MACHINE LEARNING?
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)

“Toward learning robot table tennis”, J. Peters et al. (2012)


https://youtu.be/SH3bADiB7uQ

TASKS
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)

classification : discrete target values (e.g. x : pixels (28×28), y : 0, 1, 2, 3, …, 9)
regression : real target values (e.g. x ∈ (0,100), y real-valued)
clustering : no target values (e.g. x ∈ (-3,3)×(-3,3))

PERFORMANCE
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)

classification : 0-1 loss function
regression : L2 loss function
clustering : no target values, so no supervised loss

EXPERIENCE
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E“ – T. Michell (1997)

classification : labeled data — (pixels) → (number)
regression : labeled data — (x) → (y)
clustering : unlabeled data — (x1, x2)

A TOY EXAMPLE

[Scatter plot: Input X = Height (cm) on the horizontal axis, Output Y = Weight (kg) on the vertical axis; '?' marks the weight to be predicted for a new height]

A TOY EXAMPLE

[Scatter plot with a fitted line Y = aX + b; e.g. X = 180 cm → Y = 80 kg]

Model : Y = aX + b, Parameters : (a, b)
[Goal] Find (a, b) which best fit the given data
A TOY EXAMPLE
[Numerical Solution]
1. Set a cost function, e.g. the L2 cost $L(a,b) = \sum_i (y_i - (a x_i + b))^2$
2. Apply an optimization method (e.g. Gradient Descent (GD) Method)
→ suffers from the local minima problem
[Figure: cost surface L over the parameters (a, b)]

[Analytic Solution]
Least squares problem: from AX = b, X = A#b, where A# is A's pseudoinverse
→ not always available

http://www.yaldex.com/game-development/1592730043_ch18lev1sec4.html
http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm
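Both routes can be sketched in a few lines of numpy (the data values and the learning rate below are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy height -> weight data (illustrative values only)
X = np.array([150., 160., 170., 180., 190.])
Y = np.array([55., 62., 68., 80., 88.])

# [Analytic] least squares via the pseudoinverse: from AX = b, X = A#b
A = np.stack([X, np.ones_like(X)], axis=1)     # design matrix [x, 1]
a_ls, b_ls = np.linalg.pinv(A) @ Y

# [Numerical] gradient descent on the L2 cost
# (standardize x first; raw cm values make GD badly conditioned)
Xs = (X - X.mean()) / X.std()
a, b = 0.0, 0.0
for _ in range(2000):
    err = a * Xs + b - Y                       # residuals
    a -= 0.1 * 2 * np.mean(err * Xs)           # dL/da
    b -= 0.1 * 2 * np.mean(err)                # dL/db

# map the GD solution back to the original scale and compare
print(a_ls, b_ls)
print(a / X.std(), b - a * X.mean() / X.std())
```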

WHAT WOULD BE THE CORRECT MODEL?

Select a model → Set a cost function → Optimization

[Scatter plot: Age (year) on the horizontal axis vs. Running Record (min) on the vertical axis; e.g. age 32 → record 140 min]

WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model

[Figure: a highly complex curve fits every training point — "overfitting" — yet predicts poorly for new inputs]

L2 REGULARIZATION

Add an L2 penalty to the cost: $L(w) = \sum_i (y_i - f(x_i; w))^2 + \lambda \lVert w \rVert^2$ (e.g. w = (a, b) where Y = aX + b)

Avoid a complicated model!

• Another interpretation : Maximum a Posteriori (MAP)

http://goo.gl/6GE2ix
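A minimal sketch of the L2-regularized (ridge) fit in closed form; the data and λ values are arbitrary assumptions (and the bias b is regularized too, for brevity):

```python
import numpy as np

def ridge_fit(x, y, lam=1.0):
    """L2-regularized least squares: minimize ||Aw - y||^2 + lam * ||w||^2,
    whose closed form is w = (A^T A + lam*I)^{-1} A^T y."""
    A = np.column_stack([x, np.ones_like(x)])   # [x, 1] so that w = (a, b)
    I = np.eye(A.shape[1])
    return np.linalg.solve(A.T @ A + lam * I, A.T @ y)

# a larger lam pulls (a, b) toward zero, i.e. toward a less complicated model
x = np.array([150., 160., 170., 180., 190.])
y = np.array([55., 62., 68., 80., 88.])
print(ridge_fit(x, y, lam=0.0), ridge_fit(x, y, lam=100.0))
```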

L2 REGULARIZATION
• Bayesian inference: posterior = prior × likelihood, divided by a normalization term

$P(Belief|Data) = \frac{P(Belief)\,P(Data|Belief)}{P(Data)}$

ex) fair coin : 50% H, 50% T / falsified coin : 80% H, 20% T
Let's say we observed ten heads consecutively. What's the probability of the coin being fair?

Fair coin? P(Belief) = 0.2 (you don't believe this coin is fair)
P(Data|Belief) = 0.5¹⁰ ≈ 0.001, so P(Belief|Data) ∝ 0.2 × 0.001 = 0.0002

Falsified coin? P(Belief) = 0.8
P(Data|Belief) = 0.8¹⁰ ≈ 0.107, so P(Belief|Data) ∝ 0.8 × 0.107 = 0.0856

Fair = 0.0002 / (0.0002 + 0.0856) ≈ 0.23% , Unfair ≈ 99.77%
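The same posterior computation, as a few lines of Python:

```python
# Bayes' rule for the two-coin example above
prior = {"fair": 0.2, "falsified": 0.8}
likelihood = {"fair": 0.5 ** 10, "falsified": 0.8 ** 10}   # ten heads in a row
unnormalized = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnormalized.values())                 # P(Data), the normalization term
posterior = {h: p / Z for h, p in unnormalized.items()}
print(posterior)    # {'fair': ~0.0023, 'falsified': ~0.9977}
```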


WHAT WOULD BE THE CORRECT MODEL?
1. Regularization 2. Nonparametric model

training set : for training (parameter optimization)
validation set : for early stopping (avoid overfitting)
test set : for evaluation (measure the final performance)

[Figure: error vs. training time — the training error keeps decreasing, but the validation/test error eventually rises again; we should stop where the validation error is lowest. Keep watching the validation error.]
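A self-contained sketch of this train/validation/test protocol (synthetic linear-regression data; the split sizes and learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + 0.5 * rng.normal(size=300)
Xtr, ytr = X[:100],    y[:100]       # training set: parameter optimization
Xva, yva = X[100:200], y[100:200]    # validation set: early stopping
Xte, yte = X[200:],    y[200:]       # test set: final evaluation

w = np.zeros(20)
best_val, best_w = np.inf, w.copy()
for epoch in range(500):
    w -= 0.01 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)   # one GD step on the training set
    val = np.mean((Xva @ w - yva) ** 2)              # keep watching the validation error
    if val < best_val:
        best_val, best_w = val, w.copy()             # remember the best point so far
    # (a patience counter would stop training once val keeps increasing)

print("test error:", np.mean((Xte @ best_w - yte) ** 2))
```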

NONPARAMETRIC MODEL
• It does not assume any parametric model (e.g. Y = aX+b, Y = aX²+bX+c, etc.)
• It often requires many more samples

• Kernel methods are frequently applied for modeling the data

• Gaussian Process Regression (GPR), a kind of kernel method, is a widely-used nonparametric regression method
• Support Vector Machine (SVM), also a kind of kernel method, is a widely-used nonparametric classification method

[Figure: a kernel function maps the data from the input space to a feature space]

SUPPORT VECTOR MACHINE (SVM)

[Figures: linear classifiers — among the many separating hyperplanes, pick the one with the maximum margin]

"Myo", Thalmic Labs (2013), https://youtu.be/oWu9TFJjHaM

[Dual formulation: the data enter only through inner products, which can be replaced by a kernel function]
Support Vector Machine Tutorial, J. Weston, http://goo.gl/19ywcj
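A quick scikit-learn sketch of a kernel SVM; the dataset and hyperparameters are arbitrary choices, not from the slides:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space,
# but an RBF kernel implicitly maps them to a feature space where they are
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```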


GAUSSIAN PROCESS REGRESSION (GPR)
• Gaussian Distribution
• Multivariate regression likelihood

prior × likelihood → posterior
prediction : conditioning the joint distribution of the observed & predicted values

https://goo.gl/EO54WN
https://youtu.be/kvPmArtVoFE
https://youtu.be/YqhLnCm0KXY
http://goo.gl/XvOOmf
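A minimal numpy sketch of GPR prediction, assuming a zero-mean prior and an RBF kernel (the kernel choice, length scale, and noise level are assumptions):

```python
import numpy as np

def gpr_predict(x_train, y_train, x_test, ell=1.0, noise=1e-2):
    """GPR with an RBF kernel: condition the joint Gaussian of the
    observed & predicted values on the observations."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))   # noisy prior covariance
    Ks = k(x_test, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)                  # posterior mean
    cov = k(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)  # posterior covariance
    return mean, np.sqrt(np.diag(cov))                       # mean and std. dev.

x = np.array([-2., -1., 0., 1., 2.])
y = np.sin(x)
mu, sd = gpr_predict(x, y, np.linspace(-3, 3, 7))
print(mu.round(2), sd.round(2))
```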

DIMENSION REDUCTION

• Principal Component Analysis
: Find the best orthogonal axes (= principal components) which maximize the variance of the data

high dim. → low dim. : $Y = PX$
* The rows in P are the m largest eigenvectors of $\frac{1}{N} X X^T$ (the covariance matrix)

low dim. → high dim. : $X \rightarrow \phi(X)$
[Original space] → [Feature space]
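A compact numpy sketch of PCA following the formula above (rows-as-samples convention; the toy data are made up):

```python
import numpy as np

def pca(X, m):
    """Rows of X are N samples; project onto the m largest principal components."""
    Xc = X - X.mean(axis=0)               # center the data first
    C = Xc.T @ Xc / len(Xc)               # covariance matrix (1/N) X X^T
    eigval, eigvec = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
    P = eigvec[:, ::-1][:, :m].T          # m largest eigenvectors as rows of P
    return Xc @ P.T                       # Y = P X, one row per sample

# 3-D points that mostly vary along one direction -> compress to 2-D
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2., 1., 0.5]]) + 0.1 * rng.normal(size=(100, 3))
print(pca(X, 2).shape)    # (100, 2)
```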

DIMENSION REDUCTION

http://jbhuang0604.blogspot.kr/2013/04/miss-korea-2013-contestants-face.html

SUMMARY - PART 1
• Machine Learning
- Tasks : Classification, Regression, Clustering, etc.
- Performance : 0-1 loss, L2 loss, etc.
- Experience : labeled data, unlabeled data

• Machine Learning Process
(1) Select a parametric / nonparametric model
(2) Set a performance measure, including a regularization term
(3) Train on the data (optimize the parameters) until the validation error increases
(4) Evaluate the final performance using the test set

• Nonparametric models : Support Vector Machine, Gaussian Process Regression

• Dimension reduction : used for pre-processing the data

CONTENTS

Questions about Part 1?

CONTENTS

2. What is Deep Learning?

PARADIGM CHANGE

PAST : What is the best ML method (e.g. GPR, SVM) for the target task? (Knowledge → ML Method)

PRESENT : How can we find a good representation? (Knowledge → Representation)

PARADIGM CHANGE

PRESENT : How can we find a good representation (e.g. a kernel function mapping the input space to a feature space)? (Knowledge → Representation)

PARADIGM CHANGE

PRESENT : How can we find a good representation? For IMAGE and SPEECH tasks, hand-crafted features have served as the representation. (Knowledge → Representation (Features))

PARADIGM CHANGE

PRESENT : Instead of hand-crafted IMAGE and SPEECH features — can we learn a good representation (feature) for the target task as well? (Knowledge → Representation (Features))

DEEP LEARNING
• What is Deep Learning (DL)?
- Learning methods which have a deep (not shallow) architecture
- It often allows end-to-end learning
- It automatically finds intermediate representations; thus, it can be regarded as representation learning
- It often consists of stacked "neural networks"; thus, deep learning usually indicates a "deep neural network" http://goo.gl/5Ry08S

"Deep Gaussian Process" (2013) http://goo.gl/fxmmPE
https://youtu.be/NwoGqYsQifg

OUTSTANDING PERFORMANCE OF DL

• State-of-the-art results achieved by DL
- Object recognition (Simonyan et al., 2015)
- Natural machine translation (Bahdanau et al., 2014)
- Speech recognition (Chorowski et al., 2014)
- Face recognition (Taigman et al., 2014)
- Emotion recognition (Ebrahimi-Kahou et al., 2014)
- Human pose estimation (Jain et al., 2014)
- Deep reinforcement learning (Mnih et al., 2013)
- Image/Video captioning (Xu et al., 2015)
- Particle physics (Baldi et al., 2014)
- Bioinformatics (Leung et al., 2014)
- And so on…

[Figure: ImageNet error rate 28% (2010) → 15% (2012) → 8% (2014)] K. Cho, https://goo.gl/vdfGpu

DL has won most of the ML challenges!

BIOLOGICAL EVIDENCE

Andrew Ng, https://youtu.be/ZmNOAtZIgIk

• Somatosensory cortex learns to see
• Why do we need different ML methods for different tasks?

• The ventral pathway in the visual cortex has multiple stages
• There exist a lot of intermediate representations
Yann LeCun, https://goo.gl/VVQXJG

BIG MOVEMENT

Going deeper and deeper….

http://goo.gl/zNbBE2 http://goo.gl/Lk64Q4

NEURAL NETWORK (NN)

• Universal approximation theorem (Hornik, 1991)

- A single-hidden-layer NN with linear output can approximate any continuous function arbitrarily well, given enough hidden units
- This does not imply that we have a learning method to train them

Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
TRAINING NN
1. First, calculate the output using the data & initial parameters (W, b)

• Activation functions

http://goo.gl/qMQk5H

Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
TRAINING NN
2. Then, calculate the error and update the weights from top to bottom

: Backpropagation algorithm

• Parameter gradients known

http://goo.gl/qMQk5H

Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
TRAINING NN
3. Repeat this process with different datasets (mini-batches)

- Forward propagation (calculate the output values)


- Evaluate the error
- Backward propagation (update the weights)
- Repeat this process until the error converges

• As you can see here, NN is not a fancy algorithm, but just an iterative gradient descent method with a huge number of parameters

• NN is often likely to get stuck in local minima

http://goo.gl/qMQk5H
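The whole loop fits in a few lines of numpy; everything below (architecture, data, learning rate) is an arbitrary toy choice, not the slides' example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                            # toy inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]       # XOR-like labels
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

for _ in range(5000):
    # 1. forward propagation: calculate the output values
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. evaluate the error (squared error), then backward propagation
    d_out = (out - y) * out * (1 - out)                  # delta at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)                   # delta at the hidden layer
    # 3. gradient descent update of the weights
    W2 -= 0.5 * h.T @ d_out / len(X); b2 -= 0.5 * d_out.mean(axis=0)
    W1 -= 0.5 * X.T @ d_h / len(X);   b1 -= 0.5 * d_h.mean(axis=0)

print("training error rate:", np.mean((out > 0.5) != y))
```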
FROM NN TO DEEP NN
• A long winter of NN
- NN requires expert skill to tune the hyperparameters
- It sometimes gives a good result, but sometimes a bad one; the result is highly dependent on the quality of the initialization, regularization, hyperparameters, data, etc.
- Local minima are always problematic

• From NN to deep NN (since 2006)

Geoffrey Hinton Yann LeCun Yoshua Bengio


(U. Toronto, Google) (NYU, Facebook) (U. Montreal)

WHY IS DL SO SUCCESSFUL?

• Pre-training with unsupervised learning

• Convolutional Neural Network

• Recurrent Neural Net

• GPGPU (parallel processing) & big data

• Advanced algorithms for optimization,


activation, regularization

• Huge research society


(Vision, Speech, NLP, Biology, etc.)

http://t-robotics.blogspot.kr/2015/05/deep-learning.html

UNSUPERVISED LEARNING
• How can we avoid pathological local minima?

(1) First, pre-train the data with an unsupervised learning method and get a new representation
(2) Stack up these block structures
(3) Train each layer in an end-to-end manner
(4) Fine-tune the final structure with an (ordinary) fully-connected NN

http://goo.gl/QGJm5k

• Unsupervised learning methods
- Restricted Boltzmann Machine (RBM) → Deep RBM, Deep Belief Network (DBN)
- Autoencoder → Deep Autoencoder (see the sketch below)

Autoencoder http://goo.gl/s6kmqY
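A minimal numpy autoencoder sketch (one hidden layer; the sizes, learning rate, and data are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # toy data to compress

# One-hidden-layer autoencoder: encode 10 dims down to 3, then decode back
W1 = rng.normal(scale=0.1, size=(10, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(3, 10)); b2 = np.zeros(10)

for _ in range(3000):
    H = np.tanh(X @ W1 + b1)            # the new (learned) representation
    Xr = H @ W2 + b2                    # reconstruction of the input
    d = 2 * (Xr - X) / len(X)           # gradient of the reconstruction error
    dH = (d @ W2.T) * (1 - H ** 2)
    W2 -= 0.1 * H.T @ d;  b2 -= 0.1 * d.sum(0)
    W1 -= 0.1 * X.T @ dH; b1 -= 0.1 * dH.sum(0)

print("reconstruction MSE:", np.mean((Xr - X) ** 2))
```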

UNSUPERVISED LEARNING

“Convolutional deep belief networks for scalable unsupervised learning of hierarchical representation”, Lee et al., 2012

CONVOLUTIONAL NN

https://goo.gl/Xswsbd

• How can we deal with real images, which are much bigger than the MNIST digit images?

- Use a locally-connected (not fully-connected) NN
- Use convolutions to get various feature maps https://goo.gl/G7kBjI
- Abstract the results into higher layers by using pooling
- Fine-tune with a fully-connected NN http://goo.gl/5OR5oH
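Convolution and pooling themselves are simple; a minimal numpy sketch (loop-based for clarity, not speed; the edge kernel is an arbitrary example):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (cross-correlation, as commonly implemented
    in CNNs): slide the kernel over the image to produce one feature map."""
    H, W = img.shape; k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+k, j:j+k] * kernel)
    return out

def max_pool(fmap, s=2):
    """s x s max pooling: abstract a feature map into a higher layer."""
    H, W = (fmap.shape[0] // s) * s, (fmap.shape[1] // s) * s
    return fmap[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

img = np.arange(36.).reshape(6, 6)
edge = np.array([[1., 0., -1.]] * 3)        # simple vertical-edge kernel
print(max_pool(conv2d(img, edge)).shape)    # (4, 4) feature map -> (2, 2)
```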

CONVOLUTIONAL NN

"Visualizing and Understanding Convolutional Networks", Zeiler et al., 2014

CONVNET + RNN

"Large-scale Video Classification with Convolutional Neural Networks", Karpathy et al., 2014, https://youtu.be/qrzQ_AB1DZk

RECURRENT NEURAL NETWORK (RNN)
[Figures: a feed-forward Neural Network vs. a Recurrent Neural Network unrolled over time steps t-1, t, t+1]
http://www.dmi.usherb.ca/~larocheh/index_en.html

RECURRENT NEURAL NETWORK (RNN)
[Figures: back propagation in a feed-forward Neural Network vs. back propagation through time (BPTT) in a Recurrent Neural Network]

"Training Recurrent Neural Networks", I. Sutskever, 2013

• Vanishing gradient problem : the network can't have long memory!
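A minimal sketch of a vanilla RNN's forward pass (the sizes are arbitrary); BPTT differentiates through this very loop, and the repeated multiplication by the recurrent weights is where the vanishing gradient comes from:

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(Wx x_t + Wh h_{t-1} + b), unrolled over time."""
    h = np.zeros(Wh.shape[0])
    hs = []
    for x in xs:                      # one step per time index t
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 3))         # a length-10 sequence of 3-D inputs
Wx, Wh, b = rng.normal(size=(4, 3)), 0.5 * rng.normal(size=(4, 4)), np.zeros(4)
print(rnn_forward(xs, Wx, Wh, b).shape)   # (10, 4) hidden states
```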

RNN + LSTM
• Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)

"Training Recurrent Neural Networks", I. Sutskever, 2013

INTERESTING RESULTS FROM RNN
http://cs.stanford.edu/people/karpathy/deepimagesent/

"Generating Sequences With Recurrent Neural Networks", A. Graves, 2013

http://pail.unist.ac.kr/carpedm20/poet/

CONTENTS

Questions about Part 2?

CONTENTS

3. Machine Learning in Motion Analysis

MOTION DATA

"츄리닝" ("Tracksuit"), a Korean webcomic by 이상신 and 국중록

MOTION DATA
"츄리닝" ("Tracksuit"), a Korean webcomic by 이상신 and 국중록

We need to know the state not only at time t, $f = f(x, t)$, but also at times t-1, t-2, t-3, etc.

MOTION DATA
• Why do motion data need special treatment?
- In general, most machine learning techniques assume an i.i.d. (independent & identically distributed) sampling condition, e.g. coin tossing
- However, motion data are temporally & spatially correlated

swing motion http://goo.gl/LQulvc ; manipulability ellipsoid https://goo.gl/dHjFO9

MOTION DATA
http://goo.gl/ll3sq6

We can infer the next state based on the temporal & spatial information. But how can we exploit those benefits in ML methods?

WHAT CAN WE DO WITH MOTION DATA?

TASKS
• Learning the kinematic/dynamic model
• Motion segmentation
• Motion generation / synthesis
• Motion imitation (Imitation learning)
• Activity / Gesture recognition
http://goo.gl/gFOVWL

Data Applications
• Motion capture data • Biomechanics
• Vision Data • Humanoid
• Dynamic-level data • Animation

HIDDEN MARKOV MODEL (HMM)

The probability of the state at step (n+1) depends only on the state at step n (the Markov property)

LIMITATIONS OF HMM
• A common procedure of HMM for motion analysis

1. Extract features (e.g. PCA)
2. Define the HMM structure (e.g. using GMM)
3. Train a separate HMM per class (Baum-Welch algorithm)
4. Evaluate the probability under each HMM (Forward/Backward algorithm; see the sketch below)
or 3'. Choose the most probable sequence (Viterbi algorithm)
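A minimal numpy sketch of step 4's forward algorithm (the toy probabilities are made up):

```python
import numpy as np

def hmm_forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence) under an HMM.
    pi: initial state probs, A[i, j]: transition i->j, B[i, k]: P(obs k | state i)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # Markov property: the next state depends only on the current one
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# two hidden states, two observation symbols (toy numbers)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward([0, 1, 0], pi, A, B))
```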

• Limitations & trend change in the speech recognition area
- HMMs handle discrete states only!
- HMMs have short memory! (using just the previous state)
- HMMs have limited expressive power!
- [Trend 1] features-GMM → unsupervised learning methods
- [Trend 2] features-GMM-HMM → recurrent neural networks

CAPTURE TEMPORAL INFORMATION
• 3D ConvNet
- "3D Convolutional Neural Networks for Human Action Recognition" (Ji et al., 2010)

- 3D convolution

- Activity recognition / Pose estimation from video

“Joint Training of a Convolutional Network


and a Graphical Model for Human Pose
Estimation”, Tompson et al., 2014

CAPTURE TEMPORAL INFORMATION
• Recurrent Neural Network (RNN)

“Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition”, Y. Du et al., 2015

• However, how can we capture the


spatial information about motions?

CHALLENGES
We should connect the geometric information with deep neural networks!

• The link transformation from the (i-1)th link to the ith link:
$X_{i-1,i} = \mathrm{Rot}(z, \theta_i)\,\mathrm{Trans}(z, d_i)\,\mathrm{Trans}(x, a_i)\,\mathrm{Rot}(x, \alpha_i) = e^{[A_i]\theta_i} M_{i-1,i}$
(variable : $\theta$ ; constant : $M$)

• Forward Kinematics (product of exponentials):
$X_{0,n} = e^{[A_1]\theta_1} M_{0,1}\, e^{[A_2]\theta_2} M_{1,2} \cdots e^{[A_n]\theta_n} M_{n-1,n} = e^{[S_1]\theta_1} e^{[S_2]\theta_2} \cdots e^{[S_n]\theta_n} M_{0,n}$
where $S_i = \mathrm{Ad}_{M_{0,1} \cdots M_{i-2,i-1}}(A_i), \; i = 1, \cdots, n$
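A minimal numpy/scipy sketch of the product-of-exponentials formula (the example arm and its screw axes are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """3x3 skew-symmetric matrix [w] for a rotation axis w."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def twist_matrix(S):
    """4x4 se(3) matrix [S] for a twist S = (w, v)."""
    T = np.zeros((4, 4))
    T[:3, :3] = skew(S[:3]); T[:3, 3] = S[3:]
    return T

def fk_poe(S_list, thetas, M):
    """Product of exponentials: X_{0,n} = e^{[S1]t1} ... e^{[Sn]tn} M_{0,n}."""
    X = np.eye(4)
    for S, t in zip(S_list, thetas):
        X = X @ expm(twist_matrix(S) * t)
    return X @ M

# A planar 2R arm with unit link lengths (home configuration M)
S1 = np.array([0, 0, 1, 0, 0, 0])          # rotation about z through the origin
S2 = np.array([0, 0, 1, 0, -1, 0])         # rotation about z through (1, 0, 0)
M = np.eye(4); M[0, 3] = 2.0               # end-effector at (2, 0, 0) at home
print(fk_poe([S1, S2], [np.pi / 2, 0.0], M)[:3, 3])   # ~ (0, 2, 0)
```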

• Newton-Euler formulation for inverse dynamics
(Lie group & Lie algebra, http://goo.gl/uqilDV)

[Figure: the external force acting on the ith body and the forces propagated along the chain]

CHALLENGES

https://www.youtube.com/watch?v=oxA2O-tHftI

Thank you
