
Machine learning theory

 Instructor
• Jun-Won Choi
• Office: Engineering Center Annex (공업센터별관), Room 303-1
• Phone: 02-2220-2316
• Office hours: meetings can be scheduled individually via email

 Prerequisite
• Basic knowledge of probability theory and linear algebra
Textbook
 Main textbook: Bishop, “Pattern Recognition and Machine Learning”

 Duda, Hart, Stork, “Pattern Classification, 2nd edition”


Textbook
 Haykin, “Neural Networks and Learning Machines”

 Goodfellow, “Deep Learning”


Rebirth of artificial intelligence
 Ray Kurzweil predicted that the technological singularity (the moment when
computers’ intelligence exceeds human intelligence) would arrive in the year 2020.

Deep learning
 Named one of the “10 Breakthrough Technologies 2013” by MIT Technology Review

Deep learning = Deep neural network (DNN)
 Deep = many hidden layers

Data → representation of data → F(x)
Applications of deep learning
 Computer vision
• Image classification, object detection, face recognition, face detection,
action recognition, scene understanding
 Natural language processing, speech recognition, translation
• Natural language processing, speech recognition, question answering,
language translation, image captioning
 Regression & forecasting
• Time-series analysis, financial modeling
 Medical diagnosis
• Breast cancer cell mitosis detection, bio-informatics
 Autonomous vehicle & robotics
• Pedestrian detection, traffic sign recognition, lane tracking, car accident
avoidance, driver behavior modeling

Performance of deep learning
 ImageNet: large image dataset (over 14 million images)
 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Image classification competition with 1.2 million labeled images and 1,000
categories
• Annual challenge since 2010

Deep learning has been used since 2012.
Why does deep learning work so well?
 Hierarchical feature learning through a deep architecture (inspired by
processing in the human brain)
 A deep architecture can represent more complex data structure than a
shallow one.

Dark age of neural network
 Most of the theory was developed in the 1980s and 1990s
 However, neural networks were not successful
• Vanishing gradient problem
• Slow training
• Local optima
• The parameter space is highly nonconvex
• Gradient descent easily gets stuck at bad local minima
 Most machine learning research focused on model-based learning
• Support vector machine (SVM), Gaussian mixture model (GMM), kernel
methods

Renaissance of deep neural network
 Unsupervised pre-training (Hinton, 2006)
• Training using back-propagation suffers from local minima and over-fitting
• Weights are pre-trained without labels (unsupervised learning)
• After pre-training, fine-tuning is performed using labels.

(Figure: Restricted Boltzmann Machine (RBM) and Deep Belief Network)


Renaissance of deep neural network
 Big data
• Need to train hundreds of millions of parameters
• Overfitting issue can be resolved by using a large amount of data
 GPU
• Highly optimized for parallel processing and matrix operations
• NVIDIA GeForce TITAN X: 3072 CUDA cores, 12 GB memory, 7
TFLOPS single precision
• Various development frameworks
• Based on CUDA programming

Renaissance of deep neural network
 Top researchers are being hired by companies
Geoffrey Hinton (Toronto)

Andrew Ng (Stanford)

Yoshua Bengio (Montreal)

Yann LeCun (NYU)

Popular structures of deep neural network
 Fully-connected deep neural network (DNN)

 Convolutional neural network (CNN)

 Recurrent neural network (RNN)

 Reinforcement learning

Application of CNN: object detection
 Experimental results
(Detection examples: car, horse, cow, train, motorbike)
Application of RNN: image captioning

Application of reinforcement learning
 AlphaGo

• Supervised learning: train the rollout/SL policy networks to predict the next
move using 15,000 records of previous games from the KGS Go server.
• Reinforcement learning: the RL policy is initialized with the SL policy and
fine-tuned by self-play.
Introduction to machine learning
 Reading
• Duda book Ch. 1.
• Bishop book Ch. 1.
Machine learning
 Components of machine learning
• Data
• Task
• Regression: map the data to a continuous quantity.
• Classification: map the data to a discrete quantity (class label).
• Learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning

(Diagram: data → learning → regression / classification)
Classification example [Duda] pp. 1-9

 Classification between salmon and sea bass

Salmon vs. sea bass
• We take a picture of each fish and decide whether it is a salmon or a sea bass
based on the picture.
• The two fish differ in length, lightness, width, number and shape of fins,
position of the mouth, …
• There are variations in lighting and in the position of the fish on the conveyor.
Classification example
 Learning from the data
• Supervised learning vs. unsupervised learning
• Supervised learning: we know the right answer, called the “label”.
(Example: fish images with labels, e.g., B = sea bass, S = salmon)
• Unsupervised learning: no teacher; clustering is performed.


Machine learning
 Two stages of machine learning
• Training phase
• Collect the “training data”.
• The model is trained using the training data.
• Try to minimize the classification errors on the training data.
• Test phase
• Evaluate on data unseen by the classifier.
• Low training error does not always lead to low test error.
(Diagram: feature extraction → classifier → “Salmon?” / “Sea bass?”, with training)
Classification example
 Feature extraction
• Length
• Lightness
(Histograms of each feature for the two classes)
Classification example
 Feature space

• We might use more features for improvement.


Classification example
 Overfitting problem

• Classification is perfect for the training data.
• But performance may not be good for data not yet seen.
• This is the issue of generalization.
Classification example
 Finding the best decision boundary (in terms of generalization
performance) is the key problem of machine learning.
Classification example
 Feature extraction
• Find the most distinguishable representation of the data.
• With less training data, we need more domain knowledge.
• Features should be robust to noise, errors, rotations, translations,
illumination changes, etc.
Classification example
 Classify handwritten digit
• MNIST dataset
• 60,000 training images (= labeled data)
• 10,000 test images
• Each image has been resized to 28×28 pixels.
• It is not easy to recognize the digits due to the variability of handwriting.
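A minimal sketch of loading MNIST and training a simple baseline classifier. It assumes scikit-learn is available; the fetch_openml loader and the logistic-regression baseline are illustrative choices, not part of the course material.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Fetch MNIST: 70,000 grayscale digits, each flattened to 784 = 28*28 features
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Conventional split: first 60,000 images for training, last 10,000 for testing
X_train, y_train = X[:60000] / 255.0, y[:60000]
X_test, y_test = X[60000:] / 255.0, y[60000:]

# A simple linear baseline; deep networks do much better on this task
clf = LogisticRegression(max_iter=50)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))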
Classification example
 Classifier performance
Regression example
 Regression task
• The desired output is a continuous variable.
• Suppose that several (input, target) data pairs are given. These data pairs
are the training data.
• The goal of regression is to obtain the target output for unseen input
data.
(Plot: (size, price) training data — price (600k–800k) vs. house size (2000–3000 sqft),
with an unknown price “?” to be predicted)
Regression example
 Curve fitting problem
• Ten data points are given.
• Data is generated from the model $t = \sin(2\pi x) + \text{noise}$ (cf. Bishop, Ch. 1),
and we wish to predict $t$ at new inputs.
Regression example
 Polynomial curve fitting
• We fit the data using a polynomial function of the form
$y(x, \mathbf{w}) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
where $M$ is the order of the polynomial.
• Least squares method
• We minimize the squared error function
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$
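A small sketch of the least-squares fit using NumPy's polyfit; the sin(2πx) data generator, the noise level, and the choice M = 3 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                                   # N = 10 input points
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy targets

M = 3                               # order of the polynomial
w = np.polyfit(x, t, deg=M)         # least-squares estimate of the coefficients w
y_fit = np.polyval(w, x)            # fitted values y(x_n, w)
E = 0.5 * np.sum((y_fit - t) ** 2)  # squared error function E(w)
print("E(w) =", E)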
Regression example
 How to choose $M$?
• Model selection problem.
• Not easy to solve.

• The weights of the polynomial are tuned to fit the random noise → overfitting
problem
Regression example
 Overfitting problem
• With $M = 9$, we can make the training error zero.
• However, the test error becomes very large.
Regression example
 How to overcome overfitting
• More data points can resolve overfitting (even with $M = 9$).
• In practice, however, the data is limited.
• We should know the proper model complexity!
Regression example
 Regularization
• Alleviates the overfitting problem.
• Shrinkage method
• Add a penalty term which discourages the coefficients from reaching
large values (see the regularized error function below).
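The standard shrinkage-penalized error function (cf. Bishop, Ch. 1):

$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$

where $\lambda$ controls the relative importance of the penalty term compared with the sum-of-squares error.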
Regression example
 S-fold cross-validation – choose the best model complexity
• A fraction (S−1)/S of the available data is used for training and the rest is
used for performance evaluation.
• This process is repeated for all S possible choices of the held-out portion.
• Finally, the performance scores from all runs are averaged.
• The averaged performance scores are compared across the model
candidates (see the sketch below).
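A minimal sketch of S-fold cross-validation for choosing the polynomial order M; the synthetic data, S = 5, and the candidate range of M are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

S = 5
folds = np.array_split(rng.permutation(len(x)), S)   # S disjoint index sets

for M in range(10):                                  # candidate model complexities
    scores = []
    for i in range(S):
        val = folds[i]                               # held-out 1/S portion
        trn = np.concatenate([folds[j] for j in range(S) if j != i])
        w = np.polyfit(x[trn], t[trn], deg=M)        # train on the (S-1)/S portion
        scores.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    print(M, np.mean(scores))                        # averaged validation error per M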
Probability theory
 Marginalization
 Conditional probability
 Joint probability distribution
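In standard notation (following Bishop, the main textbook):

• Marginalization (sum rule): $p(X) = \sum_{Y} p(X, Y)$
• Conditional probability: $p(Y \mid X) = \dfrac{p(X, Y)}{p(X)}$
• Joint distribution (product rule): $p(X, Y) = p(Y \mid X)\, p(X)$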
Probability theory
 Expectation

 Conditional expectation

 Variance
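In standard notation:

• Expectation: $\mathbb{E}[f] = \sum_{x} p(x) f(x)$ (or $\int p(x) f(x)\, dx$ for continuous $x$)
• Conditional expectation: $\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y) f(x)$
• Variance: $\operatorname{var}[f] = \mathbb{E}\!\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$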
Probability theory
 Covariance

 Covariance matrix
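In standard notation:

• Covariance: $\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\!\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
• Covariance matrix: $\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\!\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^{\mathsf T}\right]$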
Probability theory
 Bayes’ theorem
$p(\mathbf{w} \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$
• $p(\mathcal{D} \mid \mathbf{w})$: likelihood function, $p(\mathbf{w})$: prior distribution,
$p(\mathbf{w} \mid \mathcal{D})$: a posteriori distribution, $\mathcal{D}$: observed data
• The negative log-likelihood is called the “error function”.


Probability theory
 Gaussian distribution

 Multi-variate Gaussian distribution
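In standard notation:

• Univariate: $\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}$
• Multivariate ($D$-dimensional): $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \dfrac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$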


Probability theory
 Maximum likelihood estimation
• We want to estimate $\mu$ and $\sigma^2$ from data samples drawn i.i.d. from a
Gaussian distribution.
• The ML estimator of the variance is biased (it underestimates $\sigma^2$).
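The ML estimators and the bias of the variance estimate (cf. Bishop, Ch. 1):

$\mu_{\mathrm{ML}} = \dfrac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \dfrac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2, \qquad \mathbb{E}\!\left[\sigma^2_{\mathrm{ML}}\right] = \dfrac{N-1}{N}\, \sigma^2$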


Probability theory
 Maximum likelihood estimation for curve fitting
• Likelihood function
• Log-likelihood function
• The ML estimate is obtained by maximizing the log-likelihood function with
respect to the parameters $\mathbf{w}$ and $\beta$.
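The standard forms for this model (cf. Bishop, Ch. 1), assuming Gaussian noise with precision $\beta$:

$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$

$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\dfrac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \dfrac{N}{2} \ln \beta - \dfrac{N}{2} \ln(2\pi)$

Maximizing with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error.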
Probability theory
 Bayesian curve fitting 1
• Maximum a posteriori (MAP) estimation of $\mathbf{w}$
• Prior distribution of $\mathbf{w}$
• A posteriori distribution of $\mathbf{w}$
• The MAP estimate is obtained by minimizing a penalized error function, where
the prior contributes the regularization term.
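Under a zero-mean isotropic Gaussian prior with precision $\alpha$ (cf. Bishop, Ch. 1):

$p(\mathbf{w} \mid \alpha) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}\right) \propto \exp\!\left\{ -\dfrac{\alpha}{2}\, \mathbf{w}^{\mathsf T} \mathbf{w} \right\}$

$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$

The MAP estimate minimizes
$\dfrac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \dfrac{\alpha}{2}\, \mathbf{w}^{\mathsf T} \mathbf{w}$,
where the second term is the regularization term (corresponding to $\lambda = \alpha / \beta$).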
Probability theory
 Bayesian curve fitting 2
• A posteriori (predictive) distribution of the target $t$ given the input $x$ and the
training data
• Is this a posteriori distribution Gaussian?
• What are the mean and (co)variance?
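The predictive distribution obtained by integrating over $\mathbf{w}$ (cf. Bishop, Ch. 1); it is indeed Gaussian, with input-dependent mean and variance:

$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} = \mathcal{N}\!\left(t \mid m(x), s^2(x)\right)$

$m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S}\, \boldsymbol{\phi}(x)$

$\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathsf T}, \qquad \boldsymbol{\phi}(x) = (1, x, x^2, \ldots, x^M)^{\mathsf T}$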
