
Machine learning theory

 Instructor
• Jun-Won Choi
• Office: Engineering Center Annex (공업센터별관), Room 303-1
• Phone: 02-2220-2316
• Office hours: meetings can be scheduled individually via email

 Prerequisite
• Basic knowledge of probability theory and linear algebra
Textbook
 Main textbook: Bishop, “Pattern Recognition and Machine Learning”

 Duda, Hart, Stork, “Pattern Classification, 2nd edition”


Textbook
 Haykin, “Neural Networks and Learning Machines”

 Goodfellow, “Deep Learning”


Rebirth of artificial intelligence
 Ray Kurzweil predicted that the technological singularity (the moment when
computers’ intelligence exceeds human intelligence) would arrive in the year 2020.

Deep learning
 Named one of the “10 Breakthrough Technologies 2013” by MIT Technology Review

Deep learning = Deep neural network (DNN)
 Deep = many hidden layers

Data → representation of data → F(x)
Applications of deep learning
 Computer vision
• Image classification, object detection, face recognition, face detection,
action recognition, scene understanding
 Natural language processing, speech recognition, translation
• Natural language processing, speech recognition, question answering,
language translation, image captioning
 Regression & forecasting
• Time-series analysis, financial modeling
 Medical diagnosis
• Breast cancer cell mitosis detection, bio-informatics
 Autonomous vehicle & robotics
• Pedestrian detection, traffic sign recognition, lane tracking, car accident
avoidance, driver behavior modeling

Performance of deep learning
 ImageNet: large image dataset (over 14 million images)
 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Image classification competition with 1.2 million labeled images and 1,000
categories
• Annual challenge since 2010

Deep learning has been used since 2012.
Why does deep learning work so well?
 Hierarchical feature learning through a deep architecture (inspired by
processing in the human brain)
 A deep architecture can represent more complex data structure than a
shallow one.

Dark age of neural network
 Most of the theory was developed in the 1980s and 1990s
 However, neural networks were not successful
• Vanishing gradient problem
• Slow training
• Local optima
• The parameter space is highly nonconvex
• Gradient descent easily gets stuck at bad local minima
 Most machine learning research focused on model-based learning
• Support vector machine (SVM), Gaussian mixture model (GMM), kernel
methods

Renaissance of deep neural network
 Unsupervised pre-training (Hinton, 2006)
• Training using back-propagation suffers from local minima and over-fitting
• Weights are pre-trained without labels (unsupervised learning)
• After pre-training, fine-tuning is performed using labels.

(Figure: Restricted Boltzmann Machine (RBM) and Deep Belief Network)


Renaissance of deep neural network
 Big data
• Need to train hundreds of millions of parameters
• Overfitting issue can be resolved by using a large amount of data
 GPU
• Highly optimized for parallel processing and matrix operations
• NVIDIA GeForce TITAN X: 3072 CUDA cores, 12 GB memory, 7
TFLOPS single precision
• Various development frameworks
• Based on CUDA programming

Renaissance of deep neural network
 Top researchers are being hired by companies
Geoffrey Hinton (Toronto)

Andrew Ng (Stanford)

Yoshua Bengio (Montreal)

Yann LeCun (NYU)

Popular structures of deep neural network
 Fully-connected deep neural network (DNN)

 Convolutional neural network (CNN)

 Recurrent neural network (RNN)

 Reinforcement learning

Application of CNN: object detection
 Experimental results
(Detection examples: car, horse, cow, train, motorbike)
Application of RNN: image captioning

Application of reinforcement learning
 AlphaGo

• Supervised learning: train the rollout/SL policy networks to predict the next
move using 15,000 records of previous games from the KGS Go server.
• Reinforcement learning: the RL policy is initialized with the SL policy and
fine-tuned by self-play.
Introduction to machine learning
 Reading
• Duda book Ch. 1.
• Bishop book Ch. 1.
Machine learning
 Components of machine learning
• Data
• Task
• Regression: map the data to a continuous quantity.
• Classification: map the data to a discrete quantity (class label).
• Learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning

(Diagram: data → learning → regression / classification)
Classification example [Duda] pp. 1-9

 Classification between salmon and sea bass

Salmon vs. sea bass
• We take a picture of each fish and decide whether it is a salmon or a sea bass
based on the picture.
• The two fish differ in length, lightness, width, number and shape of fins,
position of the mouth, …
• There are variations in lighting and in the position of the fish on the conveyor.
Classification example
 Learning from the data
• Supervised learning vs. unsupervised learning
• Supervised learning: we know the right answer, called the “label”.
(Example: fish images with labels, e.g., B = sea bass, S = salmon)
• Unsupervised learning: no teacher; clustering is performed.


Machine learning
 Two stages of machine learning
• Training phase
• Collect the “training data”.
• The model is trained using the training data.
• Try to minimize the classification errors on the training data.
• Test phase
• Evaluate on data unseen by the classifier.
• Low training error does not always lead to low test error.
(Diagram: feature extraction → classifier → “Salmon?” / “Sea bass?”, with training)
Classification example
 Feature extraction
• Length
• Lightness
(Histograms of each feature for the two classes)
Classification example
 Feature space

• We might use more features for improvement.


Classification example
 Overfitting problem

• Classification is perfect for the training data.
• But performance may not be good for data not yet seen.
• This is the issue of generalization.
Classification example
 Finding the best decision boundary (in terms of generalization
performance) is the key problem of machine learning.
Classification example
 Feature extraction
• Find the most distinguishable representation of the data.
• With less training data, we need more domain knowledge.
• Features should be robust to noise, errors, rotations, translations,
illumination changes, etc.
Classification example
 Classify handwritten digit
• MNIST dataset
• 60,000 training images (= labeled data)
• 10,000 test images
• Each image has been resized to 28×28 pixels.
• It is not easy to recognize the digits due to the variability of handwriting.
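A minimal sketch of loading MNIST and training a simple baseline classifier. It assumes scikit-learn is available; the fetch_openml loader and the logistic-regression baseline are illustrative choices, not part of the course material.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Fetch MNIST: 70,000 grayscale digits, each flattened to 784 = 28*28 features
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Conventional split: first 60,000 images for training, last 10,000 for testing
X_train, y_train = X[:60000] / 255.0, y[:60000]
X_test, y_test = X[60000:] / 255.0, y[60000:]

# A simple linear baseline; deep networks do much better on this task
clf = LogisticRegression(max_iter=50)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))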
Classification example
 Classifier performance
Regression example
 Regression task
• The desired output is a continuous variable.
• Suppose that several (input, target) data pairs are given. These data pairs
are the training data.
• The goal of regression is to obtain the target output for unseen input
data.
(Plot: (size, price) training data — price (600k–800k) vs. house size (2000–3000 sqft),
with an unknown price “?” to be predicted)
Regression example
 Curve fitting problem
• Ten data points are given.
• Data is generated from the model $t = \sin(2\pi x) + \text{noise}$ (cf. Bishop, Ch. 1),
and we wish to predict $t$ at new inputs.
Regression example
 Polynomial curve fitting
• We fit the data using a polynomial function of the form
$y(x, \mathbf{w}) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
where $M$ is the order of the polynomial.
• Least squares method
• We minimize the squared error function
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$
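A small sketch of the least-squares fit using NumPy's polyfit; the sin(2πx) data generator, the noise level, and the choice M = 3 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                                   # N = 10 input points
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)  # noisy targets

M = 3                               # order of the polynomial
w = np.polyfit(x, t, deg=M)         # least-squares estimate of the coefficients w
y_fit = np.polyval(w, x)            # fitted values y(x_n, w)
E = 0.5 * np.sum((y_fit - t) ** 2)  # squared error function E(w)
print("E(w) =", E)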
Regression example
 How to choose $M$?
• Model selection problem.
• Not easy to solve.

• The weights of the polynomial are tuned to fit the random noise → overfitting
problem
Regression example
 Overfitting problem
• With $M = 9$, we can make the training error zero.
• However, the test error becomes very large.
Regression example
 How to overcome overfitting
• More data points can resolve overfitting (even with $M = 9$).
• In practice, however, the data is limited.
• We should know the proper model complexity!
Regression example
 Regularization
• Alleviates the overfitting problem.
• Shrinkage method
• Add a penalty term which discourages the coefficients from reaching
large values (see the regularized error function below).
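The standard shrinkage-penalized error function (cf. Bishop, Ch. 1):

$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$

where $\lambda$ controls the relative importance of the penalty term compared with the sum-of-squares error.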
Regression example
 S-fold cross-validation – choose the best model complexity
• A fraction (S−1)/S of the available data is used for training and the rest is
used for performance evaluation.
• This process is repeated for all S possible choices of the held-out portion.
• Finally, the performance scores from all runs are averaged.
• The averaged performance scores are compared across the model
candidates (see the sketch below).
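A minimal sketch of S-fold cross-validation for choosing the polynomial order M; the synthetic data, S = 5, and the candidate range of M are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

S = 5
folds = np.array_split(rng.permutation(len(x)), S)   # S disjoint index sets

for M in range(10):                                  # candidate model complexities
    scores = []
    for i in range(S):
        val = folds[i]                               # held-out 1/S portion
        trn = np.concatenate([folds[j] for j in range(S) if j != i])
        w = np.polyfit(x[trn], t[trn], deg=M)        # train on the (S-1)/S portion
        scores.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    print(M, np.mean(scores))                        # averaged validation error per M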
Probability theory
 Marginalization
 Conditional probability
 Joint probability distribution
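In standard notation (following Bishop, the main textbook):

• Marginalization (sum rule): $p(X) = \sum_{Y} p(X, Y)$
• Conditional probability: $p(Y \mid X) = \dfrac{p(X, Y)}{p(X)}$
• Joint distribution (product rule): $p(X, Y) = p(Y \mid X)\, p(X)$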
Probability theory
 Expectation

 Conditional expectation

 Variance
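In standard notation:

• Expectation: $\mathbb{E}[f] = \sum_{x} p(x) f(x)$ (or $\int p(x) f(x)\, dx$ for continuous $x$)
• Conditional expectation: $\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y) f(x)$
• Variance: $\operatorname{var}[f] = \mathbb{E}\!\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$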
Probability theory
 Covariance

 Covariance matrix
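In standard notation:

• Covariance: $\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\!\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
• Covariance matrix: $\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\!\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^{\mathsf T}\right]$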
Probability theory
 Bayes’ theorem
$p(\mathbf{w} \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$
• $p(\mathcal{D} \mid \mathbf{w})$: likelihood function, $p(\mathbf{w})$: prior distribution,
$p(\mathbf{w} \mid \mathcal{D})$: a posteriori distribution, $\mathcal{D}$: observed data
• The negative log-likelihood is called the “error function”.


Probability theory
 Gaussian distribution

 Multi-variate Gaussian distribution
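In standard notation:

• Univariate: $\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}$
• Multivariate ($D$-dimensional): $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \dfrac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$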


Probability theory
 Maximum likelihood estimation
• We want to estimate $\mu$ and $\sigma^2$ from data samples drawn i.i.d. from a
Gaussian distribution.
• The ML estimator of the variance is biased (it underestimates $\sigma^2$).
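The ML estimators and the bias of the variance estimate (cf. Bishop, Ch. 1):

$\mu_{\mathrm{ML}} = \dfrac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \dfrac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2, \qquad \mathbb{E}\!\left[\sigma^2_{\mathrm{ML}}\right] = \dfrac{N-1}{N}\, \sigma^2$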


Probability theory
 Maximum likelihood estimation for curve fitting
• Likelihood function
• Log-likelihood function
• The ML estimate is obtained by maximizing the log-likelihood function with
respect to the parameters $\mathbf{w}$ and $\beta$.
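The standard forms for this model (cf. Bishop, Ch. 1), assuming Gaussian noise with precision $\beta$:

$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$

$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\dfrac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \dfrac{N}{2} \ln \beta - \dfrac{N}{2} \ln(2\pi)$

Maximizing with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error.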
Probability theory
 Bayesian curve fitting 1
• Maximum a posteriori (MAP) estimation of $\mathbf{w}$
• Prior distribution of $\mathbf{w}$
• A posteriori distribution of $\mathbf{w}$
• The MAP estimate is obtained by minimizing a penalized error function, where
the prior contributes the regularization term.
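Under a zero-mean isotropic Gaussian prior with precision $\alpha$ (cf. Bishop, Ch. 1):

$p(\mathbf{w} \mid \alpha) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}\right) \propto \exp\!\left\{ -\dfrac{\alpha}{2}\, \mathbf{w}^{\mathsf T} \mathbf{w} \right\}$

$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$

The MAP estimate minimizes
$\dfrac{\beta}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \dfrac{\alpha}{2}\, \mathbf{w}^{\mathsf T} \mathbf{w}$,
where the second term is the regularization term (corresponding to $\lambda = \alpha / \beta$).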
Probability theory
 Bayesian curve fitting 2
• A posteriori (predictive) distribution of the target $t$ given the input $x$ and the
training data
• Is this a posteriori distribution Gaussian?
• What are the mean and (co)variance?
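The predictive distribution obtained by integrating over $\mathbf{w}$ (cf. Bishop, Ch. 1); it is indeed Gaussian, with input-dependent mean and variance:

$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} = \mathcal{N}\!\left(t \mid m(x), s^2(x)\right)$

$m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S}\, \boldsymbol{\phi}(x)$

$\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathsf T}, \qquad \boldsymbol{\phi}(x) = (1, x, x^2, \ldots, x^M)^{\mathsf T}$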
