
Class Summary

Jia-Bin Huang
ECE-5424G / CS-5824, Virginia Tech, Spring 2019
• Thank you all for participating in this class!

• SPOT survey!

• Please give us feedback: lectures, topics, homework, exams, office hours, Piazza
Machine learning algorithms
• Supervised learning
  • Discrete outputs: Classification
  • Continuous outputs: Regression
• Unsupervised learning
  • Discrete structure: Clustering
  • Continuous structure: Dimensionality reduction
k-NN (Classification/Regression)
• Model
  Non-parametric: the stored training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ itself
• Cost function
  None
• Learning
  Do nothing (just store the training data)
• Inference
  $\hat{y} = y^{(k^*)}$, where $k^* = \arg\min_i \mathrm{dist}(x^{(i)}, x)$ (1-NN); for $k$-NN, take the majority vote (classification) or the average (regression) of the $k$ nearest neighbors (see the sketch below)
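A minimal NumPy sketch of k-NN inference (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote over its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels (use the mean instead for regression)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```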
Linear regression (Regression)
• Model
  $h_\theta(x) = \theta^\top x$
• Cost function
  $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning
  1) Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference
  $\hat{y} = h_\theta(x) = \theta^\top x$ (sketch of both learning procedures below)
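A short NumPy sketch of the two learning procedures, gradient descent and the normal equation (function names and toy data are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the squared-error cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J(theta)
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: the first column of X is the bias feature x0 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(fit_normal_equation(X, y))    # ~ [0, 2]
print(fit_gradient_descent(X, y))   # converges to roughly the same theta
```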
Naïve Bayes (Classification)
• Model
  $P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{j=1}^{n} P(x_j \mid y)$ (features assumed conditionally independent given the class)
• Cost function
  Maximum likelihood estimation: $\max_\theta \sum_i \log P(x^{(i)}, y^{(i)}; \theta)$
  Maximum a posteriori estimation: maximize the likelihood times a prior on the parameters (e.g., Laplace smoothing of the counts)
• Learning
  (Discrete $x_j$) estimate $P(x_j \mid y)$ and $P(y)$ by counting frequencies in the training data
  (Continuous $x_j$) fit a Gaussian per class and feature: mean $\mu_{jy}$, variance $\sigma_{jy}^2$ (Gaussian sketch below)
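A compact NumPy sketch of the continuous (Gaussian) case, assuming maximum likelihood estimates of the per-class means and variances (function names and toy data are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class priors and per-class feature means/variances (MLE)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),         # per-feature mean
                     Xc.var(axis=0) + 1e-9)   # per-feature variance (small floor)
    return params

def predict_gaussian_nb(params, x):
    """Return the class maximizing log P(y=c) + sum_j log N(x_j; mu_jc, var_jc)."""
    def log_posterior(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_posterior)

# Toy usage
X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 0.5], [3.8, 0.7]])
y = np.array([0, 0, 1, 1])
print(predict_gaussian_nb(fit_gaussian_nb(X, y), np.array([4.1, 0.6])))  # -> 1
```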
Logistic regression (Classification)
• Model
  $h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
• Learning
  Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ } (sketch below)
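A minimal NumPy sketch of logistic regression trained by gradient descent (names and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.5, num_iters=2000):
    """Gradient descent on the cross-entropy cost; the update has the same
    form as linear regression but with the sigmoid hypothesis."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta

# Toy usage: the first column is the bias feature x0 = 1
X = np.array([[1.0, 0.1], [1.0, 0.4], [1.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic_regression(X, y)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # -> [0 0 1 1]
```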


Hard-margin SVM formulation
$\min_{w, b} \; \frac{1}{2} \|w\|^2$ subject to $y^{(i)} (w^\top x^{(i)} + b) \geq 1$ for all $i$
[Figure: maximum-margin separating hyperplane in the $(x_1, x_2)$ plane; the margin is the distance from the hyperplane to the closest points]

Soft-margin SVM formulation
$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$ subject to $y^{(i)} (w^\top x^{(i)} + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
[Figure: slack variables allow some points to fall inside the margin]
SVM with kernels
• Hypothesis: Given $x$, compute features $f = [f_0, f_1, f_2, \ldots, f_m]^\top$, where $f_i = \mathrm{similarity}(x, l^{(i)})$ for landmarks $l^{(i)}$ (e.g., a Gaussian kernel)
• Predict $y = 1$ if $\theta^\top f \geq 0$
• Training (original): minimize the hinge-loss objective over the raw features $x^{(i)}$
• Training (with kernel): $\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$
SVM parameters
• $C$ ($= 1/\lambda$)
  Large $C$: lower bias, higher variance.
  Small $C$: higher bias, lower variance.
• $\sigma^2$ (Gaussian kernel width)
  Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
  Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
(see the sketch below)

Slide credit: Andrew Ng
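A small scikit-learn sketch (assuming scikit-learn is installed) showing where the two knobs appear; for the RBF kernel, `gamma` plays the role of $1/(2\sigma^2)$, and the toy data are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two clusters
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

# Large C -> lower bias, higher variance; small C -> higher bias, lower variance.
# Large sigma^2 (small gamma) -> smoother features -> higher bias, lower variance;
# small sigma^2 (large gamma) -> lower bias, higher variance.
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [0.9, 1.1]]))  # -> [0 1]
```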


Neural network
[Figure: a three-layer network; inputs $x_0, x_1, x_2, x_3$ (Layer 1), hidden units $a^{(2)}_0, \ldots, a^{(2)}_3$ (Layer 2), output $h_\Theta(x)$ (Layer 3)]
Slide credit: Andrew Ng
Neural network
• $a^{(j)}_i$ = "activation" of unit $i$ in layer $j$
• $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
• $s_j$ = number of units in layer $j$
• Size of $\Theta^{(j)}$? $s_{j+1} \times (s_j + 1)$
[Figure: the same three-layer network]
Slide credit: Andrew Ng
Neural network "Pre-activation"
• $a^{(2)}_i = g(z^{(2)}_i)$, where the pre-activation is $z^{(2)}_i = \Theta^{(1)}_{i0} x_0 + \Theta^{(1)}_{i1} x_1 + \Theta^{(1)}_{i2} x_2 + \Theta^{(1)}_{i3} x_3$
• $h_\Theta(x) = a^{(3)}_1 = g\left(\Theta^{(2)}_{10} a^{(2)}_0 + \Theta^{(2)}_{11} a^{(2)}_1 + \Theta^{(2)}_{12} a^{(2)}_2 + \Theta^{(2)}_{13} a^{(2)}_3\right)$
[Figure: the same three-layer network]
Slide credit: Andrew Ng


Neural network "Pre-activation" (vectorized)
• $z^{(2)} = \Theta^{(1)} x$, $\; a^{(2)} = g(z^{(2)})$
• Add the bias unit $a^{(2)}_0 = 1$
• $z^{(3)} = \Theta^{(2)} a^{(2)}$, $\; h_\Theta(x) = a^{(3)} = g(z^{(3)})$
[Figure: the same three-layer network]
Slide credit: Andrew Ng


Neural network learning its own features
• The hidden activations $a^{(2)}_1, a^{(2)}_2, a^{(2)}_3$ act as features learned from the data; the output layer is just logistic regression on those features (forward-propagation sketch below)
[Figure: the same three-layer network]
Slide credit: Andrew Ng
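A forward-propagation sketch in NumPy for the three-layer network above (sigmoid activation assumed; the weight values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: Layer 1 (inputs) -> Layer 2 (hidden) -> Layer 3 (output)."""
    a1 = np.concatenate(([1.0], x))              # add bias unit x0 = 1
    z2 = Theta1 @ a1                             # pre-activation of layer 2
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # add bias unit a0^(2) = 1
    z3 = Theta2 @ a2                             # pre-activation of layer 3
    return sigmoid(z3)                           # h_Theta(x)

# Shapes follow s_{j+1} x (s_j + 1): here s1 = 3 inputs, s2 = 3 hidden units, s3 = 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
print(forward(np.array([0.5, -1.0, 2.0]), Theta1, Theta2))
```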
Bias / Variance Trade-off
• Training error: $J_{\mathrm{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Cross-validation error: $J_{\mathrm{cv}}(\theta) = \frac{1}{2 m_{\mathrm{cv}}} \sum_{i=1}^{m_{\mathrm{cv}}} \left( h_\theta(x_{\mathrm{cv}}^{(i)}) - y_{\mathrm{cv}}^{(i)} \right)^2$
[Plot: loss vs. degree of polynomial; training error keeps decreasing with degree, cross-validation error is U-shaped]
Source: Andrew Ng
Bias / Variance Trade-off
• High bias (low polynomial degree): both training and cross-validation error are high
• High variance (high polynomial degree): training error is low, cross-validation error is much higher
[Plot: loss vs. degree of polynomial with the high-bias (left) and high-variance (right) regimes marked]
Bias / Variance Trade-off with Regularization
• Training error: $J_{\mathrm{train}}(\theta)$ (measured without the regularization term)
• Cross-validation error: $J_{\mathrm{cv}}(\theta)$
[Plot: loss vs. regularization parameter λ]
Source: Andrew Ng
Bias / Variance Trade-off with Regularization
• Small λ: high variance (overfitting); training error is low, cross-validation error is high
• Large λ: high bias (underfitting); both errors are high
[Plot: loss vs. λ; training error increases with λ, cross-validation error is U-shaped]
Source: Andrew Ng
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \ldots, \mu_K$
Repeat {
  Cluster assignment step:
  for $i$ = 1 to $m$
    $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  Centroid update step:
  for $k$ = 1 to $K$
    $\mu_k$ := average (mean) of the points assigned to cluster $k$
}
Slide credit: Andrew Ng
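A direct NumPy translation of the two alternating steps (function name and toy usage are illustrative):

```python
import numpy as np

def kmeans(X, K, num_iters=100, seed=0):
    """Alternate the cluster assignment step and the centroid update step."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly initialize the centroids to K distinct training points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(num_iters):
        # Cluster assignment step: c^(i) = index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Centroid update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
print(kmeans(X, K=2)[0])  # two centroids, near (0.05, 0.1) and (4.95, 5.05)
```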
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\ell(\theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$
• Jensen's inequality: for concave $f$ (such as $\log$), $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$, which gives the lower bound $\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood
  - The lower bound holds for every possible set of distributions $Q_i$
  - We want a tight lower bound: equality in Jensen's inequality
  - When will that happen? When $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$ with probability 1 ($c$ is a constant)
• How should we choose $Q_i$?
  $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$ (because $Q_i$ must sum to 1, i.e., it is a distribution)
EM algorithm
Repeat until convergence {
  (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
  (M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
}
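A 1-D Gaussian mixture example of the E- and M-steps in NumPy (a sketch only; the initialization and numerical safeguards are simplified assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, num_iters=100):
    """EM for a 1-D Gaussian mixture model with K components."""
    pi = np.full(K, 1.0 / K)                          # mixing weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))     # spread initial means over the data
    var = np.full(K, np.var(x))                       # initial variances
    for _ in range(num_iters):
        # E-step: Q_i(z = k) = p(z = k | x_i; theta)
        w = pi * gaussian_pdf(x[:, None], mu, var)    # (m, K) unnormalized posteriors
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters by weighted maximum likelihood
        Nk = w.sum(axis=0)
        pi = Nk / len(x)
        mu = (w * x[:, None]).sum(axis=0) / Nk
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-9
    return pi, mu, var

# Toy usage: samples from two well-separated Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x)[1])  # estimated means, near -2 and 3
```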
Anomaly detection algorithm
1. Choose features $x_j$ that you think might be indicative of anomalous examples
2. Fit parameters $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$:
   $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$, $\quad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$
3. Given a new example $x$, compute $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
   Anomaly if $p(x) < \varepsilon$
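The three steps written out in NumPy (the epsilon threshold and toy data are illustrative):

```python
import numpy as np

def fit_params(X):
    """Step 2: per-feature mean and variance of the (assumed Gaussian) training examples."""
    return X.mean(axis=0), X.var(axis=0)

def p_x(x, mu, var):
    """Step 3: p(x) = product over features of N(x_j; mu_j, sigma_j^2)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

# Toy usage
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]])  # normal examples
mu, var = fit_params(X)
epsilon = 1e-3
x_new = np.array([3.0, 0.0])
print("anomaly" if p_x(x_new, mu, var) < epsilon else "normal")  # -> anomaly
```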
Problem motivation

Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
Love at last         | 5         | 5       | 0         | 0        | 0.9          | 0
Romance forever      | 5         | ?       | ?         | 0        | 1.0          | 0.01
Cute puppies of love | ?         | 4       | 0         | ?        | 0.99         | 0
Nonstop car chases   | 0         | 0       | 5         | 4        | 0.1          | 1.0
Swords vs. karate    | 0         | 0       | 5         | ?        | 0            | 0.9
Problem motivation

Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
Love at last         | 5         | 5       | 0         | 0        | ?            | ?
Romance forever      | 5         | ?       | ?         | 0        | ?            | ?
Cute puppies of love | ?         | 4       | 0         | ?        | ?            | ?
Nonstop car chases   | 0         | 0       | 5         | 4        | ?            | ?
Swords vs. karate    | 0         | 0       | 5         | ?        | ?            | ?

Given the user parameters, infer the movie features:
$\theta^{(1)} = [0; 5; 0]$, $\;\theta^{(2)} = [0; 5; 0]$, $\;\theta^{(3)} = [0; 0; 5]$, $\;\theta^{(4)} = [0; 0; 5]$, $\;x^{(1)} = [?; ?; ?]$
Collaborative filtering optimization objective
• Given $x^{(1)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$:
  $\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
• Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, \ldots, x^{(n_m)}$:
  the symmetric objective, with the regularizer on the $x^{(i)}$ instead
• Minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:
  $J = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
Collaborative filtering algorithm
• Initialize $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values
• Minimize $J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm), updating every $x_k^{(i)}$ and $\theta_k^{(j)}$
• For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$ (see the sketch below)
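A NumPy sketch of the joint minimization by gradient descent (the rating matrix Y, observation mask R, and hyperparameters are illustrative):

```python
import numpy as np

def collab_filter(Y, R, n_features=2, lam=0.1, alpha=0.005, num_iters=5000, seed=0):
    """Jointly learn movie features X and user parameters Theta by gradient
    descent on the regularized squared error over observed ratings (R == 1)."""
    rng = np.random.default_rng(seed)
    n_movies, n_users = Y.shape
    X = 0.1 * rng.standard_normal((n_movies, n_features))
    Theta = 0.1 * rng.standard_normal((n_users, n_features))
    for _ in range(num_iters):
        E = (X @ Theta.T - Y) * R                   # prediction errors on observed entries only
        X -= alpha * (E @ Theta + lam * X)          # gradient step for the movie features
        Theta -= alpha * (E.T @ X + lam * Theta)    # gradient step for the user parameters
    return X, Theta

# Toy usage: 3 movies x 2 users; entries with R = 0 are unobserved
Y = np.array([[5.0, 0.0], [4.0, 0.0], [0.0, 5.0]])
R = np.array([[1, 1], [1, 0], [0, 1]])
X, Theta = collab_filter(Y, R)
print(X @ Theta.T)  # predicted star ratings theta^T x for every (movie, user) pair
```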
Semi-supervised Learning
Problem Formulation
• Labeled data $\mathcal{D}_L = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m_L}$
• Unlabeled data $\mathcal{D}_U = \{x^{(i)}\}_{i=m_L+1}^{m_L+m_U}$ (typically far more plentiful than the labeled data)
• Goal: Learn a hypothesis $h$ (e.g., a classifier) that has small error on unseen examples
Deep Semi-supervised Learning
Ensemble methods
• Ensemble methods
  • Combine multiple classifiers to make a better one
  • Committees, majority vote
  • Weighted combinations
  • Can use the same or different classifiers
• Boosting
  • Train sequentially; later predictors focus on the mistakes made by earlier ones
• Boosting for classification (e.g., AdaBoost; sketch below)
  • Use the results of earlier classifiers to decide what to work on next
  • Weight hard examples more heavily so later classifiers focus on them
  • Example: Viola-Jones face detection
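A scikit-learn sketch of boosting (assuming scikit-learn is available); AdaBoostClassifier's default base learner is a depth-1 decision tree ("stump"), and the toy data are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Toy 1-D data
X = np.array([[0.0], [0.3], [0.6], [1.0]])
y = np.array([0, 0, 1, 1])

# Each new stump is trained with higher weights on the examples
# that the previous stumps misclassified.
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict([[0.2], [0.8]]))  # -> [0 1]
```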
Generative models
Simple Recurrent Network
Reinforcement learning

• Markov decision process


• Q-learning
• Policy gradient
Final exam sample questions
Conceptual questions
• [True/False] Increasing the value of k in a k-nearest neighbor classifier
will decrease its bias
• [True/False] Backpropagation helps neural network training get unstuck from a local minimum
• [True/False] Linear regression can be solved by either matrix algebra
or gradient descent
• [True/False] Logistic regression can be solved by either matrix algebra
or gradient descent
• [True/False] K-means clustering has a unique solution
• [True/False] PCA has a unique solution
Classification/Regression
• Given a simple dataset

• 1) Estimate the parameters

• 2) Compute training error

• 3) Compute leave-one-out cross-validation error

• 4) Compute testing error


Naïve Bayes
• Compute individual probabilities

• Compute the class posterior using the Naïve Bayes classifier
