
Class Summary

Jia-Bin Huang
ECE-5424G / CS-5824, Virginia Tech, Spring 2019
• Thank you all for participating in this class!

• SPOT survey!

• Please give us feedback: lectures, topics, homework, exams, office hours, Piazza
Machine learning algorithms
• Supervised learning
  • Discrete outputs: Classification
  • Continuous outputs: Regression
• Unsupervised learning
  • Discrete structure: Clustering
  • Continuous structure: Dimensionality reduction
k-NN (Classification/Regression)
• Model
  Non-parametric: the stored training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ itself
• Cost function
  None
• Learning
  Do nothing (just store the training data)
• Inference
  $\hat{y} = y^{(k^*)}$, where $k^* = \arg\min_i \mathrm{dist}(x^{(i)}, x)$ (1-NN); for $k$-NN, take the majority vote (classification) or the average (regression) of the $k$ nearest neighbors (see the sketch below)
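A minimal NumPy sketch of k-NN inference (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote over its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels (use the mean instead for regression)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```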
Linear regression (Regression)
• Model
  $h_\theta(x) = \theta^\top x$
• Cost function
  $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning
  1) Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference
  $\hat{y} = h_\theta(x) = \theta^\top x$ (sketch of both learning procedures below)
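A short NumPy sketch of the two learning procedures, gradient descent and the normal equation (function names and toy data are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the squared-error cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J(theta)
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: the first column of X is the bias feature x0 = 1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(fit_normal_equation(X, y))    # ~ [0, 2]
print(fit_gradient_descent(X, y))   # converges to roughly the same theta
```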
Naïve Bayes (Classification)
• Model
  $P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{j=1}^{n} P(x_j \mid y)$ (features assumed conditionally independent given the class)
• Cost function
  Maximum likelihood estimation: $\max_\theta \sum_i \log P(x^{(i)}, y^{(i)}; \theta)$
  Maximum a posteriori estimation: maximize the likelihood times a prior on the parameters (e.g., Laplace smoothing of the counts)
• Learning
  (Discrete $x_j$) estimate $P(x_j \mid y)$ and $P(y)$ by counting frequencies in the training data
  (Continuous $x_j$) fit a Gaussian per class and feature: mean $\mu_{jy}$, variance $\sigma_{jy}^2$ (Gaussian sketch below)
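A compact NumPy sketch of the continuous (Gaussian) case, assuming maximum likelihood estimates of the per-class means and variances (function names and toy data are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class priors and per-class feature means/variances (MLE)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),         # per-feature mean
                     Xc.var(axis=0) + 1e-9)   # per-feature variance (small floor)
    return params

def predict_gaussian_nb(params, x):
    """Return the class maximizing log P(y=c) + sum_j log N(x_j; mu_jc, var_jc)."""
    def log_posterior(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_posterior)

# Toy usage
X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 0.5], [3.8, 0.7]])
y = np.array([0, 0, 1, 1])
print(predict_gaussian_nb(fit_gaussian_nb(X, y), np.array([4.1, 0.6])))  # -> 1
```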
Logistic regression (Classification)
• Model
  $h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
• Learning
  Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ } (sketch below)
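A minimal NumPy sketch of logistic regression trained by gradient descent (names and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.5, num_iters=2000):
    """Gradient descent on the cross-entropy cost; the update has the same
    form as linear regression but with the sigmoid hypothesis."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta

# Toy usage: the first column is the bias feature x0 = 1
X = np.array([[1.0, 0.1], [1.0, 0.4], [1.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic_regression(X, y)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # -> [0 0 1 1]
```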


Hard-margin SVM formulation
$\min_{w, b} \; \frac{1}{2} \|w\|^2$ subject to $y^{(i)} (w^\top x^{(i)} + b) \geq 1$ for all $i$
[Figure: maximum-margin separating hyperplane in the $(x_1, x_2)$ plane; the margin is the distance from the hyperplane to the closest points]

Soft-margin SVM formulation
$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$ subject to $y^{(i)} (w^\top x^{(i)} + b) \geq 1 - \xi_i$, $\xi_i \geq 0$
[Figure: slack variables allow some points to fall inside the margin]
SVM with kernels
• Hypothesis: Given $x$, compute features $f = [f_0, f_1, f_2, \ldots, f_m]^\top$, where $f_i = \mathrm{similarity}(x, l^{(i)})$ for landmarks $l^{(i)}$ (e.g., a Gaussian kernel)
• Predict $y = 1$ if $\theta^\top f \geq 0$
• Training (original): minimize the hinge-loss objective over the raw features $x^{(i)}$
• Training (with kernel): $\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$
SVM parameters
• $C$ ($= 1/\lambda$)
  Large $C$: lower bias, higher variance.
  Small $C$: higher bias, lower variance.
• $\sigma^2$ (Gaussian kernel width)
  Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
  Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
(see the sketch below)

Slide credit: Andrew Ng
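A small scikit-learn sketch (assuming scikit-learn is installed) showing where the two knobs appear; for the RBF kernel, `gamma` plays the role of $1/(2\sigma^2)$, and the toy data are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two clusters
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

# Large C -> lower bias, higher variance; small C -> higher bias, lower variance.
# Large sigma^2 (small gamma) -> smoother features -> higher bias, lower variance;
# small sigma^2 (large gamma) -> lower bias, higher variance.
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [0.9, 1.1]]))  # -> [0 1]
```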


Neural network
[Figure: a three-layer network; inputs $x_0, x_1, x_2, x_3$ (Layer 1), hidden units $a^{(2)}_0, \ldots, a^{(2)}_3$ (Layer 2), output $h_\Theta(x)$ (Layer 3)]
Slide credit: Andrew Ng
Neural network
• $a^{(j)}_i$ = "activation" of unit $i$ in layer $j$
• $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
• $s_j$ = number of units in layer $j$
• Size of $\Theta^{(j)}$? $s_{j+1} \times (s_j + 1)$
[Figure: the same three-layer network]
Slide credit: Andrew Ng
Neural network "Pre-activation"
• $a^{(2)}_i = g(z^{(2)}_i)$, where the pre-activation is $z^{(2)}_i = \Theta^{(1)}_{i0} x_0 + \Theta^{(1)}_{i1} x_1 + \Theta^{(1)}_{i2} x_2 + \Theta^{(1)}_{i3} x_3$
• $h_\Theta(x) = a^{(3)}_1 = g\left(\Theta^{(2)}_{10} a^{(2)}_0 + \Theta^{(2)}_{11} a^{(2)}_1 + \Theta^{(2)}_{12} a^{(2)}_2 + \Theta^{(2)}_{13} a^{(2)}_3\right)$
[Figure: the same three-layer network]
Slide credit: Andrew Ng


Neural network "Pre-activation" (vectorized)
• $z^{(2)} = \Theta^{(1)} x$, $\; a^{(2)} = g(z^{(2)})$
• Add the bias unit $a^{(2)}_0 = 1$
• $z^{(3)} = \Theta^{(2)} a^{(2)}$, $\; h_\Theta(x) = a^{(3)} = g(z^{(3)})$
[Figure: the same three-layer network]
Slide credit: Andrew Ng


Neural network learning its own features
• The hidden activations $a^{(2)}_1, a^{(2)}_2, a^{(2)}_3$ act as features learned from the data; the output layer is just logistic regression on those features (forward-propagation sketch below)
[Figure: the same three-layer network]
Slide credit: Andrew Ng
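A forward-propagation sketch in NumPy for the three-layer network above (sigmoid activation assumed; the weight values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: Layer 1 (inputs) -> Layer 2 (hidden) -> Layer 3 (output)."""
    a1 = np.concatenate(([1.0], x))              # add bias unit x0 = 1
    z2 = Theta1 @ a1                             # pre-activation of layer 2
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # add bias unit a0^(2) = 1
    z3 = Theta2 @ a2                             # pre-activation of layer 3
    return sigmoid(z3)                           # h_Theta(x)

# Shapes follow s_{j+1} x (s_j + 1): here s1 = 3 inputs, s2 = 3 hidden units, s3 = 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
print(forward(np.array([0.5, -1.0, 2.0]), Theta1, Theta2))
```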
Bias / Variance Trade-off
• Training error: $J_{\mathrm{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Cross-validation error: $J_{\mathrm{cv}}(\theta) = \frac{1}{2 m_{\mathrm{cv}}} \sum_{i=1}^{m_{\mathrm{cv}}} \left( h_\theta(x_{\mathrm{cv}}^{(i)}) - y_{\mathrm{cv}}^{(i)} \right)^2$
[Plot: loss vs. degree of polynomial; training error keeps decreasing with degree, cross-validation error is U-shaped]
Source: Andrew Ng
Bias / Variance Trade-off
• High bias (low polynomial degree): both training and cross-validation error are high
• High variance (high polynomial degree): training error is low, cross-validation error is much higher
[Plot: loss vs. degree of polynomial with the high-bias (left) and high-variance (right) regimes marked]
Bias / Variance Trade-off with Regularization
• Training error: $J_{\mathrm{train}}(\theta)$ (measured without the regularization term)
• Cross-validation error: $J_{\mathrm{cv}}(\theta)$
[Plot: loss vs. regularization parameter λ]
Source: Andrew Ng
Bias / Variance Trade-off with Regularization
• Small λ: high variance (overfitting); training error is low, cross-validation error is high
• Large λ: high bias (underfitting); both errors are high
[Plot: loss vs. λ; training error increases with λ, cross-validation error is U-shaped]
Source: Andrew Ng
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \ldots, \mu_K$
Repeat {
  Cluster assignment step:
  for $i$ = 1 to $m$
    $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  Centroid update step:
  for $k$ = 1 to $K$
    $\mu_k$ := average (mean) of the points assigned to cluster $k$
}
Slide credit: Andrew Ng
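A direct NumPy translation of the two alternating steps (function name and toy usage are illustrative):

```python
import numpy as np

def kmeans(X, K, num_iters=100, seed=0):
    """Alternate the cluster assignment step and the centroid update step."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly initialize the centroids to K distinct training points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(num_iters):
        # Cluster assignment step: c^(i) = index of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Centroid update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
print(kmeans(X, K=2)[0])  # two centroids, near (0.05, 0.1) and (4.95, 5.05)
```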
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\ell(\theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$
• Jensen's inequality: for concave $f$ (such as $\log$), $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$, which gives the lower bound $\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood
  - The lower bound holds for every possible set of distributions $Q_i$
  - We want a tight lower bound: equality in Jensen's inequality
  - When will that happen? When $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$ with probability 1 ($c$ is a constant)
• How should we choose $Q_i$?
  $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$ (because $Q_i$ must sum to 1, i.e., it is a distribution)
EM algorithm
Repeat until convergence {
  (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
  (M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
}
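A 1-D Gaussian mixture example of the E- and M-steps in NumPy (a sketch only; the initialization and numerical safeguards are simplified assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, num_iters=100):
    """EM for a 1-D Gaussian mixture model with K components."""
    pi = np.full(K, 1.0 / K)                          # mixing weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))     # spread initial means over the data
    var = np.full(K, np.var(x))                       # initial variances
    for _ in range(num_iters):
        # E-step: Q_i(z = k) = p(z = k | x_i; theta)
        w = pi * gaussian_pdf(x[:, None], mu, var)    # (m, K) unnormalized posteriors
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters by weighted maximum likelihood
        Nk = w.sum(axis=0)
        pi = Nk / len(x)
        mu = (w * x[:, None]).sum(axis=0) / Nk
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-9
    return pi, mu, var

# Toy usage: samples from two well-separated Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x)[1])  # estimated means, near -2 and 3
```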
Anomaly detection algorithm
1. Choose features $x_j$ that you think might be indicative of anomalous examples
2. Fit parameters $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$:
   $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$, $\quad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$
3. Given a new example $x$, compute $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
   Anomaly if $p(x) < \varepsilon$
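The three steps written out in NumPy (the epsilon threshold and toy data are illustrative):

```python
import numpy as np

def fit_params(X):
    """Step 2: per-feature mean and variance of the (assumed Gaussian) training examples."""
    return X.mean(axis=0), X.var(axis=0)

def p_x(x, mu, var):
    """Step 3: p(x) = product over features of N(x_j; mu_j, sigma_j^2)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

# Toy usage
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]])  # normal examples
mu, var = fit_params(X)
epsilon = 1e-3
x_new = np.array([3.0, 0.0])
print("anomaly" if p_x(x_new, mu, var) < epsilon else "normal")  # -> anomaly
```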
Problem motivation

Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
Love at last         | 5         | 5       | 0         | 0        | 0.9          | 0
Romance forever      | 5         | ?       | ?         | 0        | 1.0          | 0.01
Cute puppies of love | ?         | 4       | 0         | ?        | 0.99         | 0
Nonstop car chases   | 0         | 0       | 5         | 4        | 0.1          | 1.0
Swords vs. karate    | 0         | 0       | 5         | ?        | 0            | 0.9
Problem motivation

Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
Love at last         | 5         | 5       | 0         | 0        | ?            | ?
Romance forever      | 5         | ?       | ?         | 0        | ?            | ?
Cute puppies of love | ?         | 4       | 0         | ?        | ?            | ?
Nonstop car chases   | 0         | 0       | 5         | 4        | ?            | ?
Swords vs. karate    | 0         | 0       | 5         | ?        | ?            | ?

Given the user parameters, infer the movie features:
$\theta^{(1)} = [0; 5; 0]$, $\;\theta^{(2)} = [0; 5; 0]$, $\;\theta^{(3)} = [0; 0; 5]$, $\;\theta^{(4)} = [0; 0; 5]$, $\;x^{(1)} = [?; ?; ?]$
Collaborative filtering optimization objective
• Given $x^{(1)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$:
  $\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
• Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, \ldots, x^{(n_m)}$:
  the symmetric objective, with the regularizer on the $x^{(i)}$ instead
• Minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:
  $J = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$
Collaborative filtering algorithm
• Initialize $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values
• Minimize $J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm), updating every $x_k^{(i)}$ and $\theta_k^{(j)}$
• For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$ (see the sketch below)
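A NumPy sketch of the joint minimization by gradient descent (the rating matrix Y, observation mask R, and hyperparameters are illustrative):

```python
import numpy as np

def collab_filter(Y, R, n_features=2, lam=0.1, alpha=0.005, num_iters=5000, seed=0):
    """Jointly learn movie features X and user parameters Theta by gradient
    descent on the regularized squared error over observed ratings (R == 1)."""
    rng = np.random.default_rng(seed)
    n_movies, n_users = Y.shape
    X = 0.1 * rng.standard_normal((n_movies, n_features))
    Theta = 0.1 * rng.standard_normal((n_users, n_features))
    for _ in range(num_iters):
        E = (X @ Theta.T - Y) * R                   # prediction errors on observed entries only
        X -= alpha * (E @ Theta + lam * X)          # gradient step for the movie features
        Theta -= alpha * (E.T @ X + lam * Theta)    # gradient step for the user parameters
    return X, Theta

# Toy usage: 3 movies x 2 users; entries with R = 0 are unobserved
Y = np.array([[5.0, 0.0], [4.0, 0.0], [0.0, 5.0]])
R = np.array([[1, 1], [1, 0], [0, 1]])
X, Theta = collab_filter(Y, R)
print(X @ Theta.T)  # predicted star ratings theta^T x for every (movie, user) pair
```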
Semi-supervised Learning
Problem Formulation
• Labeled data $\mathcal{D}_L = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m_L}$
• Unlabeled data $\mathcal{D}_U = \{x^{(i)}\}_{i=m_L+1}^{m_L+m_U}$ (typically far more plentiful than the labeled data)
• Goal: Learn a hypothesis $h$ (e.g., a classifier) that has small error on unseen examples
Deep Semi-supervised Learning
Ensemble methods
• Ensemble methods
  • Combine multiple classifiers to make a better one
  • Committees, majority vote
  • Weighted combinations
  • Can use the same or different classifiers
• Boosting
  • Train sequentially; later predictors focus on the mistakes made by earlier ones
• Boosting for classification (e.g., AdaBoost; sketch below)
  • Use the results of earlier classifiers to decide what to work on next
  • Weight hard examples more heavily so later classifiers focus on them
  • Example: Viola-Jones face detection
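A scikit-learn sketch of boosting (assuming scikit-learn is available); AdaBoostClassifier's default base learner is a depth-1 decision tree ("stump"), and the toy data are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Toy 1-D data
X = np.array([[0.0], [0.3], [0.6], [1.0]])
y = np.array([0, 0, 1, 1])

# Each new stump is trained with higher weights on the examples
# that the previous stumps misclassified.
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict([[0.2], [0.8]]))  # -> [0 1]
```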
Generative models
Simple Recurrent Network
Reinforcement learning

• Markov decision process


• Q-learning
• Policy gradient
Final exam sample questions
Conceptual questions
• [True/False] Increasing the value of k in a k-nearest neighbor classifier
will decrease its bias
• [True/False] Backpropagation helps neural network training get unstuck from a local minimum
• [True/False] Linear regression can be solved by either matrix algebra
or gradient descent
• [True/False] Logistic regression can be solved by either matrix algebra
or gradient descent
• [True/False] K-means clustering has a unique solution
• [True/False] PCA has a unique solution
Classification/Regression
• Given a simple dataset

• 1) Estimate the parameters

• 2) Compute training error

• 3) Compute leave-one-out cross-validation error

• 4) Compute testing error


Naïve Bayes
• Compute individual probabilities

• Compute the class posterior using the Naïve Bayes classifier
