Provable Non-Convex Optimization for ML
Prateek Jain
Microsoft Research India
http://research.microsoft.com/en-us/people/prajain/
Overview
• High-dimensional Machine Learning
• Very large number of parameters
• Impose structural assumptions
• Requires solving non-convex optimization
• In general NP-hard
• No provable generic optimization tools
Overview
• Most popular approach: convex relaxation
• Solvable in poly-time
• Guarantees under certain assumptions
• Slow in practice
Goal: Practical and Theoretically Provable Algorithms for High-d ML Problems
Learning in Large No. of Dimensions
[Figure: $f$ maps a $d$-dimensional feature vector, e.g., $(3, 0, 1, \ldots, 0, \ldots, 2, \ldots, 0, 9)$, to a prediction in {Learning, Optimization}]
Linear Model
• $f(x) = \sum_i w_i x_i = \langle w, x \rangle$
• $w$: $d$-dimensional vector
• No. of training samples:
• For bi-grams: documents!
• Prediction and storage: O(d)
• Prediction time per query: secs
• Over-fitting
Another Example: Low-rank Matrix Completion
• Task: Complete the ratings matrix
• No. of parameters:
Key Issues
• Large no. of training samples required
• Large training time
• Large storage and prediction time
Learning with Structure
• Restrict the parameter space
• Linear classification/regression:
• Restrict no. of non-zeros in $w$ to $s$
• Example: $w = (0, 3, 0, 0, 1, 0, 0, 0, 0, 9)$
• Say $s \ll d$
Learning with Structure contd…
• Matrix completion:
• $W \cong U \times V^T$, with $U$: $d \times r$ and $V$: $d \times r$
• $W$ characterized by $U$, $V$
• No. of variables: $2dr$ instead of $d^2$
Learning with Structure
• $\min_w L(w)$ s.t. $w \in C$, where $L$ is the data-fidelity function and $C$ the structured set
• Linear classification/regression
  • Comp. complexity: NP-hard
  • Non-convex
• Matrix completion
  • Comp. complexity: NP-hard
  • Non-convex
Other Examples
• Low-rank Tensor completion
  • Complexity: undecidable
  • Non-convex
• Robust PCA
  • Complexity: NP-hard
  • Non-convex
Convex Relaxations
• Linear classification/regression
  • $\tilde{C} = \{w : \|w\|_1 \le \lambda(s)\}$
• Matrix completion
  • $\tilde{C} = \{W : \|W\|_* \le \lambda(r)\}$
Convex Relaxations Contd…
• Robust PCA
  • $\tilde{C} = \{W : W = L + S,\ \|L\|_* \le \lambda(r),\ \|S\|_1 \le \lambda(s)\}$
Convex Relaxation
• Advantage:
• Convex optimization: Polynomial time
• Generic tools available for optimization
• Systematic analysis
• Disadvantage:
• Optimizes over a much bigger set
• Not scalable to large problems
This tutorial’s focus
Don’t Relax!
• Advantage: scalability
• Disadvantage: optimization and its analysis is much harder
• Local minima problems
• Two approaches:
• Projected gradient descent
• Alternating minimization
Approach 1: Projected Gradient Descent
• Iterate: $w_{t+1} = P_C\big(w_t - \eta \nabla L(w_t)\big)$, where $P_C$ is the projection onto the structured set $C$ (a minimal sketch follows)
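The slides leave the update implicit, so here is a minimal Python/NumPy sketch of the projected gradient descent template, assuming the caller supplies a gradient oracle grad and a projection project onto $C$; all names are illustrative, not from the original slides:

    import numpy as np

    def projected_gradient_descent(grad, project, w0, step=0.1, iters=100):
        # Gradient step on the data-fidelity function L, then
        # projection back onto the (possibly non-convex) set C.
        w = w0.copy()
        for _ in range(iters):
            w = project(w - step * grad(w))
        return w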
Efficient Projection
• Sparse linear regression/classification
• Projection = hard thresholding: keep the $s$ largest-magnitude entries, zero out the rest; computable in $O(d)$ time (sketch below)
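A sketch of the hard-thresholding projection $P_s$; the function name is illustrative. np.argpartition finds the top-$s$ indices in $O(d)$ time:

    import numpy as np

    def hard_threshold(z, s):
        # P_s(z): keep the s largest-magnitude entries of z, zero the rest.
        # This is an exact Euclidean projection onto the set of s-sparse vectors.
        w = np.zeros_like(z)
        idx = np.argpartition(np.abs(z), -s)[-s:]
        w[idx] = z[idx]
        return w

Plugging hard_threshold in as the projection, with grad(w) = X.T @ (X @ w - y) for squared loss, gives the iterative-hard-thresholding style method this tutorial builds on.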
Approach 2: Alternating Minimization
• $\min_{U, V} L(U V^T)$
• Alternating Minimization:
  • Fix $U$, optimize for $V$
  • Fix $V$, optimize for $U$; repeat
• Works well if each individual problem is "easy"
• Generic technique, e.g., EM algorithms (a sketch for matrix completion follows)
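A hedged sketch of alternating minimization for matrix completion (alternating least squares). The boolean mask of observed entries, the small ridge term reg, and the random initialization are illustrative assumptions, not the slides' exact algorithm:

    import numpy as np

    def altmin_completion(M, mask, r, iters=50, reg=1e-6):
        # Fix U and solve a small least-squares problem for each row of V
        # over the observed entries; then swap roles.
        m, n = M.shape
        rng = np.random.default_rng(0)
        U = rng.standard_normal((m, r))
        V = rng.standard_normal((n, r))
        for _ in range(iters):
            for j in range(n):                      # fix U, optimize V
                obs = mask[:, j]
                A = U[obs]
                V[j] = np.linalg.solve(A.T @ A + reg * np.eye(r), A.T @ M[obs, j])
            for i in range(m):                      # fix V, optimize U
                obs = mask[i, :]
                B = V[obs]
                U[i] = np.linalg.solve(B.T @ B + reg * np.eye(r), B.T @ M[i, obs])
        return U, V

Each inner step is an ordinary least-squares solve, which is exactly why alternating minimization is attractive: the joint problem is non-convex, but each subproblem is "easy".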
Results for Several Problems
• Sparse regression [Jain et al.’14, Garg and Khandekar’09]
• Sparsity
• Robust Regression [Bhatia et al.’15]
• Sparsity+output sparsity
• Dictionary Learning [Agarwal et al.’14]
• Matrix Factorization + Sparsity
• Phase Sensing [Netrapalli et al.’13]
• System of Quadratic Equations
• Vector-valued Regression [Jain & Tewari’15]
• Sparsity+positive definite matrix
Results Contd…
• Low-rank Matrix Regression [Jain et al.’10, Jain et al.’13]
• Low-rank structure
• Low-rank Matrix Completion [Jain & Netrapalli’15, Jain et al.’13]
• Low-rank structure
• Robust PCA [Netrapalli et al.’14]
• Low-rank Sparse Matrices
• Tensor Completion [Jain and Oh’14]
• Low-tensor rank
• Low-rank matrix approximation [Bhojanapalli et al.’15]
• Low-rank structure
Sparse Linear Regression
• $y = Xw$, where $X \in \mathbb{R}^{n \times d}$ and $n \ll d$
• But: $w$ is sparse ($s$ non-zeros)
Motivation: Single Pixel Camera
• Formulate as $C$ 1-vs-all binary problems
• Learn $w$ per class s.t. prediction is $\langle w, x \rangle$
• Imagenet has 20,000 categories
• Problem: Train 20,000 SVM’s
• Prediction time: $O(20000 \cdot d)$
• Sparsity: typically an image has only a few objects
Compressive Sensing of Labels
• $y = X \cdot \text{label}$, where $X$ is $100 \times 20000$
• $s$: number of non-zeros (in the label vector)
Non-convexity of Low-rank manifold
$\tfrac{1}{2}\begin{bmatrix}1 & 0\\ 0 & 0\end{bmatrix} + \tfrac{1}{2}\begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix} = \begin{bmatrix}0.5 & 0\\ 0 & 0.5\end{bmatrix}$
• The average of two rank-1 matrices can have rank 2, so the set of low-rank matrices is not convex
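A quick NumPy check of this example: both summands have rank 1, yet their average has rank 2, so the rank-1 set cannot be convex:

    import numpy as np

    A = np.array([[1., 0.], [0., 0.]])   # rank 1
    B = np.array([[0., 0.], [0., 1.]])   # rank 1
    C = 0.5 * A + 0.5 * B                # convex combination
    print(np.linalg.matrix_rank(A),
          np.linalg.matrix_rank(B),
          np.linalg.matrix_rank(C))      # prints: 1 1 2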
Convex Relaxation
• Relaxed problem:
Our Approach: Projected Gradient Descent
• Iterate: $w_{t+1} = P_s\big(w_t - \eta\, X^T (X w_t - y)\big)$
[Jain, Tewari, Kar’2014]
Projection onto the $\ell_0$ ball?
• $P_s(z)$: keep the $s$ largest-magnitude entries of $z$
Important Properties
A Stronger Result?
$\|P_s(z) - z\|_2^2 \le \dfrac{d - s}{d - s^*}\, \|P_{s^*}(z) - z\|_2^2$
• Since $s > s^*$, the factor $\frac{d-s}{d-s^*}$ is strictly less than 1: projecting onto the larger set loses less
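A quick numeric sanity check of this bound (the helper and the problem sizes are illustrative):

    import numpy as np

    def hard_threshold(z, s):
        # P_s(z): keep the s largest-magnitude entries of z.
        w = np.zeros_like(z)
        idx = np.argpartition(np.abs(z), -s)[-s:]
        w[idx] = z[idx]
        return w

    rng = np.random.default_rng(1)
    d, s_star, s = 1000, 10, 100
    z = rng.standard_normal(d)
    lhs = np.linalg.norm(hard_threshold(z, s) - z) ** 2
    rhs = (d - s) / (d - s_star) * np.linalg.norm(hard_threshold(z, s_star) - z) ** 2
    print(lhs <= rhs)                    # True: the bound holds for every z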
Our Approach: Projected Gradient Descent
• Iterate: $w_{t+1} = P_s\big(w_t - \eta \nabla L(w_t)\big)$
[Jain, Tewari, Kar’2014]
Convex-projections vs Non-convex
Projections
• For non-convex sets, we only have: $\|P_C(z) - z\| \le \|w - z\|$ for all $w \in C$ (the projection need not be non-expansive)
Proof under RIP
• Let $C$ be the ball of vectors with $s$ non-zeros
• Entries of $X$: 0-mean distribution with bounded fourth moment
• Sample complexity: $n = O\!\left(s \log \frac{d}{s}\right)$
Popular RIP Ensembles
• Random designs (e.g., i.i.d. sub-Gaussian entries) with $n = O\!\left(s \log \frac{d}{s}\right)$ rows satisfy, with high probability:
$(1 - \delta_s)\, \|w\|_2^2 \le \|Xw\|_2^2 \le (1 + \delta_s)\, \|w\|_2^2, \quad \forall\, w \in C$
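An empirical illustration (a spot check over random sparse vectors, not a certificate over all of $C$) that a Gaussian ensemble with $n$ on the order of $s \log(d/s)$ nearly preserves norms; the constant 4 is an arbitrary illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    d, s = 2000, 20
    n = int(4 * s * np.log(d / s))                # n = O(s log(d/s))
    X = rng.standard_normal((n, d)) / np.sqrt(n)  # scaled so E||Xw||^2 = ||w||^2

    ratios = []
    for _ in range(200):                          # random s-sparse test vectors
        w = np.zeros(d)
        idx = rng.choice(d, size=s, replace=False)
        w[idx] = rng.standard_normal(s)
        ratios.append(np.linalg.norm(X @ w) ** 2 / np.linalg.norm(w) ** 2)
    print(min(ratios), max(ratios))               # both close to 1: RIP-like behavior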
Proof?
But what if RIP is not possible?
Statistical Guarantees
•
•
• sparse
[Jain, Tewari, Kar’2014]
Proof?
General Result for Any Function
• $\min_w L(w)$ s.t. $\|w\|_0 \le s$
• $L$ satisfies RSC/RSS, i.e., $\frac{\alpha}{2}\|w - w'\|^2 \le L(w) - L(w') - \langle \nabla L(w'), w - w' \rangle \le \frac{\beta}{2}\|w - w'\|^2$ for all sparse $w, w'$
After steps
• If and
•,
Convex vs. Non-Convex:
• Sample complexity ($s$-sparse):
• Number of iterations:
$y = X w^* + b$
Typical b:
a) Deterministic error :
b) Gaussian error :
Robust Regression
•
• We want to be a constant
• Entries of can be unbounded!
• can be arbitrarily large
• Still we want:
RR Algorithm
• For $t = 0, 1, \ldots$ (a hedged sketch follows)
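The slides elide the algorithm body, so here is a hedged sketch in the spirit of hard-thresholding robust regression [Bhatia et al.’15]: alternately fit least squares on the points currently believed clean, then re-select the $n - k$ points with smallest residual ($k$ = assumed corruption budget). Details differ from the paper:

    import numpy as np

    def robust_regression(X, y, k, iters=20):
        # Alternate: (a) least squares on the current "clean" set,
        # (b) mark the k largest-residual points as corrupted.
        n = len(y)
        clean = np.arange(n)                    # start by trusting all points
        for _ in range(iters):
            w, *_ = np.linalg.lstsq(X[clean], y[clean], rcond=None)
            resid = np.abs(y - X @ w)
            clean = np.argsort(resid)[: n - k]  # keep n-k smallest residuals
        return w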
Dictionary Learning
• $A \cong$ (dictionary: $n \times r$) $\times$ ($r$-dim, $k$-sparse vector)
• Overcomplete dictionaries: $r > n$
• Goal: Given $A$, compute the dictionary and the sparse codes
• Using a small number of samples (a sketch follows)
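A hedged sketch of the alternating-minimization skeleton for dictionary learning. The crude sparse-coding step (least squares followed by per-sample hard thresholding) and the random initialization are illustrative simplifications; [AAJNT’13] instead uses a clustering+SVD initialization and a more careful coding step:

    import numpy as np

    def dict_learning(A, r, k, iters=30):
        # A: n x m data matrix (m samples), r atoms, k non-zeros per sample.
        n, m = A.shape
        rng = np.random.default_rng(0)
        D = rng.standard_normal((n, r))
        D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
        for _ in range(iters):
            Z, *_ = np.linalg.lstsq(D, A, rcond=None)       # crude code step
            kth = -np.sort(-np.abs(Z), axis=0)[k - 1]       # k-th largest per column
            Z[np.abs(Z) < kth] = 0.0                        # keep ~k entries per sample
            Dt, *_ = np.linalg.lstsq(Z.T, A.T, rcond=None)  # dictionary step
            D = Dt.T
            D /= np.linalg.norm(D, axis=0) + 1e-12
        return D, Z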
Existing Results
• Generalization error bounds [VMB’11, MPR’12, MG’13, TRS’13]
• But assumes that the optimal solution is reached
• Do not cover exact recovery with finitely many samples
• Identifiability of [HS’11]
• Require exponentially many samples
• Exact recovery [SWW’12]
• Restricted to square dictionary
• In practice, overcomplete dictionary is more useful
Generating Model
• Generate dictionary
• Assume to be incoherent, i.e.,
• Initialize
• Using clustering+SVD method of [AAN’13] or [AGM’13]
Results [AAJNT’13]
• Assumptions:
• is incoherent ()
• Use randomness in
Simulations
Empirically:
Known result:
Summary
• Consider high-dimensional structured problems
• Sparsity
• Block sparsity
• Tree-based sparsity
• Error sparsity
• Iterative hard thresholding style method
• Practical/easy to implement
• Fast convergence
• RIP/RSC/subGaussian data: Provable guarantees
http://research.microsoft.com/en-us/people/prajain/
Collaborators: Purushottam Kar, Kush Bhatia
Next Lecture
• Low-rank Structure
• Matrix Regression
• Matrix Completion
• Robust PCA
• Low-rank Tensor Structure
• Tensor completion
Block-sparse Signals
•
• Total no. of measurements:
• Correlated signals:
• Method: group norms ($\ell_1/\ell_2$ or $\ell_1/\ell_\infty$); the non-convex analogue is block hard thresholding (sketch below)
• Improvement in sample complexity if
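For completeness, a sketch of the corresponding non-convex projection, block hard thresholding: keep the $b$ blocks of largest Euclidean norm. It assumes, for illustration, that the dimension divides evenly into fixed-size blocks:

    import numpy as np

    def block_hard_threshold(z, block_size, b):
        # Keep the b blocks of z with largest l2 norm, zero the rest.
        blocks = z.reshape(-1, block_size)
        norms = np.linalg.norm(blocks, axis=1)
        keep = np.argpartition(norms, -b)[-b:]   # indices of top-b blocks
        out = np.zeros_like(blocks)
        out[keep] = blocks[keep]
        return out.reshape(z.shape)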