
Provable Non-convex Optimization for ML
Prateek Jain
Microsoft Research India

http://research.microsoft.com/en-us/people/prajain/
Overview
• High-dimensional Machine Learning
• Very large number of parameters
• Impose structural assumptions
• Requires solving non-convex optimization
• In general NP-hard
• No provable generic optimization tools

Overview
• Most popular approach: convex relaxation
• Solvable in poly-time
• Guarantees under certain assumptions
• Slow in practice

Theoretically Provable, Practical Algorithms for High-d ML Problems

Learning in Large No. of Dimensions

[Figure: a data point as a d-dimensional, mostly-zero feature vector (0 3 0 1 … 2 … 0 9) fed into a function f; goal: {Learning, Optimization} in d dimensions]
Linear Model

• f(x) = Σ_i w_i x_i = ⟨w, x⟩
• w: a d-dimensional vector
• No. of training samples:
• For bi-grams: documents!
• Prediction and storage: O(d)
• Prediction time per query: secs
• Over-fitting
Another Example: Low-rank Matrix Completion

• Task: Complete the ratings matrix
• No. of Parameters:

Key Issues
• Large no. of training samples required
• Large training time
• Large storage and prediction time

Learning with Structure
•  Restrict the parameter space
• Linear classification/regression:
• Restrict the no. of non-zeros in w to s
0 3 0 0 1 0 0 0 0 9

• Say

• Need to learn only s parameters

Learning with Structure contd…
• Matrix completion:
• W ≅ U × Vᵀ, with U: d × r and V: d × r
• W: characterized by U, V
• No. of variables: 2dr instead of d²
Learning with Structure: Data Fidelity Function
• min_w L(w)
• Linear classification/regression
• Comp. Complexity: NP-Hard
• Non-convex

• Matrix completion
• Comp. Complexity: NP-Hard
• Non-convex
Other Examples
• Low-rank Tensor completion
• Complexity: undecidable
• Non-convex

• Robust PCA
• Complexity: NP-Hard
• Non-convex
Convex Relaxations
•  Linear classification/regression

• C̃ = {w : ‖w‖₁ ≤ λ(s)}

• Matrix completion
• C̃ = {W : ‖W‖∗ ≤ λ(r)}

Convex Relaxations Contd…

•  Low-rank Tensor completion


• C̃ = {W : ‖W‖∗ ≤ λ(r)}

• Robust PCA

• C̃ = {W : W = L + S, ‖L‖∗ ≤ λ(r), ‖S‖₁ ≤ λ(s)}

Convex Relaxation
• Advantage:
• Convex optimization: Polynomial time
• Generic tools available for optimization
• Systematic analysis

• Disadvantage:
• Optimizes over a much bigger set
• Not scalable to large problems

This tutorial’s focus
Don’t Relax!
• Advantage: scalability
• Disadvantage: optimization and its analysis are much harder
• Local minima problems

• Two approaches:
• Projected gradient descent
• Alternating minimization

Approach 1: Projected Gradient Descent
• w_{t+1} = P_C(w_t − η ∇L(w_t)), where P_C is the projection onto the constraint set C
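A minimal sketch of this projected gradient descent template, assuming a user-supplied gradient grad_L and a projection project onto the (possibly non-convex) constraint set C; the function names, step size, and iteration count are illustrative, not taken from the slides.

```python
import numpy as np

def projected_gradient_descent(grad_L, project, w0, step_size=0.1, n_iters=100):
    """Generic projected gradient descent: w <- P_C(w - eta * grad L(w)).

    grad_L  : callable returning the gradient of the loss at w
    project : projection onto the constraint set C (may be non-convex)
    """
    w = w0.copy()
    for _ in range(n_iters):
        w = project(w - step_size * grad_L(w))
    return w
```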

Efficient Projection
• Sparse linear regression/classification: hard thresholding (keep the s largest-magnitude entries)

• Low-rank Matrix completion



• SVD (top-r singular components)
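Hedged sketches of the two projections named above: hard thresholding onto s-sparse vectors and truncated SVD onto rank-r matrices; function names are placeholders.

```python
import numpy as np

def project_sparse(z, s):
    """Keep the s largest-magnitude entries of z, zero out the rest (hard thresholding)."""
    w = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]        # indices of the s largest |z_i|
    w[idx] = z[idx]
    return w

def project_low_rank(Z, r):
    """Project a matrix onto the set of rank-<=r matrices via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * sigma[:r]) @ Vt[:r, :]
```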

Approach 2: Alternating Minimization
• Write W = U × Vᵀ and optimize over U and V
• Alternating Minimization:
• Fix U, optimize for V

• Fix V, optimize for U

• Generic technique
• If each individual problem is “easy”
• Generic technique, e.g., EM algorithms
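A minimal alternating least squares sketch for the matrix completion instance of this idea (fix V, solve for U row by row; fix U, solve for V column by column). The boolean mask of observed entries, the small ridge term reg, and the random initialization are illustrative choices, not the exact procedure analyzed in the cited papers.

```python
import numpy as np

def altmin_matrix_completion(M, mask, r, n_iters=25, reg=1e-6):
    """Alternating least squares for matrix completion; mask marks observed entries."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    for _ in range(n_iters):
        for i in range(m):                   # update U with V fixed
            obs = mask[i]
            Vi = V[obs]                      # rows of V for observed columns of row i
            U[i] = np.linalg.solve(Vi.T @ Vi + reg * np.eye(r), Vi.T @ M[i, obs])
        for j in range(n):                   # update V with U fixed
            obs = mask[:, j]
            Uj = U[obs]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg * np.eye(r), Uj.T @ M[obs, j])
    return U, V                              # completed matrix: U @ V.T
```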
Results for Several Problems
• Sparse regression [Jain et al.’14, Garg and Khandekar’09]
• Sparsity
• Robust Regression [Bhatia et al.’15]
• Sparsity+output sparsity
• Dictionary Learning [Agarwal et al.’14]
• Matrix Factorization + Sparsity
• Phase Sensing [Netrapalli et al.’13]
• System of Quadratic Equations
• Vector-valued Regression [Jain & Tewari'15]
• Sparsity+positive definite matrix
Results Contd…
• Low-rank
  Matrix Regression [Jain et al.’10, Jain et al.’13]
• Low-rank structure
• Low-rank Matrix Completion [Jain & Netrapalli’15, Jain et al.’13]
• Low-rank structure
• Robust PCA [Netrapalli et al.’14]
• Low-rank Sparse Matrices
• Tensor Completion [Jain and Oh’14]
• Low-tensor rank
• Low-rank matrix approximation [Bhojanapalli et al.’15]
• Low-rank structure
Sparse Linear Regression
 
• y = Xw  (X: n × d design matrix, w: d-dimensional)
• But: w is sparse (s non-zeros)
Motivation: Single Pixel Camera

• For a 1-megapixel image, 1 million measurements would be required

Picture taken from Baraniuk et al.


Sparsity

• Most images are sparse in wavelet transform space


• Typically, only around 2.5% of the coefficients are significant

Picture taken from Baraniuk et al.


Motivation: Multi-label Classification

 
• Formulate as C 1-vs-all binary problems
• Learn s.t. prediction is
• ImageNet has 20,000 categories
• Problem: Train 20,000 SVMs
• Prediction time: O(20,000 d)
Sparsity
• Typically an image contains only a few objects, so its label vector is sparse
Compressive Sensing of Labels
 
• y = X × (label vector), with X: 100 × 20000

  • Learn classifiers/regression functions


• Use recovery algorithms to map back to the label space
• Proposed by Hsu et al. and later pursued by several works
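A toy sketch of the compress-then-learn pipeline described above (in the spirit of Hsu et al.): labels are compressed with a random matrix, regressors would be trained to predict the compressed labels from features (omitted here), and a simple sparse decoder maps predictions back to the label space. The names and the simplistic decoder are illustrative assumptions.

```python
import numpy as np

def encode_labels(Y, m, rng):
    """Compress L-dimensional sparse label vectors into m << L random measurements."""
    L = Y.shape[1]
    A = rng.standard_normal((m, L)) / np.sqrt(m)   # random sensing matrix
    return A, Y @ A.T                              # regressors are trained to predict these

def decode_labels(A, z_pred, k):
    """Toy decoder: pick the k labels most correlated with the predicted measurements,
    then refit by least squares on that support. In practice a full sparse recovery
    routine (e.g., the IHT sketch shown later) would be used instead."""
    scores = A.T @ z_pred
    support = np.argsort(np.abs(scores))[-k:]
    coef, *_ = np.linalg.lstsq(A[:, support], z_pred, rcond=None)
    y = np.zeros(A.shape[1])
    y[support] = coef
    return y
```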
Sparse Linear Regression
• min_w ‖y − Xw‖²  s.t.  ‖w‖₀ ≤ s

• ‖w‖₀: number of non-zeros

• NP-hard problem in general 


• ‖w‖₀: a non-convex function

Non-convexity of Low-rank manifold

0.5 · (1, 0, 0)ᵀ + 0.5 · (0, 1, 0)ᵀ = (0.5, 0.5, 0)ᵀ
(the average of two 1-sparse vectors is 2-sparse, so the constraint set is not convex)
Convex Relaxation
• Relaxed problem: min_w ‖y − Xw‖²  s.t.  ‖w‖₁ ≤ λ

• The ℓ₁ norm is known to promote sparsity


• Pros: a) Principled approach, b) Captures correlations between features
• Cons: Slow to optimize

Our Approach: Projected Gradient Descent
• w_{t+1} = P_s(w_t − η ∇L(w_t)), where P_s keeps the s largest-magnitude entries of its argument

[Jain, Tewari, Kar’2014]
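A minimal sketch of this projected gradient / iterative hard thresholding update for the least squares loss (the analysis in [Jain, Tewari, Kar'2014] allows projecting onto a sparsity level larger than the true one); the step size and iteration count below are conservative, illustrative defaults.

```python
import numpy as np

def iht_sparse_regression(X, y, s, n_iters=200, eta=None):
    """Iterative hard thresholding for min ||y - Xw||^2 s.t. ||w||_0 <= s."""
    n, d = X.shape
    if eta is None:
        eta = 1.0 / np.linalg.norm(X, 2) ** 2      # conservative step: 1 / ||X||_2^2
    w = np.zeros(d)
    for _ in range(n_iters):
        w = w + eta * X.T @ (y - X @ w)            # gradient step on 0.5 * ||y - Xw||^2
        idx = np.argsort(np.abs(w))[:-s]
        w[idx] = 0                                 # hard threshold onto the s-sparse set
    return w
```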
Projection onto the ℓ₀ ball?
• P_s(z) = argmin_{‖w‖₀ ≤ s} ‖w − z‖₂, computed by keeping the s largest-magnitude entries of z
Important Properties
A Stronger Result?
• ‖P_s(z) − z‖² ≤ (d − s)/(d − s*) · ‖P_{s*}(z) − z‖²   (for s ≥ s*)
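A small numerical sanity check of the inequality above on random vectors; the dimension and sparsity levels are chosen arbitrarily.

```python
import numpy as np

def hard_threshold(z, s):
    w = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]
    w[idx] = z[idx]
    return w

# Check ||P_s(z) - z||^2 <= (d - s)/(d - s*) * ||P_{s*}(z) - z||^2 for s >= s*.
rng = np.random.default_rng(0)
d, s_star, s = 1000, 20, 100
for _ in range(1000):
    z = rng.standard_normal(d)
    lhs = np.sum((hard_threshold(z, s) - z) ** 2)
    rhs = (d - s) / (d - s_star) * np.sum((hard_threshold(z, s_star) - z) ** 2)
    assert lhs <= rhs + 1e-12
print("inequality held on all random trials")
```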
Convex Projections vs Non-convex Projections
•  For non-convex sets, we only have:

• 0-th order condition: ‖P_C(z) − z‖ ≤ ‖w − z‖ for all w ∈ C


• But, for projection onto a convex set C:

• 1st order condition: ⟨z − P_C(z), w − P_C(z)⟩ ≤ 0 for all w ∈ C

• Is the 0-th order condition sufficient for convergence of Proj. Grad. Descent?


• In general, NO 
• But, for certain specially structured problems, YES!!!
Convex Projected Gradient Descent: Proof?
•  Let
• Let
• Let
• convex set and
Restricted Isometry Property (RIP)
• X satisfies RIP if, for all sparse vectors w, X acts as an isometry
• Formally: for all s-sparse w,
(1 − δ_s)·‖w‖² ≤ ‖Xw‖² ≤ (1 + δ_s)·‖w‖²
Proof under RIP
•  Let
• Let
• Let
• ball with non-zeros and

[Blumensath & Davies’09, Garg & Khandekar’09]


Variations
• Fully corrective version: at each step, fully re-solve the least squares problem over the currently selected support

• Two-stage algorithms: expand the support, optimize, then truncate back to s coordinates


Summary so far…
•  High-dimensional problems

• Need to impose structure on the parameters


• Sparsity
• Projection easy!
• Projected Gradient works (if RIP is satisfied)
• Several variants exist
Which Matrices Satisfy RIP?
(1 − δ_s)·‖w‖² ≤ ‖Xw‖² ≤ (1 + δ_s)·‖w‖², for all ‖w‖₀ ≤ s
•  Several ensembles of random matrices


• Large enough n
• For example:

• 0-mean distribution
• Bounded fourth moment
 
• X: n × d matrix, with n = O(s·log(d/s))
Popular RIP Ensembles
 
• X: n × d matrix, with n = O(s·log(d/s))
• Most popular examples: i.i.d. Gaussian entries, i.i.d. Rademacher (±1) entries, randomly subsampled Fourier/Hadamard rows


Proof of RIP for Gaussian Ensemble
• X with i.i.d. N(0, 1/n) entries (i.e., N(0, 1) entries scaled by 1/√n)

• Then, X satisfies RIP at sparsity level s with constant δ_s (w.h.p., for n = O(s·log(d/s)))
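A quick empirical illustration (not a proof) of this isometry behaviour for the Gaussian ensemble: it only samples random s-sparse vectors rather than bounding the ratio over all of them, and n, d, s are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 200, 1000, 10
X = rng.standard_normal((n, d)) / np.sqrt(n)       # i.i.d. N(0, 1/n) entries

ratios = []
for _ in range(2000):
    support = rng.choice(d, size=s, replace=False)
    w = np.zeros(d)
    w[support] = rng.standard_normal(s)
    ratios.append(np.sum((X @ w) ** 2) / np.sum(w ** 2))

print("min ratio:", min(ratios), "max ratio:", max(ratios))   # both should be close to 1
```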


Other structures?
• Group sparsity
• Tree sparsity
• Union of subspaces (polynomially many subspaces)

• Projection easy for each one of these problems


• Gaussian matrices satisfy RIP over these sets (because they are unions of a small number of subspaces)
General Result
•  Let
• Let
• C: any non-convex set such that

(1 − δ_s)·‖w‖² ≤ ‖Xw‖² ≤ (1 + δ_s)·‖w‖², for all w ∈ C
Proof?
But what if RIP is not possible?
Statistical Guarantees
• 

• sparse

[Jain, Tewari, Kar’2014]
Proof?
General Result for Any Function
• min_w L(w)  s.t.  ‖w‖₀ ≤ s, for a general loss L
• L satisfies RSC/RSS, i.e., restricted strong convexity and restricted smoothness over sparse directions

• IHT and several similar algorithms guarantee accurate recovery:

• After O(log(1/ε)) steps
• If the projection sparsity s is chosen suitably larger than the true sparsity

[Jain, Tewari, Kar’2014]
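A sketch of the same hard-thresholding template applied to a generic smooth loss, with the logistic loss as one example of a function to which RSC/RSS-style analyses apply; the gradient helper, step size, and iteration count are illustrative assumptions rather than the tuned settings of the cited paper.

```python
import numpy as np

def iht_general_loss(grad_L, d, s, eta=0.1, n_iters=300):
    """IHT for a generic smooth loss: gradient step followed by hard thresholding."""
    w = np.zeros(d)
    for _ in range(n_iters):
        w = w - eta * grad_L(w)
        idx = np.argsort(np.abs(w))[:-s]
        w[idx] = 0
    return w

def logistic_grad(X, y):
    """Gradient of the average logistic loss, with labels y in {-1, +1}."""
    def grad(w):
        margins = y * (X @ w)
        return -(X.T @ (y / (1.0 + np.exp(margins)))) / X.shape[0]
    return grad

# usage sketch: w_hat = iht_general_loss(logistic_grad(X, y), d=X.shape[1], s=20)
```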


Theory and Practice
[Figure: empirical comparison of the convex relaxation and the non-convex (IHT) approach in terms of the number of iterations needed to recover a sparse w]

[Jain, Tewari, Kar'2014]

Summary so far…
•  High-dimensional problems

• Need to impose structure on the parameters


• Sparsity
• Projection easy!
• Projected Gradient works (if RIP is satisfied)
• Several variants exist
• RIP/RSC-style proofs work for sub-Gaussian data
• Other structures also allowed
Robust Regression
 
• y = Xw + b  (X: n × d design matrix, b: corruption vector)
Typical b:
a) Deterministic error
b) Gaussian error
Robust Regression
• 
• We want to tolerate a constant fraction of corruptions
• Entries of b can be unbounded!
• The corruptions can be arbitrarily large

• Still we want:
RR Algorithm
• 
• For t=0, 1, ….

• The algorithm was (vaguely) proposed by Legendre in 1805
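A minimal sketch of the "fit, then trust only the points with small residuals" alternation described on this slide, in the spirit of the hard-thresholding robust regression of Bhatia et al.'15; the parameterization and stopping rule are simplified.

```python
import numpy as np

def robust_regression_trimmed(X, y, n_corrupt, n_iters=30):
    """Alternate between least squares on an 'active set' and re-selecting the active
    set as the points with the smallest residuals; n_corrupt upper-bounds the number
    of corrupted responses."""
    n, d = X.shape
    active = np.arange(n)                            # start by trusting every point
    w = np.zeros(d)
    for _ in range(n_iters):
        w, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
        residuals = np.abs(y - X @ w)
        active = np.argsort(residuals)[: n - n_corrupt]   # keep the smallest residuals
    return w
```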


Result
Proof?
Empirical Results
One-bit Compressive Sensing
• Compressed sensing: observe y = Xw

• Requires knowing y exactly

• In practice, measurements have a finite-bit representation, so some quantization is required
• One-bit CS: extreme quantization, y = sign(Xw)

• Easily implementable through comparators


• Results in two categories:
• Support Recovery [HB12, GNJN13]
• Approximate Recovery [PV12, GNJN13]
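A toy illustration of the one-bit measurement model together with a simple hard-thresholded correlation estimator, a standard baseline in this literature rather than the specific algorithms cited above; all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 2000, 500, 10
w_star = np.zeros(d)
w_star[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
w_star /= np.linalg.norm(w_star)                 # only the direction is recoverable

X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                          # one-bit measurements

w_hat = X.T @ y / n                              # correlation estimator
idx = np.argsort(np.abs(w_hat))[:-s]
w_hat[idx] = 0                                   # keep the s largest entries
w_hat /= np.linalg.norm(w_hat)

print("cosine similarity with w_star:", float(w_hat @ w_star))
```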
Phase-Retrieval
• Another extreme: observe only magnitudes, y = |Xw|

• Useful in several imaging applications


• A field in itself
• Ideas from sparse-vector and low-rank matrix estimation [C12, NJS13]
Dictionary Learning

[Figure: a data point y ≅ A × x, where A is the dictionary and x is a k-sparse coefficient vector]

Dictionary Learning
• Y ≅ A × X, where Y: d × n (data), A: d × r (dictionary), X: r × n (coefficients, each column k-sparse)
• Overcomplete dictionaries: r > d
• Goal: Given Y, compute A (and X)
• Using a small number of samples
Existing Results
•  Generalization error bounds [VMB’11, MPR’12, MG’13, TRS’13]
• But assume that the optimal solution is reached
• Do not cover exact recovery with finitely many samples
• Identifiability of A [HS'11]
• Require exponentially many samples
• Exact recovery [SWW’12]
• Restricted to square dictionary
• In practice, overcomplete dictionary is more useful
Generating Model
• Generate dictionary A
• Assume A to be incoherent, i.e., its columns are nearly orthogonal

• Generate random coefficient samples x_i

• Each x_i is k-sparse
• Generate observations: y_i = A·x_i
Algorithm
• Typical practical algorithm: alternating minimization (see the sketch below)

• Initialize A
• Using the clustering+SVD method of [AAN'13] or [AGM'13]
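A compact sketch of the alternating minimization loop named above, using OMP as the sparse coding step and a plain least squares dictionary update; the random initialization stands in for the clustering+SVD initialization of [AAN'13]/[AGM'13], and all parameters are illustrative.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedy k-sparse code x with y ~ A x."""
    r = A.shape[1]
    support, resid = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ resid)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        resid = y - A[:, support] @ coef
    x = np.zeros(r)
    x[support] = coef
    return x

def dictionary_altmin(Y, r, k, n_iters=20):
    """Alternate between sparse-coding each column of Y and refitting the dictionary."""
    d, n = Y.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((d, r))
    A /= np.linalg.norm(A, axis=0)               # unit-norm atoms (random init)
    for _ in range(n_iters):
        X = np.column_stack([omp(A, Y[:, i], k) for i in range(n)])
        A = Y @ np.linalg.pinv(X)                # least squares dictionary update
        A /= np.linalg.norm(A, axis=0) + 1e-12   # re-normalize the atoms
    return A, X
```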
Results [AAJNT’13]
•  Assumptions:
• A is incoherent
• Sparsity k is small enough (a better result is given by AGM'13)
• After T steps of AltMin:


Proof Sketch
•  Initialization step ensures that:

• Lower bound on each non-zero element of X + the above bound:

• X is recovered exactly
• Robustness of compressive sensing!
• A can be expressed exactly as:

• Use randomness in X
Simulations

• Empirically:
• Known result:
Summary
• Consider high-dimensional structured problems
• Sparsity
• Block sparsity
• Tree-based sparsity
• Error sparsity
• Iterative hard thresholding style method
• Practical/easy to implement
• Fast convergence
• RIP/RSC/subGaussian data: Provable guarantees

http://research.microsoft.com/en-us/people/prajain/
Purushottam Kar, PostDoc, MSR India
Kush Bhatia, Research Fellow, MSR India
Ambuj Tewari, Asst. Prof., Univ of Michigan
Next Lecture
• Low-rank Structure
• Matrix Regression
• Matrix Completion
• Robust PCA
• Low-rank Tensor Structure
• Tensor completion
Block-sparse Signals
• 
• Total no. of measurements:
• Correlated signals:
• Method: group norms (e.g., ℓ1/ℓ2 or ℓ1/ℓ∞)
• Improvement in sample complexity if
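For block-sparse signals the hard-thresholding projection acts on whole blocks rather than individual coordinates; a minimal sketch, with the block partition and names assumed for illustration.

```python
import numpy as np

def project_block_sparse(z, blocks, k):
    """Keep the k blocks of z with the largest l2 norm, zero out the rest.

    blocks: a list of index arrays partitioning {0, ..., d-1}."""
    norms = np.array([np.linalg.norm(z[b]) for b in blocks])
    keep = np.argsort(norms)[-k:]
    w = np.zeros_like(z)
    for b in keep:
        w[blocks[b]] = z[blocks[b]]
    return w
```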
