Stochastic Gradient Methods
Introduction
Supervised machine learning and convex optimization
Beyond the separation of statistics and optimization
Stochastic approximation
Goal: minimizing a function $f$ defined on a Hilbert space $\mathcal{H}$, given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathcal{H}$
Stochastic approximation
Observation of $f_n'(\theta_n) = f'(\theta_n) + \varepsilon_n$
$\varepsilon_n$ = additive noise (typically i.i.d.)
May only observe a quantity which is positively correlated with $f'(\theta_n)$
SGD recursion: $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$
Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$
Which learning rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$ (a toy implementation sketch follows)
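A minimal Python sketch of this recursion with Polyak-Ruppert averaging; the gradient oracle `noisy_grad`, the constants `C` and `alpha`, and the least-squares toy problem are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd_with_averaging(stochastic_grad, theta0, n_iters, C=1.0, alpha=0.5):
    """SGD recursion theta_n = theta_{n-1} - gamma_n * f_n'(theta_{n-1}),
    with gamma_n = C * n**(-alpha), plus a running Polyak-Ruppert average."""
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()             # running average of theta_0, ..., theta_n
    for n in range(1, n_iters + 1):
        gamma_n = C * n ** (-alpha)      # classical schedule gamma_n = C n^{-alpha}
        theta = theta - gamma_n * stochastic_grad(theta)
        theta_bar += (theta - theta_bar) / (n + 1)   # online mean update
    return theta, theta_bar

# Toy usage: unbiased gradients of a least-squares objective
rng = np.random.default_rng(0)
theta_star = np.ones(5)

def noisy_grad(theta):
    x = rng.normal(size=5)
    y = x @ theta_star + 0.1 * rng.normal()
    return 2 * (x @ theta - y) * x       # unbiased estimate of f'(theta)

last_iterate, averaged = sgd_with_averaging(noisy_grad, np.zeros(5),
                                            n_iters=10_000, C=0.2, alpha=0.5)
```

As the take-home message below states, $\alpha = 1/2$ with averaging is the robust default.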
Convex stochastic approximation
Key properties of $f$ and/or $f_n$
Smoothness: $f$ $B$-Lipschitz continuous, $f'$ $L$-Lipschitz continuous
Strong convexity: $f$ $\mu$-strongly convex (standard definitions recalled below)
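For reference, the standard definitions behind these properties (recalled here; not spelled out in the slides):

```latex
% B-Lipschitz continuity of f, L-Lipschitz continuity of f' (smoothness),
% and mu-strong convexity, for all theta, eta in H
\[
|f(\theta) - f(\eta)| \le B\,\|\theta - \eta\|, \qquad
\|f'(\theta) - f'(\eta)\| \le L\,\|\theta - \eta\|,
\]
\[
f(\eta) \ge f(\theta) + \langle f'(\theta),\, \eta - \theta\rangle
        + \frac{\mu}{2}\,\|\eta - \theta\|^2 .
\]
```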
Machine learning/optimization
Known minimax rates of convergence (Nemirovski and Yudin, 1983; Agarwal et al., 2010)
Strongly convex: $O(n^{-1})$
Non-strongly convex: $O(n^{-1/2})$
Achieved with and/or without averaging (up to logarithmic terms)
Non-asymptotic analysis (high-probability bounds)
Online setting and regret bounds
Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
Convex stochastic approximation
Related work
Stochastic approximation
Asymptotic analysis
Non-convex case with strong convexity around the optimum
$\gamma_n = C n^{-\alpha}$ with $\alpha = 1$ is not robust to the choice of $C$
$\alpha \in (1/2, 1)$ is robust with averaging
Broadie et al. (2009); Kushner and Yin (2003); Kulchitski and Mozgovo (1991); Polyak and Juditsky (1992); Ruppert (1988); Fabian (1968)
Problem set-up - General assumptions
Variance of estimates: there exists $\sigma^2 \ge 0$ such that for all $n \ge 1$, $\mathbb{E}(\|f_n'(\theta^*)\|^2 \mid \mathcal{F}_{n-1}) \le \sigma^2$, w.p. 1, where $\theta^*$ is a global minimizer of $f$
Proof technique
Derive a deterministic recursion for $\delta_n = \mathbb{E}\|\theta_n - \theta^*\|^2$ (one such recursion is sketched below)
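A sketch of one such recursion, under the additional assumption that each $f_n$ is $L$-smooth (the constants here are mine and may differ from the exact analysis): expanding the square in the SGD recursion and taking conditional expectations gives

```latex
% deterministic recursion on delta_n = E ||theta_n - theta^*||^2
\[
\delta_n \;\le\; \bigl(1 - 2\mu\gamma_n + 2L^2\gamma_n^2\bigr)\,\delta_{n-1}
          \;+\; 2\sigma^2\gamma_n^2 ,
\]
```

using strong convexity, $\langle f'(\theta), \theta - \theta^*\rangle \ge \mu\|\theta - \theta^*\|^2$, and the bound $\|f_n'(\theta)\|^2 \le 2\|f_n'(\theta) - f_n'(\theta^*)\|^2 + 2\|f_n'(\theta^*)\|^2$.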
Take-home message
Use $\alpha = 1/2$ with averaging to be adaptive to strong convexity
Conclusions / Extensions
Stochastic approximation for machine learning
Mixing convex optimization and statistics
Non-asymptotic analysis through moment computations
Averaging with longer steps is (more) robust and adaptive
Bounded gradient assumption leads to better rates
Introduction
Supervised machine learning and convex optimization
Beyond the separation of statistics and optimization
Going beyond a single pass over the data
Stochastic approximation
Assumes infinite data stream
Observations are used only once
Directly minimizes the testing cost $\mathbb{E}_z\,\ell(\theta, z)$
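For context before the analysis (the recap wording is mine; the update is the one introduced by Le Roux, Schmidt, and Bach, 2012): with a finite training set of size $n$ the objective becomes the finite sum $g(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$, and the stochastic average gradient (SAG) iteration stores one gradient per data point and steps along the average of the stored gradients:

```latex
% SAG update: refresh one stored gradient per iteration
\[
\theta_t = \theta_{t-1} - \frac{\gamma_t}{n}\sum_{i=1}^{n} y_i^t,
\qquad
y_i^t =
\begin{cases}
f_i'(\theta_{t-1}) & \text{if } i = i_t \text{ (the index sampled at time } t\text{)},\\
y_i^{t-1} & \text{otherwise.}
\end{cases}
\]
```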
Stochastic average gradient
Convergence analysis - I
Constant step size $\gamma_t = \frac{1}{2nL}$:
$\mathbb{E}\left[\|\theta_t - \theta^*\|^2\right] \le \left(1 - \frac{\mu}{8Ln}\right)^t \left[3\|\theta_0 - \theta^*\|^2 + \frac{9\sigma^2}{4L^2}\right]$
Linear rate with iteration cost independent of $n$ ...
... but, same behavior as batch gradient and IAG (cyclic version)
Proof technique
Designing a quadratic Lyapunov function for an $n$-th order non-linear stochastic dynamical system
Stochastic average gradient
Convergence analysis - II
Assume that each $f_i$ is $L$-smooth and $\frac{1}{n}\sum_{i=1}^n f_i$ is $\mu$-strongly convex
Constant step size $\gamma_t = \frac{1}{2n\mu}$; if $\frac{\mu}{L} \ge \frac{8}{n}$:
$\mathbb{E}\left[f(\bar\theta_t) - f(\theta^*)\right] \le C \left(1 - \frac{1}{8n}\right)^t$
with $C = \frac{16L}{3n}\|\theta_0 - \theta^*\|^2 + \frac{4\sigma^2}{3n\mu}\left(8\log\left(1 + \frac{\mu n}{4L}\right) + 1\right)$
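A minimal runnable sketch of the SAG update above; the per-example gradient oracle, step size, and least-squares toy are illustrative assumptions, not code from the slides:

```python
import numpy as np

def sag(grad_i, theta0, n, n_iters, step, seed=0):
    """Stochastic average gradient for (1/n) * sum_i f_i(theta).

    Keeps y[i], the last gradient computed for f_i, and steps along
    the average of the stored gradients (constant step size)."""
    theta = np.asarray(theta0, dtype=float).copy()
    y = np.zeros((n, theta.shape[0]))   # stored per-example gradients
    g_sum = np.zeros_like(theta)        # running sum of stored gradients
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        i = rng.integers(n)             # sample one data point
        g_new = grad_i(theta, i)
        g_sum += g_new - y[i]           # refresh the stored gradient for i
        y[i] = g_new
        theta -= (step / n) * g_sum     # theta_t = theta_{t-1} - (gamma/n) sum_i y_i^t
    return theta

# Toy usage: least squares, f_i(theta) = 0.5 * (x_i^T theta - y_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y_obs = X @ np.ones(5) + 0.1 * rng.normal(size=100)
grad = lambda theta, i: (X[i] @ theta - y_obs[i]) * X[i]
theta_hat = sag(grad, np.zeros(5), n=100, n_iters=5000, step=0.1)
```

The experiments below use the step size 2/(L + nμ) noted in the figure legends.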
[Figures: comparison of pegasos, RDA, and SAG (step size 2/(L + nμ)) on two datasets; left panels show the training objective (log scale) and right panels test performance, both against effective passes over the data (0 to 30).]