
CS390FF: Special Topics in Data Sciences: Big Data Optimization

KAUST, Fall 2017

1.4 Convergence Analysis of the Basic Method


Covariance Matrix and Total Variance of a Random Vector

Definition 35 (Covariance matrix)
If x ∈ ℝ^n is a random vector, then the matrix

    Var(x) := E[(x − E[x])(x − E[x])^⊤]

is called the covariance matrix of x.

Definition 36 (Total variance)
If x ∈ ℝ^n is a random vector, then the value

    TVar(x) := E[(x − E[x])^⊤ (x − E[x])] = E[‖x − E[x]‖²]

is called the total variance of x.

Exercise 8
Let x ∈ ℝ^n be a random vector. Show that:
(i) The total variance is the trace of the covariance matrix: TVar(x) = Tr(Var(x)).
(ii) TVar(U^⊤ B^{1/2} x) = E[‖x − E[x]‖²_B].
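Both claims of Exercise 8 can be sanity-checked by Monte Carlo before proving them. The sketch below is illustrative only; the dimension, distribution, seed, and the particular B ≻ 0 and orthonormal U are arbitrary made-up choices, not data from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000

M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                        # some B ≻ 0
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T             # B^{1/2} via spectral decomposition
U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # an orthonormal U, as in (33)

X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n)).T + 3.0
Xc = X - X.mean(axis=0)                            # x - E[x] (empirical)

Var = Xc.T @ Xc / N                                # covariance matrix (Definition 35)
print(np.sum(Xc**2) / N, np.trace(Var))            # (i): TVar(x) = Tr(Var(x))

Y = Xc @ (U.T @ B_half).T                          # samples of U^T B^{1/2} (x - E[x])
print(np.sum(Y**2) / N,                            # (ii): TVar(U^T B^{1/2} x) ...
      np.mean(np.einsum('ij,jk,ik->i', Xc, B, Xc)))  # ... = E ||x - E[x]||_B^2
```

Both identities hold samplewise for the empirical distribution, so the printed pairs agree up to floating-point error.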
Strong vs Weak Convergence

Definition 37 (Strong and weak convergence)
We say that a sequence of random vectors {x_k} converges to x_*
▶ weakly if ‖E[x_k − x_*]‖²_B → 0 as k → ∞,
▶ strongly if E[‖x_k − x_*‖²_B] → 0 as k → ∞ (a.k.a. L² convergence).

The following lemma explains why strong convergence is a stronger convergence concept than weak convergence.

Lemma 38
For any random vector x_k ∈ ℝ^n and any x_* ∈ ℝ^n we have the identity

    E[‖x_k − x_*‖²_B] = ‖E[x_k − x_*]‖²_B + E[‖x_k − E[x_k]‖²_B],

where, by Exercise 8(ii), the last term equals TVar(U^⊤ B^{1/2} x_k). As a consequence, strong convergence implies
▶ weak convergence,
▶ convergence of TVar(U^⊤ B^{1/2} x_k) to zero.
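A minimal numerical illustration of the decomposition in Lemma 38 follows; the distribution of x_k, the point x_*, and B are again arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 200_000

M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)                         # any B > 0 defining the B-norm

X = 0.3 * rng.standard_normal((N, n)) + 2.0     # samples of the random vector x_k
x_star = rng.standard_normal(n)                 # an arbitrary fixed x_*

def mean_sq_B(V):                               # E ||v||_B^2 over the rows of V
    return np.mean(np.einsum('ij,jk,ik->i', V, B, V))

strong = mean_sq_B(X - x_star)                  # E ||x_k - x_*||_B^2
d = X.mean(axis=0) - x_star
weak = d @ B @ d                                # ||E[x_k - x_*]||_B^2
tvar = mean_sq_B(X - X.mean(axis=0))            # E ||x_k - E[x_k]||_B^2

print(strong, weak + tvar)                      # equal up to floating point
```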

Proof of Lemma 38

Let μ = E[x_k]. Then

    E[‖x_k − x_*‖²_B] = E[‖x_k − μ + μ − x_*‖²_B]
                      = E[‖x_k − μ‖²_B + ‖μ − x_*‖²_B + 2⟨x_k − μ, μ − x_*⟩_B]
                      = E[‖x_k − μ‖²_B] + ‖μ − x_*‖²_B + 2⟨E[x_k − μ], μ − x_*⟩_B
                      = E[‖x_k − μ‖²_B] + ‖μ − x_*‖²_B,

where the inner-product term vanishes because E[x_k − μ] = 0. In the first step we have expanded the square, and in the second step we have used linearity of expectation.

Weak Convergence


Theorem 39 (Weak Convergence 1)
Choose any x_0 ∈ ℝ^n and let {x_k} be the random iterates produced by Algorithm 2. Let x_* ∈ L be chosen arbitrarily. Then

    E[x_{k+1} − x_*] = (I − ω B^{−1} E[Z]) E[x_k − x_*].                        (35)

Moreover, by transforming the error via the linear mapping h ↦ U^⊤ B^{1/2} h, this can be written in the form

    E[U^⊤ B^{1/2} (x_k − x_*)] = (I − ωΛ)^k U^⊤ B^{1/2} (x_0 − x_*),            (36)

which is separable in the coordinates of the transformed error:

    E[u_i^⊤ B^{1/2} (x_k − x_*)] = (1 − ωλ_i)^k u_i^⊤ B^{1/2} (x_0 − x_*),  i = 1, 2, …, n.   (37)

Finally,

    ‖E[x_k − x_*]‖²_B = Σ_{i=1}^n (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))².   (38)
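To make (36)–(38) concrete, the sketch below evaluates the coordinatewise recursion (37) and the closed form (38) directly; the spectrum λ and the initial transformed error s_0 = U^⊤ B^{1/2}(x_0 − x_*) are made-up data.

```python
import numpy as np

omega = 1.0
lam = np.array([0.9, 0.5, 0.05, 0.0])     # eigenvalues of B^{-1/2} E[Z] B^{-1/2}
s0 = np.array([1.0, -2.0, 0.5, 3.0])      # s_0 = U^T B^{1/2} (x_0 - x_*)

for k in [0, 1, 5, 20, 100]:
    s_k = (1 - omega * lam) ** k * s0     # coordinatewise decay, as in (37)
    print(k, np.sum(s_k ** 2))            # = ||E[x_k - x_*]||_B^2, by (38)
```

Note that the coordinate with λ_i = 0 never contracts; Theorem 40 below shows that choosing x_* = Π^B_L(x_0) zeroes out exactly those coordinates.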
Theorem 40 (Convergence 2)
Let x_* = Π^B_L(x_0). Then for all i = 1, 2, …, n,

    E[u_i^⊤ B^{1/2} (x_k − x_*)] = 0                                          if λ_i = 0,
                                 = (1 − ωλ_i)^k u_i^⊤ B^{1/2} (x_0 − x_*)     if λ_i > 0.   (39)

Moreover,

    ‖E[x_k − x_*]‖²_B ≤ ρ^k(ω) ‖x_0 − x_*‖²_B,                                (40)

where the rate is given by

    ρ(ω) := max_{i : λ_i > 0} (1 − ωλ_i)².                                    (41)


Necessary and Sufficient Conditions for Convergence

Corollary 41 (Necessary and sufficient conditions)
Let Assumption 3 (exactness) hold. Choose any x_0 ∈ ℝ^n and let x_* = Π^B_L(x_0). If {x_k} are the random iterates produced by Algorithm 2, then the following statements are equivalent (a numerical illustration follows below):
(i) |1 − ωλ_i| < 1 for all i for which λ_i > 0
(ii) 0 < ω < 2/λ_max
(iii) E[u_i^⊤ B^{1/2} (x_k − x_*)] → 0 for all i
(iv) ‖E[x_k − x_*]‖²_B → 0
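Here is a quick scan of the equivalence (i) ⟺ (ii) on a made-up spectrum; (iii) and (iv) involve the iterates themselves and follow from (37) and (38).

```python
import numpy as np

lam = np.array([0.8, 0.3, 0.02, 0.0])              # made-up eigenvalues, lam_max = 0.8
pos = lam[lam > 0]

for omega in [-0.1, 0.0, 1.0, 2.4, 2.5, 3.0]:
    cond_i = bool(np.all(np.abs(1 - omega * pos) < 1))   # condition (i)
    cond_ii = 0 < omega < 2 / lam.max()                  # condition (ii)
    print(omega, cond_i, cond_ii)                        # the two columns always match
```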

Proof of Theorems 39 and 40 - I

We start with a lemma.

Lemma 42
Let Assumption 3 (exactness) hold. Consider an arbitrary x ∈ ℝ^n and let x_* = Π^B_L(x). If λ_i = 0, then u_i^⊤ B^{1/2} (x − x_*) = 0.

Proof.
From (19) we see that x − x_* = B^{−1} A^⊤ w for some w ∈ ℝ^m. Therefore, u_i^⊤ B^{1/2} (x − x_*) = u_i^⊤ B^{−1/2} A^⊤ w. By Theorem 29, we have Range(u_i : λ_i = 0) = Null(A B^{−1/2}), from which it follows that u_i^⊤ B^{−1/2} A^⊤ = 0. ∎

Proof of Theorem 39: Algorithm 2 can be written in the form

    e_{k+1} = (I − ω B^{−1} Z_k) e_k,                                         (42)

where e_k = x_k − x_*. Multiplying both sides of this equation by B^{1/2} from the left, and taking expectation conditional on e_k, we obtain

    E[B^{1/2} e_{k+1} | e_k] = (I − ω B^{−1/2} E[Z] B^{−1/2}) B^{1/2} e_k.


Proof of Theorems 39 and 40 - II

Taking expectations on both sides and using the tower property, we get

    E[B^{1/2} e_{k+1}] = E[E[B^{1/2} e_{k+1} | e_k]] = (I − ω B^{−1/2} E[Z] B^{−1/2}) E[B^{1/2} e_k].

We now replace B^{−1/2} E[Z] B^{−1/2} by its eigenvalue decomposition U Λ U^⊤ (see (33)), multiply both sides of the last identity by U^⊤ from the left, and use linearity of expectation to obtain

    E[U^⊤ B^{1/2} e_{k+1}] = (I − ωΛ) E[U^⊤ B^{1/2} e_k].

Unrolling the recurrence, we get (36). When this is written coordinate-by-coordinate, (37) follows. Identity (38) follows immediately by equating standard Euclidean norms of both sides of (36).

Proof of Theorem 40: If x_* = Π^B_L(x_0), then from Lemma 42 we see that λ_i = 0 implies u_i^⊤ B^{1/2} (x_0 − x_*) = 0. Using this in (37) gives (39).

Proof of Theorems 39 and 40 - III

Finally, inequality (40) follows from (38):

    ‖E[x_k − x_*]‖²_B = Σ_{i=1}^n (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = Σ_{i : λ_i > 0} (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      ≤ ρ^k(ω) Σ_{i : λ_i > 0} (u_i^⊤ B^{1/2} (x_0 − x_*))²                    (by (41))
                      = ρ^k(ω) Σ_{i : λ_i > 0} (u_i^⊤ B^{1/2} (x_0 − x_*))² + ρ^k(ω) Σ_{i : λ_i = 0} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = ρ^k(ω) Σ_i (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = ρ^k(ω) Σ_i (x_0 − x_*)^⊤ B^{1/2} u_i u_i^⊤ B^{1/2} (x_0 − x_*)
                      = ρ^k(ω) (x_0 − x_*)^⊤ B^{1/2} (Σ_i u_i u_i^⊤) B^{1/2} (x_0 − x_*) = ρ^k(ω) ‖x_0 − x_*‖²_B.

Here the terms with λ_i = 0 can be added back in since they vanish by Lemma 42, and the last identity follows from the fact that Σ_i u_i u_i^⊤ = U U^⊤ = I.


Optimal Stepsize Choice for Weak Convergence

Convergence Rate as a Function of ω

We now consider the problem of choosing the stepsize (relaxation) parameter ω. In view of (40) and (41), the optimal relaxation parameter is the one solving the following optimization problem:

    min_{ω ∈ ℝ} ρ(ω) = max_{i : λ_i > 0} (1 − ωλ_i)².                         (43)

We solve this problem analytically in the next result (Theorem 43); a numerical sanity check follows below.
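Before the proof, one can locate the minimizer of (43) numerically. The sketch below scans ω over a grid for a made-up spectrum and compares the argmin with ω* = 2/(λ_min^+ + λ_max); all names and values are hypothetical.

```python
import numpy as np

lam = np.array([0.7, 0.25, 0.1, 0.01, 0.0])
pos = lam[lam > 0]
lam_max, lam_min_plus = pos.max(), pos.min()

def rho(omega):                                   # rho(omega), as in (41)
    return np.max((1 - omega * pos) ** 2)

grid = np.linspace(0.0, 2.0 / lam_max, 100_001)
best = grid[np.argmin([rho(w) for w in grid])]

print(best)                                       # ~2.8169 (numerical argmin)
print(2.0 / (lam_min_plus + lam_max))             # ~2.8169 (omega* of Theorem 43)
```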


Optimal Stepsize

Theorem 43 (Stepsize choice)
Let ω* := 2/(λ_min^+ + λ_max). Then the objective of (43) is given by

    ρ(ω) = (1 − ωλ_max)²    if ω ≤ 0,
           (1 − ωλ_min^+)²  if 0 ≤ ω ≤ ω*,                                    (44)
           (1 − ωλ_max)²    if ω ≥ ω*.

Moreover, ρ is decreasing on (−∞, ω*] and increasing on [ω*, +∞), and hence the optimal solution of (43) is ω*. Further, we have:
(i) If we choose ω = 1 (no over-relaxation), then

    ρ(1) = (1 − λ_min^+)².                                                    (45)

(ii) If we choose ω = 1/λ_max (over-relaxation), then

    ρ(1/λ_max) = (1 − λ_min^+/λ_max)² = (1 − 1/ζ)²,                           (46)

where the last step uses the condition number ζ = λ_max/λ_min^+ from (34).
(iii) If we choose ω = ω* (optimal over-relaxation), the optimal rate is

    ρ(ω*) = (1 − 2λ_min^+/(λ_min^+ + λ_max))² = (1 − 2/(ζ + 1))².             (47)
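A worked numeric instance of (45)–(47), with an assumed spectrum λ_min^+ = 0.01 and λ_max = 0.5 (so ζ = 50); the numbers are illustrative only.

```python
lam_min_plus, lam_max = 0.01, 0.5
zeta = lam_max / lam_min_plus                     # condition number from (34): 50

rho_1 = (1 - lam_min_plus) ** 2                   # (45): 0.9801
rho_over = (1 - 1 / zeta) ** 2                    # (46): 0.9604
rho_star = (1 - 2 / (zeta + 1)) ** 2              # (47): ~0.9231

print(rho_1, rho_over, rho_star)                  # optimal over-relaxation is best
```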
Proof of Theorem 43

Recall that λ_max ≤ 1. Letting

    ρ_i(ω) = (1 − ωλ_i)²,

it can be shown that

    ρ(ω) = max{ρ_j(ω), ρ_n(ω)},

where j is such that λ_j = λ_min^+ (and λ_n = λ_max). Note that ρ_j(ω) = ρ_n(ω) for ω ∈ {0, ω*}. From this we deduce that ρ_j ≤ ρ_n on (−∞, 0], ρ_j ≥ ρ_n on [0, ω*], and ρ_j ≤ ρ_n on [ω*, +∞), obtaining (44). We see that ρ is decreasing on (−∞, ω*], and increasing on [ω*, +∞).

The remaining results follow directly by plugging specific values of ω into (44).


Strong Convergence

Decrease of Distance is Proportional to f_S

Lemma 44 (Decrease of distance)
Choose x_0 ∈ ℝ^n and let {x_k}_{k=0}^∞ be the random iterates produced by Algorithm 2, with an arbitrary relaxation parameter ω ∈ ℝ. Let x_* ∈ L. Then we have the identities ‖x_{k+1} − x_k‖²_B = 2ω² f_{S_k}(x_k) and

    ‖x_{k+1} − x_*‖²_B = ‖x_k − x_*‖²_B − 2ω(2 − ω) f_{S_k}(x_k).            (48)

Moreover, E[‖x_{k+1} − x_k‖²_B] = 2ω² E[f(x_k)] and

    E[‖x_{k+1} − x_*‖²_B] = E[‖x_k − x_*‖²_B] − 2ω(2 − ω) E[f(x_k)].         (49)

Remark: Equation (48) says that for any x_* ∈ L, in the k-th iteration of Algorithm 2 the distance of the current iterate from x_* decreases by the amount 2ω(2 − ω) f_{S_k}(x_k). A numerical check on a concrete instance follows below.
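Identity (48) holds deterministically, so it can be verified exactly on a concrete instance. The sketch below assumes the sketch-and-project form of Algorithm 2 from earlier in the notes, specialized to B = I with S_k a uniformly random coordinate vector, i.e. randomized Kaczmarz, in which case f_{S_k}(x) = (a_i^⊤ x − b_i)² / (2‖a_i‖²) for the sampled row a_i; this is a check under those assumptions, not the general method.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, omega = 6, 4, 1.5

A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)                   # a solution, so x_* lies in L
b = A @ x_star
x = rng.standard_normal(n)                        # x_0

for k in range(5):
    i = rng.integers(m)                           # sample a row (the "sketch" S_k)
    a, res = A[i], A[i] @ x - b[i]
    f_S = res ** 2 / (2 * (a @ a))                # f_{S_k}(x_k) for this instance
    x_new = x - omega * res / (a @ a) * a         # Algorithm 2 step with B = I

    lhs = np.sum((x_new - x_star) ** 2)
    rhs = np.sum((x - x_star) ** 2) - 2 * omega * (2 - omega) * f_S
    print(np.allclose(lhs, rhs),                  # identity (48) holds exactly
          np.allclose(np.sum((x_new - x) ** 2), 2 * omega ** 2 * f_S))
    x = x_new
```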


Lower Bound on a Quadratic

Lemma 45
Let Assumption 3 be satisfied. Then the inequality

    x^⊤ B^{−1/2} E[Z] B^{−1/2} x ≥ λ_min^+(B^{−1/2} E[Z] B^{−1/2}) x^⊤ x     (50)

holds for all x ∈ Range(B^{−1/2} A^⊤).

Proof.
It is known that for any matrix M ∈ ℝ^{m×n}, the inequality

    x^⊤ M^⊤ M x ≥ λ_min^+(M^⊤ M) x^⊤ x

holds for all x ∈ Range(M^⊤). Applying this with M = (E[Z])^{1/2} B^{−1/2}, we see that (50) holds for all x ∈ Range(B^{−1/2} (E[Z])^{1/2}). However,

    Range(B^{−1/2} (E[Z])^{1/2}) = Range(B^{−1/2} (E[Z])^{1/2} (B^{−1/2} (E[Z])^{1/2})^⊤)
                                 = Range(B^{−1/2} E[Z] B^{−1/2}) = Range(B^{−1/2} A^⊤),

where the last identity follows by combining Assumption 3 and Theorem 29. ∎
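The generic fact used in the proof can itself be checked numerically: for x in Range(M^⊤), the quadratic x^⊤ M^⊤ M x is bounded below by λ_min^+(M^⊤M) ‖x‖². Sizes and data below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((3, 6))                # fat M, so M^T M (6x6) is singular

G = M.T @ M
evals = np.linalg.eigvalsh(G)
lam_min_plus = evals[evals > 1e-10].min()      # smallest nonzero eigenvalue of M^T M

for _ in range(5):
    x = M.T @ rng.standard_normal(3)           # x in Range(M^T)
    print(x @ G @ x + 1e-12 >= lam_min_plus * (x @ x))   # always True
```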
Proof of Lemma 44 - I

Recall that Algorithm 2 performs the update

    x_{k+1} = x_k − ω B^{−1} Z_k (x_k − x_*).

From this we get

    ‖x_{k+1} − x_k‖²_B = ω² ‖B^{−1} Z_k (x_k − x_*)‖²_B
                       = ω² (x_k − x_*)^⊤ Z_k (x_k − x_*)                    (by (21))
                       = 2ω² f_{S_k}(x_k).                                   (51)  (by (22))

In a similar vein,

    ‖x_{k+1} − x_*‖²_B = ‖(I − ω B^{−1} Z_k)(x_k − x_*)‖²_B
                       = (x_k − x_*)^⊤ (I − ω Z_k B^{−1}) B (I − ω B^{−1} Z_k)(x_k − x_*)
                       = (x_k − x_*)^⊤ (B − ω(2 − ω) Z_k)(x_k − x_*)         (by (21))
                       = ‖x_k − x_*‖²_B − 2ω(2 − ω) f_{S_k}(x_k),            (52)  (by (22))

Proof of Lemma 44 - II

establishing (48).

Taking expectation in (51) and using the tower property, we get

    E[‖x_{k+1} − x_k‖²_B] = E[E[‖x_{k+1} − x_k‖²_B | x_k]]
                          = 2ω² E[E[f_{S_k}(x_k) | x_k]]                     (by (51))
                          = 2ω² E[f(x_k)],

where in the last step we have used the definition of f.

Taking expectation in (48), we get

    E[‖x_{k+1} − x_*‖²_B] = E[E[‖x_{k+1} − x_*‖²_B | x_k]]
                          = E[‖x_k − x_*‖²_B − 2ω(2 − ω) f(x_k)]             (by (52))
                          = E[‖x_k − x_*‖²_B] − 2ω(2 − ω) E[f(x_k)]. ∎

Quadratic Bounds

Lemma 46 (Quadratic bounds)
For all x ∈ ℝ^n and x_* ∈ L we have

    λ_min^+ · f(x) ≤ (1/2) ‖∇f(x)‖²_B ≤ λ_max · f(x)                         (53)

and

    f(x) ≤ (λ_max/2) ‖x − x_*‖²_B.                                           (54)

Moreover, if Assumption 3 holds, then for all x ∈ ℝ^n and x_* = Π^B_L(x) we have

    (λ_min^+/2) ‖x − x_*‖²_B ≤ f(x).                                         (55)
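A numerical sanity check of (53)–(55) in the simplest setting B = I, where f(x) = ½(x − x_*)^⊤ E[Z] (x − x_*) by (17) and ∇f(x) = E[Z](x − x_*) by (28). Below, W is a made-up stand-in for E[Z], scaled so λ_max ≤ 1, and x − x_* is drawn from Range(W) so that the projection condition of (55) holds.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5

C = rng.standard_normal((n, 3))                 # rank 3, so W has some zero eigenvalues
W = C @ C.T
W /= np.linalg.eigvalsh(W).max()                # normalize so that lam_max = 1

evals = np.linalg.eigvalsh(W)
lam_max = evals.max()
lam_min_plus = evals[evals > 1e-10].min()

for _ in range(5):
    e = W @ rng.standard_normal(n)              # e = x - x_*, lies in Range(W)
    f = 0.5 * e @ W @ e                         # f(x), via (17) with B = I
    g = W @ e                                   # grad f(x), via (28) with B = I
    ok53 = (lam_min_plus * f <= 0.5 * (g @ g) + 1e-12) and \
           (0.5 * (g @ g) <= lam_max * f + 1e-12)
    ok54 = f <= lam_max / 2 * (e @ e) + 1e-12
    ok55 = lam_min_plus / 2 * (e @ e) <= f + 1e-12
    print(ok53, ok54, ok55)                     # all True
```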


Proof of Lemma 46 - I

In view of (17) and (33), we obtain a spectral characterization of f:

    f(x) = (1/2) Σ_{i=1}^n λ_i (u_i^⊤ B^{1/2} (x − x_*))²,                   (56)

where x_* is any point in L. On the other hand, in view of (28) and (33), we have

    ‖∇f(x)‖²_B = ‖B^{−1} E[Z] (x − x_*)‖²_B                                  (57)
               = (x − x_*)^⊤ E[Z] B^{−1} E[Z] (x − x_*)
               = (x − x_*)^⊤ B^{1/2} (B^{−1/2} E[Z] B^{−1/2})(B^{−1/2} E[Z] B^{−1/2}) B^{1/2} (x − x_*)
               = (x − x_*)^⊤ B^{1/2} U (U^⊤ B^{−1/2} E[Z] B^{−1/2} U)² U^⊤ B^{1/2} (x − x_*)
               = (x − x_*)^⊤ B^{1/2} U Λ² U^⊤ B^{1/2} (x − x_*)              (by (33))
               = Σ_{i=1}^n λ_i² (u_i^⊤ B^{1/2} (x − x_*))².                  (58)

Inequality (53) follows by comparing (56) and (58), using the bounds

    λ_min^+ λ_i ≤ λ_i² ≤ λ_max λ_i,

which hold for those i for which λ_i > 0.


Proof of Lemma 46 - II

We now move to the bounds involving norms. First, note that for any x_* ∈ L we have

    f(x) = (1/2) (x − x_*)^⊤ E[Z] (x − x_*)                                  (59)  (by (17))
         = (1/2) (B^{1/2} (x − x_*))^⊤ (B^{−1/2} E[Z] B^{−1/2}) B^{1/2} (x − x_*).

The upper bound (54) follows by applying the inequality B^{−1/2} E[Z] B^{−1/2} ⪯ λ_max I.

If x_* = Π^B_L(x), then in view of (19), we have

    B^{1/2} (x − x_*) ∈ Range(B^{−1/2} A^⊤).

Applying Lemma 45 to (59), we get the lower bound (55).


Theorem 47 (Strong convergence)
Let Assumption 3 (exactness) hold and set x_* = Π^B_L(x_0). Let {x_k} be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < ω < 2, and let r_k := E[‖x_k − x_*‖²_B]. Then for all k ≥ 0 we have

    (1 − ω(2 − ω) λ_max)^k r_0 ≤ r_k ≤ (1 − ω(2 − ω) λ_min^+)^k r_0.         (60)

The best rate is achieved when ω = 1, as the short calculation below shows.

Proof.
Let φ_k = E[f(x_k)]. We have

    r_{k+1} = r_k − 2ω(2 − ω) φ_k ≤ r_k − ω(2 − ω) λ_min^+ r_k               (by (49) and (55))

and

    r_{k+1} = r_k − 2ω(2 − ω) φ_k ≥ r_k − ω(2 − ω) λ_max r_k                 (by (49) and (54)).

Inequalities (60) follow from this by unrolling the recurrences. ∎
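Why ω = 1 is optimal in (60): since ω(2 − ω) = 1 − (1 − ω)², the upper rate can be rewritten as

    1 − ω(2 − ω) λ_min^+ = (1 − λ_min^+) + (1 − ω)² λ_min^+ ≥ 1 − λ_min^+,

with equality if and only if ω = 1.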

Convergence of f(x_k)


Theorem 48 (Convergence of f)
Choose x_0 ∈ ℝ^n, and let {x_k}_{k=0}^∞ be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < ω < 2.

(i) Let x_* ∈ L. The average iterate x̂_k := (1/k) Σ_{t=0}^{k−1} x_t satisfies, for all k ≥ 1,

    E[f(x̂_k)] ≤ ‖x_0 − x_*‖²_B / (2ω(2 − ω) k).                              (61)

(ii) Now let Assumption 3 hold. For x_* = Π^B_L(x_0) and k ≥ 0 we have

    E[f(x_k)] ≤ (1 − ω(2 − ω) λ_min^+)^k (λ_max/2) ‖x_0 − x_*‖²_B.           (62)

The best rate is achieved when ω = 1.
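Bound (62) can be observed empirically on the same randomized Kaczmarz instance of Algorithm 2 used earlier (B = I, uniform row sampling); here E[Z] = (1/m) Σ_i a_i a_i^⊤/‖a_i‖² and f(x) = (1/m) Σ_i (a_i^⊤ x − b_i)²/(2‖a_i‖²). With m > n and a generic A, the system has a unique solution, so Π_L(x_0) is that solution. All sizes and the run count are arbitrary; the comparison holds up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, omega, K, runs = 8, 4, 1.0, 60, 2000

A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)                   # the unique solution: Pi_L(x_0)
b = A @ x_star
x0 = rng.standard_normal(n)

W = sum(np.outer(a, a) / (a @ a) for a in A) / m  # E[Z] for this instance (B = I)
evals = np.linalg.eigvalsh(W)
lam_max = evals.max()
lam_min_plus = evals[evals > 1e-10].min()

row_sq = np.sum(A ** 2, axis=1)
def f(x):                                         # f(x) = E[f_S(x)]
    return np.mean((A @ x - b) ** 2 / (2 * row_sq))

errs = np.zeros(K + 1)
for _ in range(runs):
    x = x0.copy()
    for k in range(K):
        i = rng.integers(m)
        a = A[i]
        x = x - omega * (a @ x - b[i]) / (a @ a) * a
        errs[k + 1] += f(x)
errs /= runs

dist0 = np.sum((x0 - x_star) ** 2)
for k in [10, 30, 60]:
    bound = (1 - omega * (2 - omega) * lam_min_plus) ** k * lam_max / 2 * dist0
    print(k, errs[k], bound)                      # estimated E[f(x_k)] stays below (62)
```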

Proof of Theorem 48

(i) Let φ_k = E[f(x_k)] and r_k = E[‖x_k − x_*‖²_B]. By summing up the identities from (49), we get

    2ω(2 − ω) Σ_{t=0}^{k−1} φ_t = r_0 − r_k.

Therefore, using Jensen's inequality, we get

    E[f(x̂_k)] ≤ E[(1/k) Σ_{t=0}^{k−1} f(x_t)] = (1/k) Σ_{t=0}^{k−1} φ_t = (r_0 − r_k) / (2ω(2 − ω) k) ≤ r_0 / (2ω(2 − ω) k).

(ii) Combining inequality (54) with Theorem 47, we get

    E[f(x_k)] ≤ (λ_max/2) E[‖x_k − x_*‖²_B] ≤ (1 − ω(2 − ω) λ_min^+)^k (λ_max/2) ‖x_0 − x_*‖²_B,   (by (60))

as claimed. ∎


