
CS390FF: Special Topics in Data Sciences: Big Data Optimization

KAUST, Fall 2017

1.4 Convergence Analysis of the Basic Method


Covariance Matrix and Total Variance of a Random Vector

Definition 35 (Covariance matrix)
If x ∈ ℝ^n is a random vector, then the matrix

    Var(x) := E[(x − E[x])(x − E[x])^⊤]

is called the covariance matrix of x.

Definition 36 (Total variance)
If x ∈ ℝ^n is a random vector, then the value

    TVar(x) := E[(x − E[x])^⊤ (x − E[x])] = E[‖x − E[x]‖²]

is called the total variance of x.

Exercise 8
Let x ∈ ℝ^n be a random vector. Show that:
(i) The total variance is the trace of the covariance matrix: TVar(x) = Tr(Var(x)).
(ii) TVar(U^⊤ B^{1/2} x) = E[‖x − E[x]‖²_B].
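Both claims of Exercise 8 can be sanity-checked by Monte Carlo before proving them. The sketch below is illustrative only; the dimension, distribution, seed, and the particular B ≻ 0 and orthonormal U are arbitrary made-up choices, not data from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000

M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                        # some B ≻ 0
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T             # B^{1/2} via spectral decomposition
U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # an orthonormal U, as in (33)

X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n)).T + 3.0
Xc = X - X.mean(axis=0)                            # x - E[x] (empirical)

Var = Xc.T @ Xc / N                                # covariance matrix (Definition 35)
print(np.sum(Xc**2) / N, np.trace(Var))            # (i): TVar(x) = Tr(Var(x))

Y = Xc @ (U.T @ B_half).T                          # samples of U^T B^{1/2} (x - E[x])
print(np.sum(Y**2) / N,                            # (ii): TVar(U^T B^{1/2} x) ...
      np.mean(np.einsum('ij,jk,ik->i', Xc, B, Xc)))  # ... = E ||x - E[x]||_B^2
```

Both identities hold samplewise for the empirical distribution, so the printed pairs agree up to floating-point error.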
Strong vs Weak Convergence

Definition 37 (Strong and weak convergence)
We say that a sequence of random vectors {x_k} converges to x_*
▶ weakly if ‖E[x_k − x_*]‖²_B → 0 as k → ∞,
▶ strongly if E[‖x_k − x_*‖²_B] → 0 as k → ∞ (a.k.a. L² convergence).

The following lemma explains why strong convergence is a stronger convergence concept than weak convergence.

Lemma 38
For any random vector x_k ∈ ℝ^n and any x_* ∈ ℝ^n we have the identity

    E[‖x_k − x_*‖²_B] = ‖E[x_k − x_*]‖²_B + E[‖x_k − E[x_k]‖²_B],

where, by Exercise 8(ii), the last term equals TVar(U^⊤ B^{1/2} x_k). As a consequence, strong convergence implies
▶ weak convergence,
▶ convergence of TVar(U^⊤ B^{1/2} x_k) to zero.
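A minimal numerical illustration of the decomposition in Lemma 38 follows; the distribution of x_k, the point x_*, and B are again arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 200_000

M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)                         # any B > 0 defining the B-norm

X = 0.3 * rng.standard_normal((N, n)) + 2.0     # samples of the random vector x_k
x_star = rng.standard_normal(n)                 # an arbitrary fixed x_*

def mean_sq_B(V):                               # E ||v||_B^2 over the rows of V
    return np.mean(np.einsum('ij,jk,ik->i', V, B, V))

strong = mean_sq_B(X - x_star)                  # E ||x_k - x_*||_B^2
d = X.mean(axis=0) - x_star
weak = d @ B @ d                                # ||E[x_k - x_*]||_B^2
tvar = mean_sq_B(X - X.mean(axis=0))            # E ||x_k - E[x_k]||_B^2

print(strong, weak + tvar)                      # equal up to floating point
```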

Proof of Lemma 38

Let μ = E[x_k]. Then

    E[‖x_k − x_*‖²_B] = E[‖x_k − μ + μ − x_*‖²_B]
                      = E[‖x_k − μ‖²_B + ‖μ − x_*‖²_B + 2⟨x_k − μ, μ − x_*⟩_B]
                      = E[‖x_k − μ‖²_B] + ‖μ − x_*‖²_B + 2⟨E[x_k − μ], μ − x_*⟩_B
                      = E[‖x_k − μ‖²_B] + ‖μ − x_*‖²_B,

where the inner-product term vanishes because E[x_k − μ] = 0. In the first step we have expanded the square, and in the second step we have used linearity of expectation.

Weak Convergence


Theorem 39 (Weak Convergence 1)
Choose any x_0 ∈ ℝ^n and let {x_k} be the random iterates produced by Algorithm 2. Let x_* ∈ L be chosen arbitrarily. Then

    E[x_{k+1} − x_*] = (I − ω B^{−1} E[Z]) E[x_k − x_*].                        (35)

Moreover, by transforming the error via the linear mapping h ↦ U^⊤ B^{1/2} h, this can be written in the form

    E[U^⊤ B^{1/2} (x_k − x_*)] = (I − ωΛ)^k U^⊤ B^{1/2} (x_0 − x_*),            (36)

which is separable in the coordinates of the transformed error:

    E[u_i^⊤ B^{1/2} (x_k − x_*)] = (1 − ωλ_i)^k u_i^⊤ B^{1/2} (x_0 − x_*),  i = 1, 2, …, n.   (37)

Finally,

    ‖E[x_k − x_*]‖²_B = Σ_{i=1}^n (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))².   (38)
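To make (36)–(38) concrete, the sketch below evaluates the coordinatewise recursion (37) and the closed form (38) directly; the spectrum λ and the initial transformed error s_0 = U^⊤ B^{1/2}(x_0 − x_*) are made-up data.

```python
import numpy as np

omega = 1.0
lam = np.array([0.9, 0.5, 0.05, 0.0])     # eigenvalues of B^{-1/2} E[Z] B^{-1/2}
s0 = np.array([1.0, -2.0, 0.5, 3.0])      # s_0 = U^T B^{1/2} (x_0 - x_*)

for k in [0, 1, 5, 20, 100]:
    s_k = (1 - omega * lam) ** k * s0     # coordinatewise decay, as in (37)
    print(k, np.sum(s_k ** 2))            # = ||E[x_k - x_*]||_B^2, by (38)
```

Note that the coordinate with λ_i = 0 never contracts; Theorem 40 below shows that choosing x_* = Π^B_L(x_0) zeroes out exactly those coordinates.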
Theorem 40 (Convergence 2)
Let x_* = Π^B_L(x_0). Then for all i = 1, 2, …, n,

    E[u_i^⊤ B^{1/2} (x_k − x_*)] = 0                                          if λ_i = 0,
                                 = (1 − ωλ_i)^k u_i^⊤ B^{1/2} (x_0 − x_*)     if λ_i > 0.   (39)

Moreover,

    ‖E[x_k − x_*]‖²_B ≤ ρ^k(ω) ‖x_0 − x_*‖²_B,                                (40)

where the rate is given by

    ρ(ω) := max_{i : λ_i > 0} (1 − ωλ_i)².                                    (41)


Necessary and Sufficient Conditions for Convergence

Corollary 41 (Necessary and sufficient conditions)
Let Assumption 3 (exactness) hold. Choose any x_0 ∈ ℝ^n and let x_* = Π^B_L(x_0). If {x_k} are the random iterates produced by Algorithm 2, then the following statements are equivalent (a numerical illustration follows below):
(i) |1 − ωλ_i| < 1 for all i for which λ_i > 0
(ii) 0 < ω < 2/λ_max
(iii) E[u_i^⊤ B^{1/2} (x_k − x_*)] → 0 for all i
(iv) ‖E[x_k − x_*]‖²_B → 0
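Here is a quick scan of the equivalence (i) ⟺ (ii) on a made-up spectrum; (iii) and (iv) involve the iterates themselves and follow from (37) and (38).

```python
import numpy as np

lam = np.array([0.8, 0.3, 0.02, 0.0])              # made-up eigenvalues, lam_max = 0.8
pos = lam[lam > 0]

for omega in [-0.1, 0.0, 1.0, 2.4, 2.5, 3.0]:
    cond_i = bool(np.all(np.abs(1 - omega * pos) < 1))   # condition (i)
    cond_ii = 0 < omega < 2 / lam.max()                  # condition (ii)
    print(omega, cond_i, cond_ii)                        # the two columns always match
```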

Proof of Theorems 39 and 40 - I

We start with a lemma.

Lemma 42
Let Assumption 3 (exactness) hold. Consider an arbitrary x ∈ ℝ^n and let x_* = Π^B_L(x). If λ_i = 0, then u_i^⊤ B^{1/2} (x − x_*) = 0.

Proof.
From (19) we see that x − x_* = B^{−1} A^⊤ w for some w ∈ ℝ^m. Therefore, u_i^⊤ B^{1/2} (x − x_*) = u_i^⊤ B^{−1/2} A^⊤ w. By Theorem 29, we have Range(u_i : λ_i = 0) = Null(A B^{−1/2}), from which it follows that u_i^⊤ B^{−1/2} A^⊤ = 0. ∎

Proof of Theorem 39: Algorithm 2 can be written in the form

    e_{k+1} = (I − ω B^{−1} Z_k) e_k,                                         (42)

where e_k = x_k − x_*. Multiplying both sides of this equation by B^{1/2} from the left, and taking expectation conditional on e_k, we obtain

    E[B^{1/2} e_{k+1} | e_k] = (I − ω B^{−1/2} E[Z] B^{−1/2}) B^{1/2} e_k.


Proof of Theorems 39 and 40 - II

Taking expectations on both sides and using the tower property, we get

    E[B^{1/2} e_{k+1}] = E[E[B^{1/2} e_{k+1} | e_k]] = (I − ω B^{−1/2} E[Z] B^{−1/2}) E[B^{1/2} e_k].

We now replace B^{−1/2} E[Z] B^{−1/2} by its eigenvalue decomposition U Λ U^⊤ (see (33)), multiply both sides of the last identity by U^⊤ from the left, and use linearity of expectation to obtain

    E[U^⊤ B^{1/2} e_{k+1}] = (I − ωΛ) E[U^⊤ B^{1/2} e_k].

Unrolling the recurrence, we get (36). When this is written coordinate-by-coordinate, (37) follows. Identity (38) follows immediately by equating standard Euclidean norms of both sides of (36).

Proof of Theorem 40: If x_* = Π^B_L(x_0), then from Lemma 42 we see that λ_i = 0 implies u_i^⊤ B^{1/2} (x_0 − x_*) = 0. Using this in (37) gives (39).

Proof of Theorems 39 and 40 - III

Finally, inequality (40) follows from (38):

    ‖E[x_k − x_*]‖²_B = Σ_{i=1}^n (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = Σ_{i : λ_i > 0} (1 − ωλ_i)^{2k} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      ≤ ρ^k(ω) Σ_{i : λ_i > 0} (u_i^⊤ B^{1/2} (x_0 − x_*))²                    (by (41))
                      = ρ^k(ω) Σ_{i : λ_i > 0} (u_i^⊤ B^{1/2} (x_0 − x_*))² + ρ^k(ω) Σ_{i : λ_i = 0} (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = ρ^k(ω) Σ_i (u_i^⊤ B^{1/2} (x_0 − x_*))²
                      = ρ^k(ω) Σ_i (x_0 − x_*)^⊤ B^{1/2} u_i u_i^⊤ B^{1/2} (x_0 − x_*)
                      = ρ^k(ω) (x_0 − x_*)^⊤ B^{1/2} (Σ_i u_i u_i^⊤) B^{1/2} (x_0 − x_*) = ρ^k(ω) ‖x_0 − x_*‖²_B.

Here the terms with λ_i = 0 can be added back in since they vanish by Lemma 42, and the last identity follows from the fact that Σ_i u_i u_i^⊤ = U U^⊤ = I.


Optimal Stepsize Choice for Weak Convergence

Convergence Rate as a Function of ω

We now consider the problem of choosing the stepsize (relaxation) parameter ω. In view of (40) and (41), the optimal relaxation parameter is the one solving the following optimization problem:

    min_{ω ∈ ℝ} ρ(ω) = max_{i : λ_i > 0} (1 − ωλ_i)².                         (43)

We solve this problem analytically in the next result (Theorem 43); a numerical sanity check follows below.
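Before the proof, one can locate the minimizer of (43) numerically. The sketch below scans ω over a grid for a made-up spectrum and compares the argmin with ω* = 2/(λ_min^+ + λ_max); all names and values are hypothetical.

```python
import numpy as np

lam = np.array([0.7, 0.25, 0.1, 0.01, 0.0])
pos = lam[lam > 0]
lam_max, lam_min_plus = pos.max(), pos.min()

def rho(omega):                                   # rho(omega), as in (41)
    return np.max((1 - omega * pos) ** 2)

grid = np.linspace(0.0, 2.0 / lam_max, 100_001)
best = grid[np.argmin([rho(w) for w in grid])]

print(best)                                       # ~2.8169 (numerical argmin)
print(2.0 / (lam_min_plus + lam_max))             # ~2.8169 (omega* of Theorem 43)
```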


Optimal Stepsize

Theorem 43 (Stepsize choice)
Let ω* := 2/(λ_min^+ + λ_max). Then the objective of (43) is given by

    ρ(ω) = (1 − ωλ_max)²    if ω ≤ 0,
           (1 − ωλ_min^+)²  if 0 ≤ ω ≤ ω*,                                    (44)
           (1 − ωλ_max)²    if ω ≥ ω*.

Moreover, ρ is decreasing on (−∞, ω*] and increasing on [ω*, +∞), and hence the optimal solution of (43) is ω*. Further, we have:
(i) If we choose ω = 1 (no over-relaxation), then

    ρ(1) = (1 − λ_min^+)².                                                    (45)

(ii) If we choose ω = 1/λ_max (over-relaxation), then

    ρ(1/λ_max) = (1 − λ_min^+/λ_max)² = (1 − 1/ζ)²,                           (46)

where the last step uses the condition number ζ = λ_max/λ_min^+ from (34).
(iii) If we choose ω = ω* (optimal over-relaxation), the optimal rate is

    ρ(ω*) = (1 − 2λ_min^+/(λ_min^+ + λ_max))² = (1 − 2/(ζ + 1))².             (47)
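A worked numeric instance of (45)–(47), with an assumed spectrum λ_min^+ = 0.01 and λ_max = 0.5 (so ζ = 50); the numbers are illustrative only.

```python
lam_min_plus, lam_max = 0.01, 0.5
zeta = lam_max / lam_min_plus                     # condition number from (34): 50

rho_1 = (1 - lam_min_plus) ** 2                   # (45): 0.9801
rho_over = (1 - 1 / zeta) ** 2                    # (46): 0.9604
rho_star = (1 - 2 / (zeta + 1)) ** 2              # (47): ~0.9231

print(rho_1, rho_over, rho_star)                  # optimal over-relaxation is best
```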
Proof of Theorem 43

Recall that λ_max ≤ 1. Letting

    ρ_i(ω) = (1 − ωλ_i)²,

it can be shown that

    ρ(ω) = max{ρ_j(ω), ρ_n(ω)},

where j is such that λ_j = λ_min^+ (and λ_n = λ_max). Note that ρ_j(ω) = ρ_n(ω) for ω ∈ {0, ω*}. From this we deduce that ρ_j ≤ ρ_n on (−∞, 0], ρ_j ≥ ρ_n on [0, ω*], and ρ_j ≤ ρ_n on [ω*, +∞), obtaining (44). We see that ρ is decreasing on (−∞, ω*], and increasing on [ω*, +∞).

The remaining results follow directly by plugging specific values of ω into (44).


Strong Convergence

Decrease of Distance is Proportional to f_S

Lemma 44 (Decrease of distance)
Choose x_0 ∈ ℝ^n and let {x_k}_{k=0}^∞ be the random iterates produced by Algorithm 2, with an arbitrary relaxation parameter ω ∈ ℝ. Let x_* ∈ L. Then we have the identities ‖x_{k+1} − x_k‖²_B = 2ω² f_{S_k}(x_k) and

    ‖x_{k+1} − x_*‖²_B = ‖x_k − x_*‖²_B − 2ω(2 − ω) f_{S_k}(x_k).            (48)

Moreover, E[‖x_{k+1} − x_k‖²_B] = 2ω² E[f(x_k)] and

    E[‖x_{k+1} − x_*‖²_B] = E[‖x_k − x_*‖²_B] − 2ω(2 − ω) E[f(x_k)].         (49)

Remark: Equation (48) says that for any x_* ∈ L, in the k-th iteration of Algorithm 2 the distance of the current iterate from x_* decreases by the amount 2ω(2 − ω) f_{S_k}(x_k). A numerical check on a concrete instance follows below.
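Identity (48) holds deterministically, so it can be verified exactly on a concrete instance. The sketch below assumes the sketch-and-project form of Algorithm 2 from earlier in the notes, specialized to B = I with S_k a uniformly random coordinate vector, i.e. randomized Kaczmarz, in which case f_{S_k}(x) = (a_i^⊤ x − b_i)² / (2‖a_i‖²) for the sampled row a_i; this is a check under those assumptions, not the general method.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, omega = 6, 4, 1.5

A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)                   # a solution, so x_* lies in L
b = A @ x_star
x = rng.standard_normal(n)                        # x_0

for k in range(5):
    i = rng.integers(m)                           # sample a row (the "sketch" S_k)
    a, res = A[i], A[i] @ x - b[i]
    f_S = res ** 2 / (2 * (a @ a))                # f_{S_k}(x_k) for this instance
    x_new = x - omega * res / (a @ a) * a         # Algorithm 2 step with B = I

    lhs = np.sum((x_new - x_star) ** 2)
    rhs = np.sum((x - x_star) ** 2) - 2 * omega * (2 - omega) * f_S
    print(np.allclose(lhs, rhs),                  # identity (48) holds exactly
          np.allclose(np.sum((x_new - x) ** 2), 2 * omega ** 2 * f_S))
    x = x_new
```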


Lower Bound on a Quadratic

Lemma 45
Let Assumption 3 be satisfied. Then the inequality

    x^⊤ B^{−1/2} E[Z] B^{−1/2} x ≥ λ_min^+(B^{−1/2} E[Z] B^{−1/2}) x^⊤ x     (50)

holds for all x ∈ Range(B^{−1/2} A^⊤).

Proof.
It is known that for any matrix M ∈ ℝ^{m×n}, the inequality

    x^⊤ M^⊤ M x ≥ λ_min^+(M^⊤ M) x^⊤ x

holds for all x ∈ Range(M^⊤). Applying this with M = (E[Z])^{1/2} B^{−1/2}, we see that (50) holds for all x ∈ Range(B^{−1/2} (E[Z])^{1/2}). However,

    Range(B^{−1/2} (E[Z])^{1/2}) = Range(B^{−1/2} (E[Z])^{1/2} (B^{−1/2} (E[Z])^{1/2})^⊤)
                                 = Range(B^{−1/2} E[Z] B^{−1/2}) = Range(B^{−1/2} A^⊤),

where the last identity follows by combining Assumption 3 and Theorem 29. ∎
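The generic fact used in the proof can itself be checked numerically: for x in Range(M^⊤), the quadratic x^⊤ M^⊤ M x is bounded below by λ_min^+(M^⊤M) ‖x‖². Sizes and data below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((3, 6))                # fat M, so M^T M (6x6) is singular

G = M.T @ M
evals = np.linalg.eigvalsh(G)
lam_min_plus = evals[evals > 1e-10].min()      # smallest nonzero eigenvalue of M^T M

for _ in range(5):
    x = M.T @ rng.standard_normal(3)           # x in Range(M^T)
    print(x @ G @ x + 1e-12 >= lam_min_plus * (x @ x))   # always True
```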
Proof of Lemma 44 - I

Recall that Algorithm 2 performs the update

    x_{k+1} = x_k − ω B^{−1} Z_k (x_k − x_*).

From this we get

    ‖x_{k+1} − x_k‖²_B = ω² ‖B^{−1} Z_k (x_k − x_*)‖²_B
                       = ω² (x_k − x_*)^⊤ Z_k (x_k − x_*)                    (by (21))
                       = 2ω² f_{S_k}(x_k).                                   (51)  (by (22))

In a similar vein,

    ‖x_{k+1} − x_*‖²_B = ‖(I − ω B^{−1} Z_k)(x_k − x_*)‖²_B
                       = (x_k − x_*)^⊤ (I − ω Z_k B^{−1}) B (I − ω B^{−1} Z_k)(x_k − x_*)
                       = (x_k − x_*)^⊤ (B − ω(2 − ω) Z_k)(x_k − x_*)         (by (21))
                       = ‖x_k − x_*‖²_B − 2ω(2 − ω) f_{S_k}(x_k),            (52)  (by (22))

Proof of Lemma 44 - II

establishing (48).

Taking expectation in (51) and using the tower property, we get

    E[‖x_{k+1} − x_k‖²_B] = E[E[‖x_{k+1} − x_k‖²_B | x_k]]
                          = 2ω² E[E[f_{S_k}(x_k) | x_k]]                     (by (51))
                          = 2ω² E[f(x_k)],

where in the last step we have used the definition of f.

Taking expectation in (48), we get

    E[‖x_{k+1} − x_*‖²_B] = E[E[‖x_{k+1} − x_*‖²_B | x_k]]
                          = E[‖x_k − x_*‖²_B − 2ω(2 − ω) f(x_k)]             (by (52))
                          = E[‖x_k − x_*‖²_B] − 2ω(2 − ω) E[f(x_k)]. ∎

Quadratic Bounds

Lemma 46 (Quadratic bounds)
For all x ∈ ℝ^n and x_* ∈ L we have

    λ_min^+ · f(x) ≤ (1/2) ‖∇f(x)‖²_B ≤ λ_max · f(x)                         (53)

and

    f(x) ≤ (λ_max/2) ‖x − x_*‖²_B.                                           (54)

Moreover, if Assumption 3 holds, then for all x ∈ ℝ^n and x_* = Π^B_L(x) we have

    (λ_min^+/2) ‖x − x_*‖²_B ≤ f(x).                                         (55)
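A numerical sanity check of (53)–(55) in the simplest setting B = I, where f(x) = ½(x − x_*)^⊤ E[Z] (x − x_*) by (17) and ∇f(x) = E[Z](x − x_*) by (28). Below, W is a made-up stand-in for E[Z], scaled so λ_max ≤ 1, and x − x_* is drawn from Range(W) so that the projection condition of (55) holds.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5

C = rng.standard_normal((n, 3))                 # rank 3, so W has some zero eigenvalues
W = C @ C.T
W /= np.linalg.eigvalsh(W).max()                # normalize so that lam_max = 1

evals = np.linalg.eigvalsh(W)
lam_max = evals.max()
lam_min_plus = evals[evals > 1e-10].min()

for _ in range(5):
    e = W @ rng.standard_normal(n)              # e = x - x_*, lies in Range(W)
    f = 0.5 * e @ W @ e                         # f(x), via (17) with B = I
    g = W @ e                                   # grad f(x), via (28) with B = I
    ok53 = (lam_min_plus * f <= 0.5 * (g @ g) + 1e-12) and \
           (0.5 * (g @ g) <= lam_max * f + 1e-12)
    ok54 = f <= lam_max / 2 * (e @ e) + 1e-12
    ok55 = lam_min_plus / 2 * (e @ e) <= f + 1e-12
    print(ok53, ok54, ok55)                     # all True
```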


Proof of Lemma 46 - I

In view of (17) and (33), we obtain a spectral characterization of f:

    f(x) = (1/2) Σ_{i=1}^n λ_i (u_i^⊤ B^{1/2} (x − x_*))²,                   (56)

where x_* is any point in L. On the other hand, in view of (28) and (33), we have

    ‖∇f(x)‖²_B = ‖B^{−1} E[Z] (x − x_*)‖²_B                                  (57)
               = (x − x_*)^⊤ E[Z] B^{−1} E[Z] (x − x_*)
               = (x − x_*)^⊤ B^{1/2} (B^{−1/2} E[Z] B^{−1/2})(B^{−1/2} E[Z] B^{−1/2}) B^{1/2} (x − x_*)
               = (x − x_*)^⊤ B^{1/2} U (U^⊤ B^{−1/2} E[Z] B^{−1/2} U)² U^⊤ B^{1/2} (x − x_*)
               = (x − x_*)^⊤ B^{1/2} U Λ² U^⊤ B^{1/2} (x − x_*)              (by (33))
               = Σ_{i=1}^n λ_i² (u_i^⊤ B^{1/2} (x − x_*))².                  (58)

Inequality (53) follows by comparing (56) and (58), using the bounds

    λ_min^+ λ_i ≤ λ_i² ≤ λ_max λ_i,

which hold for those i for which λ_i > 0.


Proof of Lemma 46 - II

We now move to the bounds involving norms. First, note that for any x_* ∈ L we have

    f(x) = (1/2) (x − x_*)^⊤ E[Z] (x − x_*)                                  (59)  (by (17))
         = (1/2) (B^{1/2} (x − x_*))^⊤ (B^{−1/2} E[Z] B^{−1/2}) B^{1/2} (x − x_*).

The upper bound (54) follows by applying the inequality B^{−1/2} E[Z] B^{−1/2} ⪯ λ_max I.

If x_* = Π^B_L(x), then in view of (19), we have

    B^{1/2} (x − x_*) ∈ Range(B^{−1/2} A^⊤).

Applying Lemma 45 to (59), we get the lower bound (55).


Theorem 47 (Strong convergence)
Let Assumption 3 (exactness) hold and set x_* = Π^B_L(x_0). Let {x_k} be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < ω < 2, and let r_k := E[‖x_k − x_*‖²_B]. Then for all k ≥ 0 we have

    (1 − ω(2 − ω) λ_max)^k r_0 ≤ r_k ≤ (1 − ω(2 − ω) λ_min^+)^k r_0.         (60)

The best rate is achieved when ω = 1, as the short calculation below shows.

Proof.
Let φ_k = E[f(x_k)]. We have

    r_{k+1} = r_k − 2ω(2 − ω) φ_k ≤ r_k − ω(2 − ω) λ_min^+ r_k               (by (49) and (55))

and

    r_{k+1} = r_k − 2ω(2 − ω) φ_k ≥ r_k − ω(2 − ω) λ_max r_k                 (by (49) and (54)).

Inequalities (60) follow from this by unrolling the recurrences. ∎
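Why ω = 1 is optimal in (60): since ω(2 − ω) = 1 − (1 − ω)², the upper rate can be rewritten as

    1 − ω(2 − ω) λ_min^+ = (1 − λ_min^+) + (1 − ω)² λ_min^+ ≥ 1 − λ_min^+,

with equality if and only if ω = 1.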

Convergence of f(x_k)


Theorem 48 (Convergence of f)
Choose x_0 ∈ ℝ^n, and let {x_k}_{k=0}^∞ be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < ω < 2.

(i) Let x_* ∈ L. The average iterate x̂_k := (1/k) Σ_{t=0}^{k−1} x_t satisfies, for all k ≥ 1,

    E[f(x̂_k)] ≤ ‖x_0 − x_*‖²_B / (2ω(2 − ω) k).                              (61)

(ii) Now let Assumption 3 hold. For x_* = Π^B_L(x_0) and k ≥ 0 we have

    E[f(x_k)] ≤ (1 − ω(2 − ω) λ_min^+)^k (λ_max/2) ‖x_0 − x_*‖²_B.           (62)

The best rate is achieved when ω = 1.
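Bound (62) can be observed empirically on the same randomized Kaczmarz instance of Algorithm 2 used earlier (B = I, uniform row sampling); here E[Z] = (1/m) Σ_i a_i a_i^⊤/‖a_i‖² and f(x) = (1/m) Σ_i (a_i^⊤ x − b_i)²/(2‖a_i‖²). With m > n and a generic A, the system has a unique solution, so Π_L(x_0) is that solution. All sizes and the run count are arbitrary; the comparison holds up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, omega, K, runs = 8, 4, 1.0, 60, 2000

A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)                   # the unique solution: Pi_L(x_0)
b = A @ x_star
x0 = rng.standard_normal(n)

W = sum(np.outer(a, a) / (a @ a) for a in A) / m  # E[Z] for this instance (B = I)
evals = np.linalg.eigvalsh(W)
lam_max = evals.max()
lam_min_plus = evals[evals > 1e-10].min()

row_sq = np.sum(A ** 2, axis=1)
def f(x):                                         # f(x) = E[f_S(x)]
    return np.mean((A @ x - b) ** 2 / (2 * row_sq))

errs = np.zeros(K + 1)
for _ in range(runs):
    x = x0.copy()
    for k in range(K):
        i = rng.integers(m)
        a = A[i]
        x = x - omega * (a @ x - b[i]) / (a @ a) * a
        errs[k + 1] += f(x)
errs /= runs

dist0 = np.sum((x0 - x_star) ** 2)
for k in [10, 30, 60]:
    bound = (1 - omega * (2 - omega) * lam_min_plus) ** k * lam_max / 2 * dist0
    print(k, errs[k], bound)                      # estimated E[f(x_k)] stays below (62)
```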

Proof of Theorem 48

(i) Let φ_k = E[f(x_k)] and r_k = E[‖x_k − x_*‖²_B]. By summing up the identities from (49), we get

    2ω(2 − ω) Σ_{t=0}^{k−1} φ_t = r_0 − r_k.

Therefore, using Jensen's inequality, we get

    E[f(x̂_k)] ≤ E[(1/k) Σ_{t=0}^{k−1} f(x_t)] = (1/k) Σ_{t=0}^{k−1} φ_t = (r_0 − r_k) / (2ω(2 − ω) k) ≤ r_0 / (2ω(2 − ω) k).

(ii) Combining inequality (54) with Theorem 47, we get

    E[f(x_k)] ≤ (λ_max/2) E[‖x_k − x_*‖²_B] ≤ (1 − ω(2 − ω) λ_min^+)^k (λ_max/2) ‖x_0 − x_*‖²_B,   (by (60))

as claimed. ∎


