
Applied Analysis B, week 21 Witold Sadowski

45 Maxima and minima of quadratic forms with constraints


The method of Lagrange multipliers has many applications. In the next couple of sections we
will take a closer look at one of the applications that is particularly relevant to Data Science.
But before we do that let’s have a look at an example that will play an important role in further
investigations.

Example 45.1. Consider a function:

F(x, y, z) = αx² + βy² + γz²

where 0 < α < β < γ. Find the greatest and the smallest value of F on the sphere

S = {(x, y, z) ∈ R³ : x² + y² + z² = 1}.

Solution. The level sets of F are ellipsoids in R³. So, from the geometrical point of view, we look for points where these ellipsoids are tangent to the unit sphere. We will use the method of Lagrange multipliers. To this end we define the function:
g(x, y, z) = x² + y² + z² − 1.
Then the unit sphere S is the zero level set of g. We have

∇F(x, y, z) = (2αx, 2βy, 2γz)

∇g(x, y, z) = (2x, 2y, 2z).

Therefore we want to solve the following system of equations:




2αx = 2λx
2βy = 2λy
2γz = 2λz
x² + y² + z² = 1

We can see that the first three equations can be written in the form:

Qx = λx,

where

Q = [ α  0  0 ]            [ x ]
    [ 0  β  0 ]    and x = [ y ]
    [ 0  0  γ ]            [ z ]
So this is a problem of finding eigenvalues and eigenvectors. Since the matrix Q is diagonal we solve this system quickly and combine it with the condition x² + y² + z² = 1. We obtain three cases:

Case 1. λ = α and (x, y, z) = (±1, 0, 0).
Case 2. λ = β and (x, y, z) = (0, ±1, 0).
Case 3. λ = γ and (x, y, z) = (0, 0, ±1).
We now need to check the value of the function F at the six points "under suspicion":

F(±1, 0, 0) = α, F(0, ±1, 0) = β, F(0, 0, ±1) = γ.

Hence the smallest value of F on S is equal to α, and the greatest value is equal to γ. The six points that we needed to consider can be easily identified on the sphere.
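
As a quick numerical check of this example, one can sample the sphere and compare the extreme values of F with the eigenvalues of Q (a minimal sketch, assuming NumPy is available; the values α = 1, β = 2, γ = 3 are illustrative, not part of the example):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 1.0, 2.0, 3.0           # illustrative values with 0 < alpha < beta < gamma
Q = np.diag([alpha, beta, gamma])

# Draw many random points on the unit sphere S.
p = rng.normal(size=(100_000, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)

# F(x, y, z) = alpha*x^2 + beta*y^2 + gamma*z^2 = p . (Q p) for each row p.
values = np.einsum('ij,jk,ik->i', p, Q, p)
print(values.min(), values.max())            # approximately alpha and gamma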

46 Dimension reduction algorithm


One of the central problems in Data Science is the question of finding a low-dimensional space
in which we can store information about some large set of points that initially belong to the
space Rᴺ, where N is very large. This problem is called "dimension reduction". We will study
it first in the simplest case, when N = 2 (so not very large...) and the "low-dimensional
space" is spanned by just one vector. The main idea is simple: given a set X of points in R²
(with zero mean), we want to find a line l: y = αx that maximises the "variance" (or dispersion
around the mean) of the orthogonal projection of the set X onto the line l.

In other words, if X = {(xᵢ, yᵢ) : i = 1, 2, ..., n} and P X = {(x̃ᵢ, ỹᵢ) : i = 1, 2, ..., n}, where
(x̃ᵢ, ỹᵢ), i = 1, 2, ..., n, are the orthogonal projections of the elements of X onto the line l, then
we want to maximise the quantity:

(1/n) Σᵢ₌₁ⁿ |(x̃ᵢ, ỹᵢ)|².

Let u = (u₁, u₂) be a unit vector in the direction of the line l. From elementary results in linear
algebra we have

(x̃ᵢ, ỹᵢ) = ((xᵢ, yᵢ) · (u₁, u₂)) u.
Hence, we want to maximise the function

F(u₁, u₂) = (1/n) Σᵢ₌₁ⁿ |((xᵢ, yᵢ) · (u₁, u₂)) u|² = (1/n) Σᵢ₌₁ⁿ ((xᵢ, yᵢ) · (u₁, u₂))².
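
This quantity is easy to compute directly for any data set (a minimal sketch, assuming NumPy; the centred random array X and the angle of u below are purely illustrative):

import numpy as np

def F(u, X):
    # Average squared length of the orthogonal projections of the rows of X
    # onto the line spanned by the unit vector u: (1/n) * sum_i ((x_i, y_i) . u)^2.
    return np.mean((X @ u) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X -= X.mean(axis=0)                          # enforce the zero-mean assumption

theta = np.deg2rad(30.0)
u = np.array([np.cos(theta), np.sin(theta)]) # a unit vector along the line l
print(F(u, X))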

Example 46.1. Consider the following four points:
(x₁, y₁) = (−√3, 0), (x₂, y₂) = (√3, 0), (x₃, y₃) = (−1, −2), (x₄, y₄) = (1, 2).

Find a unit vector u = (u₁, u₂) which maximises the function:

F(u) = (1/4) Σᵢ₌₁⁴ ((xᵢ, yᵢ) · u)².

Solution. We need to maximise the function

F(u₁, u₂) = (1/4)[(−√3 u₁ + 0)² + (√3 u₁ + 0)² + (−u₁ − 2u₂)² + (u₁ + 2u₂)²]

subject to the condition |u| = 1. We notice that

F(u₁, u₂) = (1/4)(3u₁² + 3u₁² + u₁² + 4u₁u₂ + 4u₂² + u₁² + 4u₁u₂ + 4u₂²) = 2u₁² + 2u₁u₂ + 2u₂².
Note that:

F(u₁, u₂) = (u₁, u₂) [ 2  1 ] [ u₁ ]
                     [ 1  2 ] [ u₂ ]

or in a more compact form:

F(u) = uᵀSu,

where

S = [ 2  1 ]
    [ 1  2 ]
The gradient of F is given by

∇F(u₁, u₂) = (4u₁ + 2u₂, 4u₂ + 2u₁).

This again can be written in a simpler form as

∇F(u) = 2Su.

The constraint |u| = 1 can be expressed as

g(u) = 1, where g(u) = u₁² + u₂².

The gradient of g, which we have calculated many times before, can then be expressed as follows:

∇g(u) = 2(u₁, u₂) = 2u.

The condition that the gradient of F is parallel to the gradient of g is expressed as:

2Su = 2λu.

Hence the method of Lagrange multipliers requires us to solve the system of equations:

Su = λu
|u| = 1

We now find the eigenvalues of

S = [ 2  1 ]
    [ 1  2 ]

We have

det(S − λI) = 0 ⇔ (2 − λ)(2 − λ) − 1 = 0.
We solve this quadratic equation and get λ = 1 or λ = 3. If λ = 1 then the eigenvectors are
given by

[ 2  1 ] [ u₁ ]   [ u₁ ]
[ 1  2 ] [ u₂ ] = [ u₂ ],
so we get (u₁, u₂) = α(1, −1). A unit vector in this direction is thus given by

w = (−1, 1)/√2.
We note that
F(w) = 1.
When λ = 3 then, using a similar technique as above, we find that the corresponding eigenvector
is given by (u₁, u₂) = α(1, 1). The unit vector in this direction is given by

v = (1, 1)/√2.
We note that

F(v) = 3.

Hence the greatest value of F on the unit circle is 3, and the smallest is equal to 1.
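
The hand computation can be verified in a few lines of code (a sketch, assuming NumPy; np.linalg.eigh is the eigensolver for symmetric matrices and returns the eigenvalues in ascending order):

import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(S)
print(eigenvalues)                 # [1. 3.]
print(eigenvectors)                # columns: w = (-1, 1)/sqrt(2) and v = (1, 1)/sqrt(2), up to sign

# F attains its extreme values on the unit circle at the eigenvectors:
for k in range(2):
    u = eigenvectors[:, k]
    print(u @ S @ u)               # 1.0 for lambda = 1, then 3.0 for lambda = 3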

47 Diagonalisation of symmetric matrices


In the last example the greatest value of a quadratic function turned out to be the greatest
eigenvalue of the corresponding matrix. This was not by accident.
Suppose that a quadratic function is given by a symmetric matrix S:

F(u) = uᵀSu.

Let w and v be unit eigenvectors corresponding to the eigenvalues λ₁ and λ₂, respectively,
where 0 < λ₁ < λ₂. The vectors w and v are orthogonal:

w · v = 0

(according to the Theorem below). Hence if we write a vector u as a linear combination
of w and v we obtain:

F(αw + βv) = (αw + βv) · (S(αw + βv))
           = (αw + βv) · (λ₁αw + λ₂βv)
           = λ₁α² (w · w) + λ₁αβ (v · w) + λ₂αβ (v · w) + λ₂β² (v · v)
           = λ₁α² + λ₂β².

Since the vectors v and w are orthonormal, it follows that the whole problem can now be written
as follows:

Maximise λ₁α² + λ₂β² subject to the condition α² + β² = 1.

Following the ideas from Example 45.1:
• the greatest value of F on the unit circle is attained when α = 0, β = ±1. We then have F(v) = λ₂.
• the smallest value is obtained when α = ±1 and β = 0. We then have F(w) = λ₁.
Therefore, the greatest value of F on the unit circle is λ₂ and the smallest is λ₁.
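One can confirm this conclusion without Lagrange multipliers: parametrising the unit circle by α = cos θ, β = sin θ gives

F = λ₁cos²θ + λ₂sin²θ = λ₁ + (λ₂ − λ₁)sin²θ,

which ranges exactly over the interval [λ₁, λ₂] as θ varies.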
Does it always work so nicely? The following theorem tells us that it does.
Theorem 47.1. Let Q be a symmetric real matrix that is positive definite. Then
• all eigenvalues of Q are positive real numbers,
• there exists an orthonormal basis that consists of eigenvectors of Q,
• Q is diagonalisable.
Proof. We will show this theorem in the simplest case of n = 2. Consider

Q = [ a  b ]
    [ b  c ]

We have

det(Q − λI) = det [ a − λ    b   ] = (a − λ)(c − λ) − b².
                  [   b    c − λ ]
We look for λ such that

(a − λ)(c − λ) − b² = 0.

That is,

λ² − (a + c)λ + ac − b² = 0.

Hence

∆ = (a + c)² − 4ac + 4b² = (a − c)² + 4b².
If a = c and b = 0 then we get λ = a = c. This is the situation when

Q = [ a  0 ]
    [ 0  a ]

In that case any orthonormal basis consists of eigenvectors of Q.
If a ≠ c or if b ≠ 0 then ∆ > 0 and we obtain two solutions:

λ₁ = (a + c − √∆)/2 and λ₂ = (a + c + √∆)/2.
Clearly, these eigenvalues are real and λ₁ ≠ λ₂. Since we assumed that Q is positive definite, we
have λ₁ > 0 and λ₂ > 0.
Now we need to show that the corresponding eigenvectors are orthogonal. Let w correspond to
λ₁ and v correspond to λ₂:

Qw = λ₁w and Qv = λ₂v.

Since Q is symmetric, it follows that

Qw · v = w · Qv.

Hence

λ₁ w · v = λ₂ w · v.

Therefore, using the fact that λ₁ ≠ λ₂, we have

w · v = 0.
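
The three claims of the theorem are also easy to test numerically on a random example (a sketch, assuming NumPy; positive definiteness is forced here by shifting the spectrum with a multiple of the identity):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Q = A + A.T                                   # symmetric, but not necessarily positive definite
Q += (abs(np.linalg.eigvalsh(Q).min()) + 1.0) * np.eye(4)   # shift: now positive definite

eigenvalues, V = np.linalg.eigh(Q)
print(eigenvalues)                            # all positive real numbers
print(np.round(V.T @ V, 10))                  # identity matrix: the eigenvectors are orthonormal
print(np.round(V.T @ Q @ V, 10))              # diagonal matrix of eigenvalues: Q is diagonalised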

48 Dimension reduction algorithm II
The results of the previous sections can be summarised in the following algorithm, which can be
easily generalised.
Consider the set X of n data points in R²:

X = {(xᵢ, yᵢ) : i = 1, 2, ..., n}.

We assume that the mean values of the xᵢ and of the yᵢ are zero and that the variance of
each coordinate is 1.
For a given line l: y = αx, we consider the orthogonal projection of X onto the line l:

P(X) = {((xᵢ, yᵢ) · (u₁, u₂)) u : i = 1, 2, ..., n},

where u is a unit vector parallel to l. We want to find a line y = αx that, roughly speaking,
maximises "the variance" of P X on the line l.
STEP 1. We write the problem in the following form: maximise the function

F(u) = (1/n) Σᵢ₌₁ⁿ ((xᵢ, yᵢ) · (u₁, u₂))² = u · (Su),

where

S = [ s₁₁  s₁₂ ]
    [ s₂₁  s₂₂ ]

and

s₁₁ = (1/n) Σᵢ₌₁ⁿ xᵢ²,  s₁₂ = s₂₁ = (1/n) Σᵢ₌₁ⁿ xᵢyᵢ,  s₂₂ = (1/n) Σᵢ₌₁ⁿ yᵢ².
 
Note that in the previous section we considered

S = [ 2  1 ]
    [ 1  2 ]
STEP 2. We find two orthonormal eigenvectors and the two corresponding eigenvalues of S:

Sw = λ₁w, Sv = λ₂v, v ⊥ w, |v| = |w| = 1, 0 ≤ λ₁ ≤ λ₂.

Note that in the previous example we found the following eigenvectors and eigenvalues:
• w = (−1, 1)/√2, corresponding to the eigenvalue λ₁ = 1, and
• v = (1, 1)/√2, corresponding to the eigenvalue λ₂ = 3.
STEP 3. In the basis w, v the function F can be expressed in a very simple form:

F(αw + βv) = λ₁α² + λ₂β².

It follows that the smallest value of F on the circle |u| = 1 is equal to λ₁ and is attained in the
direction ±w, while the greatest value of F is equal to λ₂ and is attained in the direction ±v.
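
The whole algorithm fits in a few lines (a minimal sketch, assuming NumPy; the data of Example 46.1 is reused so the output can be checked against the computations above):

import numpy as np

# Data points of Example 46.1 as rows (x_i, y_i); they already have zero mean.
X = np.array([[-np.sqrt(3), 0.0],
              [ np.sqrt(3), 0.0],
              [-1.0, -2.0],
              [ 1.0,  2.0]])

# STEP 1: the symmetric matrix S with entries s_jk = (1/n) * sum_i X_ij * X_ik.
S = (X.T @ X) / len(X)                        # here S = [[2, 1], [1, 2]]

# STEP 2: orthonormal eigenvectors and eigenvalues, in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(S)

# STEP 3: the direction of maximal "variance" is the eigenvector of the
# greatest eigenvalue.
v = eigenvectors[:, -1]                       # here (1, 1)/sqrt(2), with lambda_2 = 3
projections = (X @ v)[:, None] * v            # orthogonal projections of X onto the line
print(eigenvalues, v, projections, sep="\n")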

Remark. The algorithm can be easily generalised. For example, consider a set

X = {(xᵢ, yᵢ, zᵢ) ∈ R³ : i = 1, 2, ..., n}.

The one-dimensional space which maximises the "variance" of the projected elements of X is obtained
as follows. First we define the symmetric matrix S with elements:

s₁₁ = (1/n) Σᵢ₌₁ⁿ xᵢ²,  s₂₂ = (1/n) Σᵢ₌₁ⁿ yᵢ²,  s₃₃ = (1/n) Σᵢ₌₁ⁿ zᵢ²,

s₁₂ = s₂₁ = (1/n) Σᵢ₌₁ⁿ xᵢyᵢ,  s₁₃ = s₃₁ = (1/n) Σᵢ₌₁ⁿ xᵢzᵢ,  s₂₃ = s₃₂ = (1/n) Σᵢ₌₁ⁿ yᵢzᵢ.

Then we find the eigenvectors w, v, r corresponding to the eigenvalues 0 ≤ λ₁ ≤ λ₂ ≤ λ₃. The
one-dimensional space containing "most information" about the set X is spanned by the eigenvector r
corresponding to the greatest eigenvalue λ₃. The two-dimensional space containing "most information"
about the set X is spanned by r and v.
Further generalisations are relatively straightforward.
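
For instance, a sketch of the general N-dimensional version in the same spirit (assuming NumPy; the function name reduce_dimension, the random data X and the target dimension k are all illustrative):

import numpy as np

def reduce_dimension(X, k):
    # Project the centred data X (n rows in R^N) onto the k-dimensional space
    # spanned by the eigenvectors of S with the k greatest eigenvalues.
    S = (X.T @ X) / len(X)                        # symmetric N x N matrix as in the Remark
    eigenvalues, eigenvectors = np.linalg.eigh(S) # ascending order of eigenvalues
    basis = eigenvectors[:, -k:]                  # eigenvectors of the k greatest eigenvalues
    return X @ basis                              # coordinates of the projections in that basis

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)                               # enforce the zero-mean assumption
print(reduce_dimension(X, 2).shape)               # (200, 2)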

