
Lecture 1: Introduction to optimization

DSA5103 Optimization algorithms for data modelling

Yangjing Zhang
15-Aug-2023
NUS
Contacts

• Lecturer: Yangjing Zhang


• Email: yj.zhang [AT] nus.edu.sg
• Office: S17-05-16
• Feel free to contact me through email with a proper salutation:
. Dr. Zhang (✓ formal)
. Yangjing (✓ informal)
. Hi/Hi Professor (✗)
• Programming tools: I will use Matlab for demonstration. Feel free to
use Python/R/Julia

lecture1 1/59
About

Probably will not cover deep learning, reinforcement learning ...

lecture1 2/59
Assessment

Final grade based on:

• Final exam: 40%


. Interpretation, computation
. Implementation of some algorithms
. No proof required
• Homework (4 sets): 60%

lecture1 3/59
Today’s content¹

• Introduction
1. Motivating examples
2. Graphical illustration
3. General nonlinear programming
4. Basic calculus and linear algebra
5. Necessary/Sufficient optimality conditions
6. Convex sets/functions
• Principal component analysis
1. Recap linear algebra (eigenvalues/eigenvectors) and statistics
(mean/covariance)
2. PCA

¹ Most of the content is adapted from [1]


lecture1 4/59
Motivating examples
Simple examples

Example 1
If K units of capital and L units of labor are used, a company can
produce KL units of a product. Capital can be purchased at $4 per
unit and labor can be purchased at $1 per unit. A total of $8000 is
available to purchase capital and labor. How can the firm maximize the
quantity of the product manufactured?

Solution. Let K = units of capital purchased and L = units of labor purchased. The problem to solve is

maximize   KL
subject to 4K + L ≤ 8000, K ≥ 0, L ≥ 0.
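
As a quick numerical sanity check, one could solve this with fmincon (a minimal sketch, assuming the Optimization Toolbox is available; the starting point [1; 1] is an arbitrary choice):

obj = @(x) -x(1)*x(2);            % maximize KL  <=>  minimize -KL, with x = (K; L)
A = [4 1]; b = 8000;              % budget constraint 4K + L <= 8000
lb = [0; 0];                      % K >= 0, L >= 0
x0 = [1; 1];                      % arbitrary starting point
x = fmincon(obj, x0, A, b, [], [], lb, []);
% should recover approximately K = 1000, L = 4000 (product 4,000,000)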

lecture1 5/59
Simple examples

Example 2
It costs a company $c to produce a unit of a product. If the company
charges $p per unit, and the customers demand D(p) units, what price
should the company charge to maximize its profit?

Solution. The firm’s decision variable is p, and the profit is (p − c)D(p). The problem to solve is

maximize (p − c)D(p)
subject to p ≥ 0.

lecture1 6/59
Linear regression

• Predict the price for a house with area 3500

Figure 1: Housing prices data² (scatter plot of price against area)

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

lecture1 7/59
Linear regression

area   #bedrooms   #bathrooms   stories   ···   price
7420   4           2            2         ···   13300000
8960   4           4            4         ···   12250000
...
• Predictor/feature vector a1 = (7420, 4, 2, 2, . . . )T ∈ Rp


• Response/target b1 = 13300000
• p: #features. n: #samples
• Fit a linear model b = xT a + α

lecture1 8/59
Linear regression

• Given data (ai , bi ), i = 1, . . . , n, where ai ∈ Rp is the predictor vector and bi ∈ R is the response.
• Fit a linear model b = xT a + α.
• The goal is to find x ∈ Rp and α ∈ R by solving

   minimize_{x,α}  (1/2) Σ_{i=1}^{n} (bi − xT ai − α)²
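
A minimal Matlab sketch of this least-squares fit on synthetic data (the numbers below are made up for illustration; only base Matlab is used):

n = 100; p = 3;
A = randn(n, p);                            % predictor vectors a_i as rows
xtrue = [1; -2; 0.5]; alphatrue = 3;        % hypothetical true model
b = A*xtrue + alphatrue + 0.1*randn(n, 1);  % noisy responses
theta = [A, ones(n, 1)] \ b;                % least-squares estimate of (x; alpha)
x = theta(1:p), alpha = theta(p+1)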

lecture1 9/59
Optimal transportation (OT)
Example: OT
s supplies a1 , . . . , as of goods must be transported to meet t demands b1 , . . . , bt of customers. The cost of transporting one unit of the ith good to the jth customer is cij . What is the optimal transportation plan (xij units of the ith good transported to the jth customer) that minimizes the cost?

. For example, the cost cij may be determined by the distance
. A feasible transportation plan:
   ( 0.1  0.4  0 ; 0.1  0.2  0.2 )
. [Exercise] Construct another feasible transportation plan
lecture1 10/59
Optimal transportation (OT)

Cost: Σij cij xij

Constraints: supply constraints (each supply ai must be shipped out) and demand constraints (each demand bj must be met)
lecture1 11/59
Optimal transportation (OT)

Solution. Suppose xij units of the ith good are transported to the jth customer. The cost is Σij cij xij . The problem to solve is

minimize    Σ_{i=1}^{s} Σ_{j=1}^{t} cij xij
subject to  Σ_{j=1}^{t} xij = ai ,   i ∈ [s] = {1, . . . , s}
            Σ_{i=1}^{s} xij = bj ,   j ∈ [t]
            xij ≥ 0,   i ∈ [s], j ∈ [t].

• Such a problem arises in computing the Wasserstein distance (a hot topic in machine learning and statistics), which measures the similarity of two discrete probability distributions.
• In this case, Σ_{i=1}^{s} ai = 1 and Σ_{j=1}^{t} bj = 1, so a and b correspond to two discrete probability distributions.
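
A minimal Matlab sketch solving this LP with linprog (assumes the Optimization Toolbox; the cost matrix below is a hypothetical example):

s = 2; t = 3;
a = [0.5; 0.5];                         % supplies (sum to 1)
b = [0.2; 0.6; 0.2];                    % demands (sum to 1)
C = [1 2 3; 4 5 6];                     % hypothetical costs c_ij
Aeq = [kron(ones(1,t), eye(s));         % row sums of the plan equal a_i
       kron(eye(t), ones(1,s))];        % column sums of the plan equal b_j
beq = [a; b];
x = linprog(C(:), [], [], Aeq, beq, zeros(s*t,1), []);
X = reshape(x, s, t)                    % optimal transportation plan x_ij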
lecture1 12/59
Graphical illustration
Graphical illustration in 1-dimension


maximize f (x) = x√(24 − x)

(Figure: graph of f (x))

x = 16 achieves the maximal value 16√(24 − 16)
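
A minimal check with base Matlab's fminbnd (a minimizer, so we negate f; this assumes the objective is f (x) = x√(24 − x) as written above):

f = @(x) x .* sqrt(24 - x);
[xstar, fneg] = fminbnd(@(x) -f(x), 0, 24);
xstar          % approximately 16
fmax = -fneg   % approximately 16*sqrt(8)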
lecture1 13/59
Graphical illustration in 2-dimension

minimize    f (x) = (x1 − 4)² + (x2 − 6)²
subject to  x1 ≤ 4
            x2 ≤ 6
            3x1 + 2x2 ≤ 18
            x1 ≥ 0
            x2 ≥ 0

• Note that (x1 − 4)² + (x2 − 6)² is the squared distance between the points (x1 , x2 )T and (4, 6)T
• Graphically, we want to find the feasible point (x1 , x2 )T closest to (4, 6)T

lecture1 14/59
Graphical illustration in 2-dimension

minimize    (x1 − 4)² + (x2 − 6)²
subject to  x1 ≤ 4
            x2 ≤ 6
            3x1 + 2x2 ≤ 18
            x1 ≥ 0
            x2 ≥ 0

(Figure: feasible region and the point (4, 6)T in the (x1 , x2 )-plane)

(x1 , x2 ) = (34/13, 66/13)T achieves the minimal value (34/13 − 4)² + (66/13 − 6)²
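
Since the objective is quadratic and the constraints are linear, one can also check this numerically with quadprog (a minimal sketch, assuming the Optimization Toolbox):

% (x1-4)^2 + (x2-6)^2 = 0.5*x'*(2*eye(2))*x - 8*x1 - 12*x2 + constant
H = 2*eye(2); f = [-8; -12];
A = [1 0; 0 1; 3 2]; b = [4; 6; 18];   % x1 <= 4, x2 <= 6, 3x1 + 2x2 <= 18
lb = [0; 0];
x = quadprog(H, f, A, b, [], [], lb, [])   % approximately (34/13, 66/13) = (2.6154, 5.0769)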

lecture1 15/59
Graphical illustration

• Good for intuition
• Gives solutions visually
• Only works in 1 or 2 dimensions
• For higher-dimensional problems, one has to rely on algebraic conditions to find (as well as to verify) the solutions

lecture1 16/59
General nonlinear programming
Nonlinear programming

A general nonlinear programming problem (NLP) is

minimize (or maximize)  f (x)
subject to              gi (x) = 0,  i ∈ [m]
                        hj (x) ≤ 0,  j ∈ [p]

• f , gi , and hj are (possibly nonlinear) functions of the variable x = (x1 , . . . , xn )T ∈ Rn
• f is the objective function, each gi (x) = 0 is an equality constraint, and each hj (x) ≤ 0 is an inequality constraint
• It suffices to discuss minimization problems since

   minimize f (x)  ⇐⇒  maximize − f (x)

lecture1 17/59
Terminology and notation

• Vectors in Rn will always be column vectors, unless otherwise


specified
. row vector (x1 , x2 , x3 )
. column vector (x1 ; x2 ; x3 ) or (x1 , x2 , x3 )T
• The feasible set for an NLP is the set

S = {x ∈ Rn | g1 (x) = 0, . . . , gm (x) = 0, h1 (x) ≤ 0, . . . , hp (x) ≤ 0}

• A point in the feasible set is a feasible solution or feasible point;


otherwise, it is an infeasible solution or infeasible point
• When there is no constraint, S = Rn , we say the NLP is
unconstrained

lecture1 18/59
Local/global minimizer

minimize    f (x) = (sin(πx))² eˣ − x²
subject to  0 ≤ x ≤ 3.5

(Figure: graph of f on [0, 3.5], marking a local minimizer and the global minimizer)
• Local/global minimizer may not be unique
• A global minimizer is a local minimizer. However, the converse is not true in general

lecture1 19/59
Local/global minimizer

Definition (Local minimizer and global minimizer)
Let S be the feasible set. Define Bε(y) = {x ∈ Rn | ‖x − y‖ < ε} to be the open ball with center y and radius ε. (‖x‖ = √(x1² + · · · + xn²))
1. A point x∗ ∈ S is said to be a local minimizer of f if there exists ε > 0 such that

   f (x∗) ≤ f (x)   ∀ x ∈ S ∩ Bε(x∗)

2. A point x∗ ∈ S is said to be a global minimizer of f if

   f (x∗) ≤ f (x)   ∀ x ∈ S

lecture1 20/59
Basic calculus and linear algebra
Interior point

Definition (Interior point)
Let S ⊆ Rn be a nonempty set. A point x ∈ S is called an interior point of S if there exists ε > 0 such that

   Bε(x) ⊆ S

Figure 2: The point x is an interior point of S. The point y is on the boundary of S. Image from internet
lecture1 21/59
Gradient vector

Definition (Gradient vector)
Let S ⊆ Rn be a nonempty set. Suppose f : S → R, and x is an interior point of S such that f is differentiable at x. Then the gradient vector of f at x is

   ∇f (x) = ( ∂f/∂x1 (x); . . . ; ∂f/∂xn (x) )

• At x∗, the value f (x) decreases most rapidly along the direction −∇f (x∗). On the other hand, it increases most rapidly along the direction ∇f (x∗).

lecture1 22/59
Hessian matrix

Definition (Hessian matrix)
Let S ⊆ Rn be a nonempty set. Suppose f : S → R, and x is an interior point of S such that f has second order partial derivatives at x. Then the Hessian of f at x is the n × n matrix

   Hf (x) = ( ∂²f/∂x1∂x1 (x)   · · ·   ∂²f/∂x1∂xn (x)
                    ⋮            ⋱            ⋮
              ∂²f/∂xn∂x1 (x)   · · ·   ∂²f/∂xn∂xn (x) )

• The ij-entry of Hf (x) is ∂²f/∂xi∂xj (x). In general, Hf (x) is not symmetric. However, if f has continuous second order derivatives, then the Hessian matrix is symmetric.

lecture1 23/59
Example

Compute the gradient and Hessian
(1) f (x) = x³ + 2x + eˣ, x ∈ R.
(2) f (x) = x1³ + 2x1x2 + x2², x = (x1 ; x2 ) ∈ R².

Solution (1).

   ∇f (x) = f′(x) = 3x² + 2 + eˣ,   Hf (x) = f″(x) = 6x + eˣ.

Solution (2).

   ∇f (x) = ( ∂f/∂x1 (x); ∂f/∂x2 (x) ) = ( 3x1² + 2x2; 2x1 + 2x2 )

   Hf (x) = ( ∂²f/∂x1∂x1 (x)  ∂²f/∂x1∂x2 (x) ; ∂²f/∂x2∂x1 (x)  ∂²f/∂x2∂x2 (x) ) = ( 6x1  2 ; 2  2 )
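
These derivatives can be double-checked symbolically (a minimal sketch, assuming the Symbolic Math Toolbox is available):

syms x1 x2
f = x1^3 + 2*x1*x2 + x2^2;
g = gradient(f, [x1 x2])   % returns [3*x1^2 + 2*x2; 2*x1 + 2*x2]
H = hessian(f, [x1 x2])    % returns [6*x1, 2; 2, 2]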

lecture1 24/59
Positive (semi)definite matrices

Definition (Positive (semi)definite)


Let A be a real n × n matrix.

(1) A is said to be positive semidefinite if xT Ax ≥ 0, ∀ x ∈ Rn .
(2) A is said to be positive definite if xT Ax > 0, ∀ x ≠ 0.
(3) A is said to be negative semidefinite if −A is positive semidefinite.
(4) A is said to be negative definite if −A is positive definite.
(5) A is said to be indefinite if A is neither positive semidefinite nor negative semidefinite.

lecture1 25/59
Positive (semi)definite matrices

Checking whether a matrix is positive semidefinite directly from the definition is not easy in general. The following theorem is useful for checking the definiteness of a matrix.
Theorem (Eigenvalue test)
Let A be a real symmetric n × n matrix.

(1) A is positive semidefinite if and only if every eigenvalue of A is


nonnegative.
(2) A is positive definite if and only if every eigenvalue of A is positive.
(3) A is negative semidefinite if and only if every eigenvalue of A is
nonpositive.
(4) A is negative definite if and only if every eigenvalue of A is negative.
(5) A is indefinite if and only if it has both a positive eigenvalue and a
negative eigenvalue.
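
In Matlab, the eigenvalue test can be applied directly with eig (a minimal sketch; the matrix below is a hypothetical example):

A = [2 1; 1 2];            % a symmetric test matrix
lambda = eig(A);           % eigenvalues (here 1 and 3)
isPSD = all(lambda >= 0)   % true: A is positive semidefinite
isPD  = all(lambda > 0)    % true: A is positive definite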

lecture1 26/59
Necessary/Sufficient optimality conditions
Unconstrained NLP

Consider an unconstrained NLP

minimize f (x)

Assume that f : Rn → R is nonlinear and differentiable. A point x∗ is


called a stationary point of f if ∇f (x∗ ) = 0.
Next, we investigate some optimality conditions for a local minimizer

• Necessary (optimality) condition


• Sufficient (optimality) condition

lecture1 27/59
Necessary/sufficient condition

Necessary condition If x∗ is a local minimizer of f , then


(1) x∗ is a stationary point, i.e., ∇f (x∗ ) = 0
(2) Hf (x∗ ) is positive semidefinite

• allows us to confine our search for global minimizers to the set of stationary points

Sufficient condition If
(1) x∗ is a stationary point, i.e., ∇f (x∗ ) = 0
(2) Hf (x∗ ) is positive definite,
then x∗ is a local minimizer of f .

• allows us to verify that a point is indeed a local minimizer

lecture1 28/59
Example
Example
Let f (x) = x1² + x2² − 2x2, x = (x1 ; x2 ) ∈ R².
(1) Find a stationary point of f .
(2) Verify that the stationary point is a local minimizer (by checking the sufficient condition).

Solution. We let

   ∇f (x∗) = ( ∂f/∂x1 (x∗); ∂f/∂x2 (x∗) ) = ( 2x1∗; 2x2∗ − 2 ) = 0   ⇒   x∗ = ( 0; 1 )

Thus x∗ = (0, 1)T is a stationary point of f . Next, we verify that

   Hf (x∗) = ( ∂²f/∂x1∂x1 (x∗)  ∂²f/∂x1∂x2 (x∗) ; ∂²f/∂x2∂x1 (x∗)  ∂²f/∂x2∂x2 (x∗) ) = ( 2  0 ; 0  2 )

is positive definite (since the eigenvalues of Hf (x∗) are 2 and 2, which are positive).
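
As a quick numerical check, fminunc (a minimal sketch, assuming the Optimization Toolbox; the starting point [3; 3] is an arbitrary choice) converges to this minimizer:

f = @(x) x(1)^2 + x(2)^2 - 2*x(2);
xstar = fminunc(f, [3; 3])   % approximately (0; 1)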
lecture1 29/59
Example
Example
Let f (x) = x1²x2 + x1x2³ − x1x2, x = (x1 ; x2 ) ∈ R².
(1) Verify that x = (0; 1) is a stationary point of f .
(2) Is x = (0; 1) a local minimizer?

Solution.

   ∇f (x) = ( 2x1x2 + x2³ − x2; x1² + 3x1x2² − x1 )   ⇒   ∇f (0; 1) = ( 0; 0 )

Thus x = (0; 1) is a stationary point of f . Next,

   Hf (x) = ( 2x2   2x1 + 3x2² − 1 ; 2x1 + 3x2² − 1   6x1x2 )   ⇒   Hf (0; 1) = ( 2  2 ; 2  0 )

is not positive semidefinite (since the eigenvalues of Hf (0; 1) are 1 + √5 and 1 − √5 < 0). The necessary condition fails, thus x = (0; 1) is not a local minimizer.
lecture1 30/59
Convex sets/functions
Convex sets

Definition (Convex set)
A set D ⊆ Rn is said to be a convex set if for any two points x and y in D, the line segment joining x and y also lies in D. That is,

   x, y ∈ D ⇒ λx + (1 − λ)y ∈ D   ∀ λ ∈ [0, 1].

Figure 3: Left: convex. Right: non-convex. Image from internet

lecture1 31/59
(Strictly) convex functions

Definition ((Strictly) convex function)


Let D ⊆ Rn be a convex set. Consider a function f : D → R.
(1) The function f is said to be convex if

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y), ∀ x, y ∈ D, λ ∈ [0, 1].

(2) The function f is said to be strictly convex if

f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y),

for all distinct x, y ∈ D, λ ∈ (0, 1).

• Note that −f is convex ⇐⇒ f is concave

lecture1 32/59
Geometrical interpretation

For a convex function f , the line segment joining (x, f (x)) and (y, f (y)) lies above the graph of f on [x, y]. For a concave function f , the line segment joining (x, f (x)) and (y, f (y)) lies below the graph of f on [x, y].

lecture1 33/59
Benefit of convexity

Consider an unconstrained convex NLP

minimize f (x)

where f : Rn → R is convex. It holds that


(1) any local minimizer is a global minimizer.
(2) if f is strictly convex, then the global minimizer is unique.

lecture1 35/59
Test for convexity

Theorem (Test for convexity of a differentiable function)


Suppose that f has continuous second partial derivatives on an open
convex set D in Rn .
(1) The function f is convex on D if and only if the Hessian matrix
Hf (x) is positive semidefinite at each x ∈ D.
(2) If the Hessian matrix Hf (x) is positive definite at each x ∈ D, then
f is strictly convex on D.
(3) If the Hessian matrix Hf (x̂) is indefinite at some point x̂ ∈ D, then f is neither convex nor concave on D.

lecture1 36/59
Preliminaries
Notation

• Sn : the space of n × n symmetric matrices.


• Sn+ : the space of n × n symmetric and positive semidefinite matrices
(all eigenvalues ≥ 0).
• Let A be a matrix. Ai· denotes the i-th row of A, and A·j denotes
the j-th column of A.
 
. For example, given

     A = ( 1   2   3   4   5
           6   7   8   9   10
           11  12  13  14  15 ).

  Then A·1 = ( 1; 6; 11 ), and A1· = ( 1  2  3  4  5 ).

lecture1 37/59
Eigenvalues/eigenvectors

• The eigenvalue decomposition of A ∈ Sn is given by

     A = QΛQT = ( Q·1  · · ·  Q·n ) diag(λ1 , . . . , λn ) ( Q·1  · · ·  Q·n )T,

  where Q is an orthogonal matrix whose columns are eigenvectors of A, and Λ = diag(λ1 , . . . , λn ) is a diagonal matrix with the eigenvalues of A on the diagonal.
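
A minimal Matlab sketch using base Matlab's eig (the matrix A here is the 2 × 2 example used on the next slide):

A = [1.5 0.5; 0.5 1.5];
[Q, Lambda] = eig(A);              % columns of Q: eigenvectors; diag(Lambda): eigenvalues
residual = norm(A - Q*Lambda*Q')   % close to 0, confirming A = Q*Lambda*Q^T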

lecture1 38/59
Eigenvalues/eigenvectors: An example

" #
1.5 0.5
Consider A = .
0.5 1.5

Figure 4: Blue points are xi ∈ R2 , i = 1, . . . , 1000 randomly drawn from


[0, 1] × [0, 1]. Red points are yi = Axi after linear transformation.

lecture1 39/59
Eigenvalues/eigenvectors: An example

• Eigenvalue decomposition:

     A = ( −0.7071  0.7071 ; 0.7071  0.7071 ) diag(1, 2) ( −0.7071  0.7071 ; 0.7071  0.7071 )T,

  where the first and second columns of the orthogonal factor are eigenvector 1 and eigenvector 2.

  . The linear transformation results in a scaling of 2 along eigenvector 2.
  . The linear transformation results in a scaling of 1 along eigenvector 1.

• The linear transformation results in a scaling of λ along the eigenvector associated with λ.
lecture1 40/59
Mean/covariance

Let x1 , . . . , xn ∈ Rp be n observations of a random variable x.

• Mean vector:

     µ = x̄ = (1/n) Σ_{i=1}^{n} xi .

• (Sample/Empirical) Covariance matrix:

     Σ = (1/(n − 1)) Σ_{i=1}^{n} (xi − µ)(xi − µ)T .

  . Recall: covariance matrices are symmetric and positive semidefinite.

• Standard deviation (for p = 1):

     σ = √( (1/n) Σ_{i=1}^{n} (xi − µ)² ).
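
A minimal Matlab sketch computing these quantities on a synthetic sample (the data below are hypothetical; mean, cov, and std are base Matlab functions):

X = randn(1000, 2);    % 1000 observations of a 2-dimensional variable, one per row
mu = mean(X, 1)        % sample mean vector
Sigma = cov(X)         % sample covariance matrix (uses the 1/(n-1) convention)
sigma1 = std(X(:,1))   % sample standard deviation of the first coordinate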

lecture1 41/59
Mean/covariance: An example
" # " #
1 2
Mean µ = , covariance Σ = . We generate 1000 observations
4 3
of a Gaussian random variable
xi ∼ N (µ, Σ), i = 1, . . . , 1000.

mu = [1; 4];
Sigma = [2 0; 0 3];
for i = 1:1000
    xi = mvnrnd(mu', Sigma); % mvnrnd expects a row-vector mean
    plot(xi(1), xi(2), 'k.');
    hold on;
end

. We can see that the data is centered at the mean µ = ( 1; 4 )

lecture1 42/59
Mean/covariance: An example

" # " # " # " #


1 2 0.5 1 2 1
µ= Σ= µ= Σ=
4 0.5 3 4 1 3

lecture1 43/59
Mean/covariance: An example
" # " # " # " #
1 2 2 1 2 −2
µ= Σ= µ= Σ=
4 2 3 4 −2 3

Can you find a vector that would approximate the data? Blue one!

lecture1 45/59
PCA: Intuition

" # " #
1 2 2
µ= Σ= Consider eigenvectors of covariance matrix Σ
4 2 3
. λ = 0.4382 " #
−0.7882
eigenvector 1 =
0.6154
. λ = 4.5616 " #
0.6154
eigenvector 2 =
0.7882
. eigenvector 2 accounts for most
of the data

lecture1 46/59
Principal component analysis
PCA

• PCA is often used to reduce the dimensionality of large data sets


while preserving as much information as possible
• PCA allows us to identify the principal directions in which the data
varies

Next, we use an example to demonstrate PCA step by step.

Example 1
Consider a data set that consists of p = 2 variables on n = 8 samples:

   (1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8).

(Figure: scatter plot of the 8 samples)

lecture1 47/59
PCA: step-by-step explanation

• Input: an n × p matrix X, each row is an observation or sample,


each column is a predictor variable
• Step 1: standardization of each column of X — transform all variables to the same scale

   X·j = ( X·j − mean(X·j ) ) / standard deviation(X·j )

X = [1,2; 3,3; 3,5; 5,4; 5,6; 6,5; 8,7; 9,8]; % input
[n, p] = size(X);
% standardize each column
for j = 1:p
    X(:,j) = (X(:,j) - mean(X(:,j))) / std(X(:,j));
end

lecture1 48/59
PCA: step-by-step explanation

Figure 5: Input data.   Figure 6: Standardized data.

lecture1 49/59
PCA: step-by-step explanation

• Step 2: p × p covariance matrix computation — understand how the variables of the data are varying from the mean with respect to each other

Sigma = cov(X); % compute covariance matrix

which yields

Sigma =
    1.0000    0.9087
    0.9087    1.0000

lecture1 50/59
PCA: step-by-step explanation

• Step 3: eigenvalue decomposition — by ranking eigenvectors in order of their eigenvalues, highest to lowest, one obtains the principal components in order of significance

[Q, D] = eig(Sigma);                 % eigenvalue decomposition
[d, ind] = sort(diag(D), 'descend'); % sort eigenvalues
D = D(ind, ind);                     % reorder diagonal of D (eigenvalues)
Q = Q(:, ind);                       % reorder columns of Q (eigenvectors)

which yields

Q =
   -0.7071    0.7071
   -0.7071   -0.7071
D =
    1.9087         0
         0    0.0913

. The first eigenvalue of 1.9087 is larger than the second eigenvalue of 0.0913
. The first eigenvector captures 95% (1.9087 / (1.9087 + 0.0913)) of the information
. The second eigenvector captures 5% (0.0913 / (1.9087 + 0.0913)) of the information
lecture1 51/59
PCA: step-by-step explanation

Figure 7: Principal directions. (−0.7071, −0.7071) carries 95% of the information, and (0.7071, −0.7071) carries 5% of the information.

lecture1 52/59
PCA: step-by-step explanation

• Step 4: recast the data along the principal component axes — the jth column of Xnew is the jth principal component

   Xnew = XQ

Xnew = X * Q; % new data

It is up to you to choose whether
. to keep all the components
. or discard the ones of lesser significance
There are some practical rules in real applications (introduced later).

lecture1 53/59
PCA: step-by-step explanation

Figure 8: Recast the data along the first principal direction. Discard the second component and the dimension is reduced by 1.

lecture1 54/59
Standard PCA workflow

1. Make sure the data X is arranged with rows = observations and columns = variables
2. Standardize the columns of X
3. Run [Q, Xnew, d, tsquared, explained] = pca(X)
3. Run [Q, Xnew , d, tsquared, explained] = pca(X)
4. Using the variance% in “explained”, choose k (usually 1, 2, or 3)
components for visual analysis
. For example, if d = (1.9087, 0.0913), explained= (95.4, 4.6), one
may choose k = 1 as the first principal component carries 95.4% of
the information
. For example, if d = (2.9108, 0.9212, 0.1474, 0.0206),
explained= (72.8, 23.0, 3.7, 0.5), one may choose k = 2 as the first
two principal components carry 95.8% of the information
5. Plot Xnew (:, 1), . . . , Xnew (:, k) on a k-dimensional plot

lecture1 55/59
Application: Iris flower data set

• Iris flower data set³ contains 3 species (Setosa/Versicolor/Virginica) of 50 instances each. n = 150
• p = 4 attributes: sepal length, sepal width, petal length, petal width

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   species
5.1                 3.5                1.4                 0.2                setosa
4.9                 3.0                1.4                 0.2                setosa
4.7                 3.2                1.3                 0.2                setosa

Table 1: The first 3 rows of the 150-instance Iris data set

³ See Wikipedia for more details


lecture1 56/59
Application: Iris flower data set

• Choose 2 principal components (reduce the size of our data by 50%)


• First 2 principal components explain 72.8% + 23.0% = 95.8% of
total variance of the data
• Based on 2 principal components, classification would still be an
easy task

(Scatter plot: Principal component 1 (72.8%) on the horizontal axis, Principal component 2 (23.0%) on the vertical axis)

Figure 9: Setosa (blue), Versicolor (green), Virginica (yellow)


lecture1 57/59
Application: Iris flower data set

• Choose 1 principal component (throwing out a massive 75% of our data)
• Still end up with a remarkably effective single principal component for the purposes of classification

(Scatter plot of the data along Principal component 1 (72.8%))

Figure 10: Setosa (blue), Versicolor (green), Virginica (yellow)

Programming Exercise
Do PCA on the Iris flower data set and plot Figure 9 and Figure 10.

lecture1 58/59
Application: Iris flower data set

Listing 1: Matlab code applying PCA to the Iris flower data set

[X, y] = iris_dataset;   % load the data
X = X';                  % make sure rows = observations and columns = variables
y = vec2ind(y)';         % species label
for j = 1:size(X,2)      % standardize columns of X
    X(:,j) = (X(:,j) - mean(X(:,j))) / std(X(:,j));
end
[Q, Xnew, d, tsquared, explained] = pca(X);   % PCA
% plot 2 principal components
figure; scatter(Xnew(:,1), Xnew(:,2), 25, y);
pc1 = sprintf('Principal component 1 (%2.1f%%)', explained(1));
pc2 = sprintf('Principal component 2 (%2.1f%%)', explained(2));
xlabel(pc1, 'fontSize', 15); ylabel(pc2, 'fontSize', 15);
% plot 1 principal component
figure; scatter(Xnew(:,1), zeros(size(X,1),1), 25, y);
xlabel(pc1, 'fontSize', 15);

lecture1 59/59
References

[1] K.-C. Toh. MA5268: Theory and algorithms for nonlinear optimization.
