
Lecture 1: Introduction to optimization

DSA5103 Optimization algorithms for data modelling

Yangjing Zhang
15-Aug-2023
NUS
Contacts

• Lecturer: Yangjing Zhang


• Email: yj.zhang [AT] nus.edu.sg
• Office: S17-05-16
• Feel free to contact me through email with a proper salutation:
. Dr. Zhang (✓ formal)
. Yangjing (✓ informal)
. Hi/Hi Professor (✗)
• Programming tools: I will use Matlab for demonstration. Feel free to
use Python/R/Julia

lecture1 1/59
About

Probably will not cover deep learning, reinforcement learning ...

lecture1 2/59
Assessment

Final grade based on:

• Final exam: 40%


. Interpretation, computation
. Implementation of some algorithms
. No proof required
• Homework (4 sets): 60%

lecture1 3/59
Today’s content¹

• Introduction
1. Motivating examples
2. Graphical illustration
3. General nonlinear programming
4. Basic calculus and linear algebra
5. Necessary/Sufficient optimality conditions
6. Convex sets/functions
• Principal component analysis
1. Recap linear algebra (eigenvalues/eigenvectors) and statistics
(mean/covariance)
2. PCA

¹ Most of the content is adapted from [1]


lecture1 4/59
Motivating examples
Simple examples

Example 1
If K units of capital and L units of labor are used, a company can
produce KL units of a product. Capital can be purchased at $4 per
unit and labor can be purchased at $1 per unit. A total of $8000 is
available to purchase capital and labor. How can the firm maximize the
quantity of the product manufactured?

Solution. Let K = units of capital purchased and L = units of labor purchased. The problem to solve is

maximize   KL
subject to 4K + L ≤ 8000, K ≥ 0, L ≥ 0.
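
As a quick numerical sanity check, one could solve this with fmincon (a minimal sketch, assuming the Optimization Toolbox is available; the starting point [1; 1] is an arbitrary choice):

obj = @(x) -x(1)*x(2);            % maximize KL  <=>  minimize -KL, with x = (K; L)
A = [4 1]; b = 8000;              % budget constraint 4K + L <= 8000
lb = [0; 0];                      % K >= 0, L >= 0
x0 = [1; 1];                      % arbitrary starting point
x = fmincon(obj, x0, A, b, [], [], lb, []);
% should recover approximately K = 1000, L = 4000 (product 4,000,000)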

lecture1 5/59
Simple examples

Example 2
It costs a company $c to produce a unit of a product. If the company
charges $p per unit, and the customers demand D(p) units, what price
should the company charge to maximize its profit?

Solution. The firm’s decision variable is p, and the profit is (p − c)D(p). The problem to solve is

maximize (p − c)D(p)
subject to p ≥ 0.

lecture1 6/59
Linear regression

• Predict the price for a house with area 3500

Figure 1: Housing prices data² (scatter plot of price against area)

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

lecture1 7/59
Linear regression

area   #bedrooms   #bathrooms   stories   ···   price
7420   4           2            2         ···   13300000
8960   4           4            4         ···   12250000
...
• Predictor/feature vector a1 = (7420, 4, 2, 2, . . . )T ∈ Rp


• Response/target b1 = 13300000
• p: #features. n: #samples
• Fit a linear model b = xT a + α

lecture1 8/59
Linear regression

• Given data (ai , bi ), i = 1, . . . , n, where ai ∈ Rp is the predictor vector and bi ∈ R is the response.
• Fit a linear model b = xT a + α.
• The goal is to find x ∈ Rp and α ∈ R by solving

   minimize_{x,α}  (1/2) Σ_{i=1}^{n} (bi − xT ai − α)²
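
A minimal Matlab sketch of this least-squares fit on synthetic data (the numbers below are made up for illustration; only base Matlab is used):

n = 100; p = 3;
A = randn(n, p);                            % predictor vectors a_i as rows
xtrue = [1; -2; 0.5]; alphatrue = 3;        % hypothetical true model
b = A*xtrue + alphatrue + 0.1*randn(n, 1);  % noisy responses
theta = [A, ones(n, 1)] \ b;                % least-squares estimate of (x; alpha)
x = theta(1:p), alpha = theta(p+1)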

lecture1 9/59
Optimal transportation (OT)
Example: OT
s supplies a1 , . . . , as of goods must be transported to meet t demands b1 , . . . , bt of customers. The cost of transporting one unit of the ith good to the jth customer is cij . What is the optimal transportation plan (xij units of the ith good transported to the jth customer) that minimizes the cost?

. For example, the cost cij may be determined by the distance
. A feasible transportation plan:
   ( 0.1  0.4  0 ; 0.1  0.2  0.2 )
. [Exercise] Construct another feasible transportation plan
lecture1 10/59
Optimal transportation (OT)

Cost: Σij cij xij

Constraints: supply constraints (each supply ai must be shipped out) and demand constraints (each demand bj must be met)
lecture1 11/59
Optimal transportation (OT)

Solution. Suppose xij units of the ith good are transported to the jth customer. The cost is Σij cij xij . The problem to solve is

minimize    Σ_{i=1}^{s} Σ_{j=1}^{t} cij xij
subject to  Σ_{j=1}^{t} xij = ai ,   i ∈ [s] = {1, . . . , s}
            Σ_{i=1}^{s} xij = bj ,   j ∈ [t]
            xij ≥ 0,   i ∈ [s], j ∈ [t].

• Such a problem arises in computing the Wasserstein distance (a hot topic in machine learning and statistics), which measures the similarity of two discrete probability distributions.
• In this case, Σ_{i=1}^{s} ai = 1 and Σ_{j=1}^{t} bj = 1, so a and b correspond to two discrete probability distributions.
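
A minimal Matlab sketch solving this LP with linprog (assumes the Optimization Toolbox; the cost matrix below is a hypothetical example):

s = 2; t = 3;
a = [0.5; 0.5];                         % supplies (sum to 1)
b = [0.2; 0.6; 0.2];                    % demands (sum to 1)
C = [1 2 3; 4 5 6];                     % hypothetical costs c_ij
Aeq = [kron(ones(1,t), eye(s));         % row sums of the plan equal a_i
       kron(eye(t), ones(1,s))];        % column sums of the plan equal b_j
beq = [a; b];
x = linprog(C(:), [], [], Aeq, beq, zeros(s*t,1), []);
X = reshape(x, s, t)                    % optimal transportation plan x_ij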
lecture1 12/59
Graphical illustration
Graphical illustration in 1-dimension


maximize f (x) = x√(24 − x)

(Figure: graph of f (x))

x = 16 achieves the maximal value 16√(24 − 16)
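
A minimal check with base Matlab's fminbnd (a minimizer, so we negate f; this assumes the objective is f (x) = x√(24 − x) as written above):

f = @(x) x .* sqrt(24 - x);
[xstar, fneg] = fminbnd(@(x) -f(x), 0, 24);
xstar          % approximately 16
fmax = -fneg   % approximately 16*sqrt(8)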
lecture1 13/59
Graphical illustration in 2-dimension

minimize    f (x) = (x1 − 4)² + (x2 − 6)²
subject to  x1 ≤ 4
            x2 ≤ 6
            3x1 + 2x2 ≤ 18
            x1 ≥ 0
            x2 ≥ 0

• Note that (x1 − 4)² + (x2 − 6)² is the squared distance between the points (x1 , x2 )T and (4, 6)T
• Graphically, we want to find the feasible point (x1 , x2 )T closest to (4, 6)T

lecture1 14/59
Graphical illustration in 2-dimension

minimize    (x1 − 4)² + (x2 − 6)²
subject to  x1 ≤ 4
            x2 ≤ 6
            3x1 + 2x2 ≤ 18
            x1 ≥ 0
            x2 ≥ 0

(Figure: feasible region and the point (4, 6)T in the (x1 , x2 )-plane)

(x1 , x2 ) = (34/13, 66/13)T achieves the minimal value (34/13 − 4)² + (66/13 − 6)²
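
Since the objective is quadratic and the constraints are linear, one can also check this numerically with quadprog (a minimal sketch, assuming the Optimization Toolbox):

% (x1-4)^2 + (x2-6)^2 = 0.5*x'*(2*eye(2))*x - 8*x1 - 12*x2 + constant
H = 2*eye(2); f = [-8; -12];
A = [1 0; 0 1; 3 2]; b = [4; 6; 18];   % x1 <= 4, x2 <= 6, 3x1 + 2x2 <= 18
lb = [0; 0];
x = quadprog(H, f, A, b, [], [], lb, [])   % approximately (34/13, 66/13) = (2.6154, 5.0769)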

lecture1 15/59
Graphical illustration

• Good for intuition
• Gives solutions visually
• Only works in 1 or 2 dimensions
• For higher-dimensional problems, one has to rely on algebraic conditions to find (as well as to verify) the solutions

lecture1 16/59
General nonlinear programming
Nonlinear programming

A general nonlinear programming problem (NLP) is

minimize (or maximize)  f (x)
subject to              gi (x) = 0,  i ∈ [m]
                        hj (x) ≤ 0,  j ∈ [p]

• f , gi , and hj are (possibly nonlinear) functions of the variable x = (x1 , . . . , xn )T ∈ Rn
• f is the objective function, each gi (x) = 0 is an equality constraint, and each hj (x) ≤ 0 is an inequality constraint
• It suffices to discuss minimization problems since

   minimize f (x)  ⇐⇒  maximize − f (x)

lecture1 17/59
Terminology and notation

• Vectors in Rn will always be column vectors, unless otherwise


specified
. row vector (x1 , x2 , x3 )
. column vector (x1 ; x2 ; x3 ) or (x1 , x2 , x3 )T
• The feasible set for an NLP is the set

S = {x ∈ Rn | g1 (x) = 0, . . . , gm (x) = 0, h1 (x) ≤ 0, . . . , hp (x) ≤ 0}

• A point in the feasible set is a feasible solution or feasible point;


otherwise, it is an infeasible solution or infeasible point
• When there is no constraint, S = Rn , we say the NLP is
unconstrained

lecture1 18/59
Local/global minimizer

minimize    f (x) = (sin(πx))² eˣ − x²
subject to  0 ≤ x ≤ 3.5

(Figure: graph of f on [0, 3.5], marking a local minimizer and the global minimizer)
• Local/global minimizer may not be unique
• A global minimizer is a local minimizer. However, the converse is not true in general

lecture1 19/59
Local/global minimizer

Definition (Local minimizer and global minimizer)
Let S be the feasible set. Define Bε(y) = {x ∈ Rn | ‖x − y‖ < ε} to be the open ball with center y and radius ε. (‖x‖ = √(x1² + · · · + xn²))
1. A point x∗ ∈ S is said to be a local minimizer of f if there exists ε > 0 such that

   f (x∗) ≤ f (x)   ∀ x ∈ S ∩ Bε(x∗)

2. A point x∗ ∈ S is said to be a global minimizer of f if

   f (x∗) ≤ f (x)   ∀ x ∈ S

lecture1 20/59
Basic calculus and linear algebra
Interior point

Definition (Interior point)
Let S ⊆ Rn be a nonempty set. A point x ∈ S is called an interior point of S if there exists ε > 0 such that

   Bε(x) ⊆ S

Figure 2: The point x is an interior point of S. The point y is on the boundary of S. Image from internet
lecture1 21/59
Gradient vector

Definition (Gradient vector)
Let S ⊆ Rn be a nonempty set. Suppose f : S → R, and x is an interior point of S such that f is differentiable at x. Then the gradient vector of f at x is

   ∇f (x) = ( ∂f/∂x1 (x); . . . ; ∂f/∂xn (x) )

• At x∗, the value f (x) decreases most rapidly along the direction −∇f (x∗). On the other hand, it increases most rapidly along the direction ∇f (x∗).

lecture1 22/59
Hessian matrix

Definition (Hessian matrix)
Let S ⊆ Rn be a nonempty set. Suppose f : S → R, and x is an interior point of S such that f has second order partial derivatives at x. Then the Hessian of f at x is the n × n matrix

   Hf (x) = ( ∂²f/∂x1∂x1 (x)   · · ·   ∂²f/∂x1∂xn (x)
                    ⋮            ⋱            ⋮
              ∂²f/∂xn∂x1 (x)   · · ·   ∂²f/∂xn∂xn (x) )

• The ij-entry of Hf (x) is ∂²f/∂xi∂xj (x). In general, Hf (x) is not symmetric. However, if f has continuous second order derivatives, then the Hessian matrix is symmetric.

lecture1 23/59
Example

Compute the gradient and Hessian
(1) f (x) = x³ + 2x + eˣ, x ∈ R.
(2) f (x) = x1³ + 2x1x2 + x2², x = (x1 ; x2 ) ∈ R².

Solution (1).

   ∇f (x) = f′(x) = 3x² + 2 + eˣ,   Hf (x) = f″(x) = 6x + eˣ.

Solution (2).

   ∇f (x) = ( ∂f/∂x1 (x); ∂f/∂x2 (x) ) = ( 3x1² + 2x2; 2x1 + 2x2 )

   Hf (x) = ( ∂²f/∂x1∂x1 (x)  ∂²f/∂x1∂x2 (x) ; ∂²f/∂x2∂x1 (x)  ∂²f/∂x2∂x2 (x) ) = ( 6x1  2 ; 2  2 )
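
These derivatives can be double-checked symbolically (a minimal sketch, assuming the Symbolic Math Toolbox is available):

syms x1 x2
f = x1^3 + 2*x1*x2 + x2^2;
g = gradient(f, [x1 x2])   % returns [3*x1^2 + 2*x2; 2*x1 + 2*x2]
H = hessian(f, [x1 x2])    % returns [6*x1, 2; 2, 2]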

lecture1 24/59
Positive (semi)definite matrices

Definition (Positive (semi)definite)


Let A be a real n × n matrix.

(1) A is said to be positive semidefinite if xT Ax ≥ 0, ∀ x ∈ Rn .
(2) A is said to be positive definite if xT Ax > 0, ∀ x ≠ 0.
(3) A is said to be negative semidefinite if −A is positive semidefinite.
(4) A is said to be negative definite if −A is positive definite.
(5) A is said to be indefinite if A is neither positive semidefinite nor negative semidefinite.

lecture1 25/59
Positive (semi)definite matrices

Checking whether a matrix is positive semidefinite directly from the definition is not easy in general. The following theorem is useful for checking the definiteness of a matrix.
Theorem (Eigenvalue test)
Let A be a real symmetric n × n matrix.

(1) A is positive semidefinite if and only if every eigenvalue of A is


nonnegative.
(2) A is positive definite if and only if every eigenvalue of A is positive.
(3) A is negative semidefinite if and only if every eigenvalue of A is
nonpositive.
(4) A is negative definite if and only if every eigenvalue of A is negative.
(5) A is indefinite if and only if it has both a positive eigenvalue and a
negative eigenvalue.
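
In Matlab, the eigenvalue test can be applied directly with eig (a minimal sketch; the matrix below is a hypothetical example):

A = [2 1; 1 2];            % a symmetric test matrix
lambda = eig(A);           % eigenvalues (here 1 and 3)
isPSD = all(lambda >= 0)   % true: A is positive semidefinite
isPD  = all(lambda > 0)    % true: A is positive definite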

lecture1 26/59
Necessary/Sufficient optimality conditions
Unconstrained NLP

Consider an unconstrained NLP

minimize f (x)

Assume that f : Rn → R is nonlinear and differentiable. A point x∗ is


called a stationary point of f if ∇f (x∗ ) = 0.
Next, we investigate some optimality conditions for a local minimizer

• Necessary (optimality) condition


• Sufficient (optimality) condition

lecture1 27/59
Necessary/sufficient condition

Necessary condition If x∗ is a local minimizer of f , then


(1) x∗ is a stationary point, i.e., ∇f (x∗ ) = 0
(2) Hf (x∗ ) is positive semidefinite

• allows us to confine our search for global minimizers to the set of stationary points

Sufficient condition If
(1) x∗ is a stationary point, i.e., ∇f (x∗ ) = 0
(2) Hf (x∗ ) is positive definite,
then x∗ is a local minimizer of f .

• allows us to verify that a point is indeed a local minimizer

lecture1 28/59
Example
Example
Let f (x) = x1² + x2² − 2x2, x = (x1 ; x2 ) ∈ R².
(1) Find a stationary point of f .
(2) Verify that the stationary point is a local minimizer (by checking the sufficient condition).

Solution. We let

   ∇f (x∗) = ( ∂f/∂x1 (x∗); ∂f/∂x2 (x∗) ) = ( 2x1∗; 2x2∗ − 2 ) = 0   ⇒   x∗ = ( 0; 1 )

Thus x∗ = (0, 1)T is a stationary point of f . Next, we verify that

   Hf (x∗) = ( ∂²f/∂x1∂x1 (x∗)  ∂²f/∂x1∂x2 (x∗) ; ∂²f/∂x2∂x1 (x∗)  ∂²f/∂x2∂x2 (x∗) ) = ( 2  0 ; 0  2 )

is positive definite (since the eigenvalues of Hf (x∗) are 2 and 2, which are positive).
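
As a quick numerical check, fminunc (a minimal sketch, assuming the Optimization Toolbox; the starting point [3; 3] is an arbitrary choice) converges to this minimizer:

f = @(x) x(1)^2 + x(2)^2 - 2*x(2);
xstar = fminunc(f, [3; 3])   % approximately (0; 1)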
lecture1 29/59
Example
Example
Let f (x) = x1²x2 + x1x2³ − x1x2, x = (x1 ; x2 ) ∈ R².
(1) Verify that x = (0; 1) is a stationary point of f .
(2) Is x = (0; 1) a local minimizer?

Solution.

   ∇f (x) = ( 2x1x2 + x2³ − x2; x1² + 3x1x2² − x1 )   ⇒   ∇f (0; 1) = ( 0; 0 )

Thus x = (0; 1) is a stationary point of f . Next,

   Hf (x) = ( 2x2   2x1 + 3x2² − 1 ; 2x1 + 3x2² − 1   6x1x2 )   ⇒   Hf (0; 1) = ( 2  2 ; 2  0 )

is not positive semidefinite (since the eigenvalues of Hf (0; 1) are 1 + √5 and 1 − √5 < 0). The necessary condition fails, thus x = (0; 1) is not a local minimizer.
lecture1 30/59
Convex sets/functions
Convex sets

Definition (Convex set)
A set D ⊆ Rn is said to be a convex set if for any two points x and y in D, the line segment joining x and y also lies in D. That is,

   x, y ∈ D ⇒ λx + (1 − λ)y ∈ D   ∀ λ ∈ [0, 1].

Figure 3: Left: convex. Right: non-convex. Image from internet

lecture1 31/59
(Strictly) convex functions

Definition ((Strictly) convex function)


Let D ⊆ Rn be a convex set. Consider a function f : D → R.
(1) The function f is said to be convex if

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y), ∀ x, y ∈ D, λ ∈ [0, 1].

(2) The function f is said to be strictly convex if

f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y),

for all distinct x, y ∈ D, λ ∈ (0, 1).

• Note that −f is convex ⇐⇒ f is concave

lecture1 32/59
Geometrical interpretation

For a convex function f , the line segment joining (x, f (x)) and (y, f (y)) lies above the graph of f on [x, y]. For a concave function f , the line segment joining (x, f (x)) and (y, f (y)) lies below the graph of f on [x, y].

lecture1 33/59
Benefit of convexity

Consider an unconstrained convex NLP

minimize f (x)

where f : Rn → R is convex. It holds that


(1) any local minimizer is a global minimizer.
(2) if f is strictly convex, then the global minimizer is unique.

lecture1 35/59
Test for convexity

Theorem (Test for convexity of a differentiable function)


Suppose that f has continuous second partial derivatives on an open
convex set D in Rn .
(1) The function f is convex on D if and only if the Hessian matrix
Hf (x) is positive semidefinite at each x ∈ D.
(2) If the Hessian matrix Hf (x) is positive definite at each x ∈ D, then
f is strictly convex on D.
(3) If the Hessian matrix Hf (x̂) is indefinite at some point x̂ ∈ D, then f is neither convex nor concave on D.

lecture1 36/59
Preliminaries
Notation

• Sn : the space of n × n symmetric matrices.


• Sn+ : the space of n × n symmetric and positive semidefinite matrices
(all eigenvalues ≥ 0).
• Let A be a matrix. Ai· denotes the i-th row of A, and A·j denotes
the j-th column of A.
 
. For example, given

     A = ( 1   2   3   4   5
           6   7   8   9   10
           11  12  13  14  15 ).

  Then A·1 = ( 1; 6; 11 ), and A1· = ( 1  2  3  4  5 ).

lecture1 37/59
Eigenvalues/eigenvectors

• The eigenvalue decomposition of A ∈ Sn is given by

     A = QΛQT = ( Q·1  · · ·  Q·n ) diag(λ1 , . . . , λn ) ( Q·1  · · ·  Q·n )T,

  where Q is an orthogonal matrix whose columns are eigenvectors of A, and Λ = diag(λ1 , . . . , λn ) is a diagonal matrix with the eigenvalues of A on the diagonal.
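
A minimal Matlab sketch using base Matlab's eig (the matrix A here is the 2 × 2 example used on the next slide):

A = [1.5 0.5; 0.5 1.5];
[Q, Lambda] = eig(A);              % columns of Q: eigenvectors; diag(Lambda): eigenvalues
residual = norm(A - Q*Lambda*Q')   % close to 0, confirming A = Q*Lambda*Q^T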

lecture1 38/59
Eigenvalues/eigenvectors: An example

" #
1.5 0.5
Consider A = .
0.5 1.5

Figure 4: Blue points are xi ∈ R2 , i = 1, . . . , 1000 randomly drawn from


[0, 1] × [0, 1]. Red points are yi = Axi after linear transformation.

lecture1 39/59
Eigenvalues/eigenvectors: An example

• Eigenvalue decomposition:

     A = ( −0.7071  0.7071 ; 0.7071  0.7071 ) diag(1, 2) ( −0.7071  0.7071 ; 0.7071  0.7071 )T,

  where the first and second columns of the orthogonal factor are eigenvector 1 and eigenvector 2.

  . The linear transformation results in a scaling of 2 along eigenvector 2.
  . The linear transformation results in a scaling of 1 along eigenvector 1.

• The linear transformation results in a scaling of λ along the eigenvector associated with λ.
lecture1 40/59
Mean/covariance

Let x1 , . . . , xn ∈ Rp be n observations of a random variable x.

• Mean vector:

     µ = x̄ = (1/n) Σ_{i=1}^{n} xi .

• (Sample/Empirical) Covariance matrix:

     Σ = (1/(n − 1)) Σ_{i=1}^{n} (xi − µ)(xi − µ)T .

  . Recall: covariance matrices are symmetric and positive semidefinite.

• Standard deviation (for p = 1):

     σ = √( (1/n) Σ_{i=1}^{n} (xi − µ)² ).
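
A minimal Matlab sketch computing these quantities on a synthetic sample (the data below are hypothetical; mean, cov, and std are base Matlab functions):

X = randn(1000, 2);    % 1000 observations of a 2-dimensional variable, one per row
mu = mean(X, 1)        % sample mean vector
Sigma = cov(X)         % sample covariance matrix (uses the 1/(n-1) convention)
sigma1 = std(X(:,1))   % sample standard deviation of the first coordinate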

lecture1 41/59
Mean/covariance: An example
" # " #
1 2
Mean µ = , covariance Σ = . We generate 1000 observations
4 3
of a Gaussian random variable
xi ∼ N (µ, Σ), i = 1, . . . , 1000.

mu = [1; 4];
Sigma = [2 0; 0 3];
for i = 1:1000
    xi = mvnrnd(mu', Sigma); % mvnrnd expects a row-vector mean
    plot(xi(1), xi(2), 'k.');
    hold on;
end

. We can see that the data is centered at the mean µ = ( 1; 4 )

lecture1 42/59
Mean/covariance: An example

" # " # " # " #


1 2 0.5 1 2 1
µ= Σ= µ= Σ=
4 0.5 3 4 1 3

lecture1 43/59
Mean/covariance: An example
" # " # " # " #
1 2 2 1 2 −2
µ= Σ= µ= Σ=
4 2 3 4 −2 3

Can you find a vector that would approximate the data? Blue one!

lecture1 45/59
PCA: Intuition

" # " #
1 2 2
µ= Σ= Consider eigenvectors of covariance matrix Σ
4 2 3
. λ = 0.4382 " #
−0.7882
eigenvector 1 =
0.6154
. λ = 4.5616 " #
0.6154
eigenvector 2 =
0.7882
. eigenvector 2 accounts for most
of the data

lecture1 46/59
Principal component analysis
PCA

• PCA is often used to reduce the dimensionality of large data sets


while preserving as much information as possible
• PCA allows us to identify the principal directions in which the data
varies

Next, we use an example to demonstrate PCA step by step.

Example 1
Consider a data set that consists of p = 2 variables on n = 8 samples:

   (1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8).

(Figure: scatter plot of the 8 samples)

lecture1 47/59
PCA: step-by-step explanation

• Input: an n × p matrix X, each row is an observation or sample,


each column is a predictor variable
• Step 1: standardization of each column of X — transform all variables to the same scale

   X·j = ( X·j − mean(X·j ) ) / standard deviation(X·j )

X = [1,2; 3,3; 3,5; 5,4; 5,6; 6,5; 8,7; 9,8]; % input
[n, p] = size(X);
% standardize each column
for j = 1:p
    X(:,j) = (X(:,j) - mean(X(:,j))) / std(X(:,j));
end

lecture1 48/59
PCA: step-by-step explanation

Figure 5: Input data.   Figure 6: Standardized data.

lecture1 49/59
PCA: step-by-step explanation

• Step 2: p × p covariance matrix computation — understand how the variables of the data are varying from the mean with respect to each other

Sigma = cov(X); % compute covariance matrix

which yields

Sigma =
    1.0000    0.9087
    0.9087    1.0000

lecture1 50/59
PCA: step-by-step explanation

• Step 3: eigenvalue decomposition — by ranking eigenvectors in order of their eigenvalues, highest to lowest, one obtains the principal components in order of significance

[Q, D] = eig(Sigma);                 % eigenvalue decomposition
[d, ind] = sort(diag(D), 'descend'); % sort eigenvalues
D = D(ind, ind);                     % reorder diagonal of D (eigenvalues)
Q = Q(:, ind);                       % reorder columns of Q (eigenvectors)

which yields

Q =
   -0.7071    0.7071
   -0.7071   -0.7071
D =
    1.9087         0
         0    0.0913

. The first eigenvalue of 1.9087 is larger than the second eigenvalue of 0.0913
. The first eigenvector captures 95% (1.9087 / (1.9087 + 0.0913)) of the information
. The second eigenvector captures 5% (0.0913 / (1.9087 + 0.0913)) of the information
lecture1 51/59
PCA: step-by-step explanation

Figure 7: Principal directions. (−0.7071, −0.7071) carries 95% of the information, and (0.7071, −0.7071) carries 5% of the information.

lecture1 52/59
PCA: step-by-step explanation

• Step 4: recast the data along the principal component axes — the jth column of Xnew is the jth principal component

   Xnew = XQ

Xnew = X * Q; % new data

It is up to you to choose whether
. to keep all the components
. or discard the ones of lesser significance
There are some practical rules in real applications (introduced later).

lecture1 53/59
PCA: step-by-step explanation

Figure 8: Recast the data along the first principal direction. Discard the second component and the dimension is reduced by 1.

lecture1 54/59
Standard PCA workflow

1. Make sure the data X is arranged with rows = observations and columns = variables
2. Standardize the columns of X
3. Run [Q, Xnew, d, tsquared, explained] = pca(X)
3. Run [Q, Xnew , d, tsquared, explained] = pca(X)
4. Using the variance% in “explained”, choose k (usually 1, 2, or 3)
components for visual analysis
. For example, if d = (1.9087, 0.0913), explained= (95.4, 4.6), one
may choose k = 1 as the first principal component carries 95.4% of
the information
. For example, if d = (2.9108, 0.9212, 0.1474, 0.0206),
explained= (72.8, 23.0, 3.7, 0.5), one may choose k = 2 as the first
two principal components carry 95.8% of the information
5. Plot Xnew (:, 1), . . . , Xnew (:, k) on a k-dimensional plot

lecture1 55/59
Application: Iris flower data set

• Iris flower data set³ contains 3 species (Setosa/Versicolor/Virginica) of 50 instances each. n = 150
• p = 4 attributes: sepal length, sepal width, petal length, petal width

sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   species
5.1                 3.5                1.4                 0.2                setosa
4.9                 3.0                1.4                 0.2                setosa
4.7                 3.2                1.3                 0.2                setosa

Table 1: The first 3 rows of the 150-instance Iris data set

³ See Wikipedia for more details


lecture1 56/59
Application: Iris flower data set

• Choose 2 principal components (reduce the size of our data by 50%)


• First 2 principal components explain 72.8% + 23.0% = 95.8% of
total variance of the data
• Based on 2 principal components, classification would still be an
easy task

(Scatter plot: Principal component 1 (72.8%) on the horizontal axis, Principal component 2 (23.0%) on the vertical axis)

Figure 9: Setosa (blue), Versicolor (green), Virginica (yellow)


lecture1 57/59
Application: Iris flower data set

• Choose 1 principal component (throwing out a massive 75% of our data)
• Still end up with a remarkably effective single principal component for the purposes of classification

(Scatter plot of the data along Principal component 1 (72.8%))

Figure 10: Setosa (blue), Versicolor (green), Virginica (yellow)

Programming Exercise
Do PCA on the Iris flower data set and plot Figure 9 and Figure 10.

lecture1 58/59
Application: Iris flower data set

Listing 1: Matlab code applying PCA to the Iris flower data set

[X, y] = iris_dataset;   % load the data
X = X';                  % make sure rows = observations and columns = variables
y = vec2ind(y)';         % species label
for j = 1:size(X,2)      % standardize columns of X
    X(:,j) = (X(:,j) - mean(X(:,j))) / std(X(:,j));
end
[Q, Xnew, d, tsquared, explained] = pca(X);   % PCA
% plot 2 principal components
figure; scatter(Xnew(:,1), Xnew(:,2), 25, y);
pc1 = sprintf('Principal component 1 (%2.1f%%)', explained(1));
pc2 = sprintf('Principal component 2 (%2.1f%%)', explained(2));
xlabel(pc1, 'fontSize', 15); ylabel(pc2, 'fontSize', 15);
% plot 1 principal component
figure; scatter(Xnew(:,1), zeros(size(X,1),1), 25, y);
xlabel(pc1, 'fontSize', 15);

lecture1 59/59
References

[1] K.-C. Toh. MA5268: Theory and algorithms for nonlinear optimization.
