
Foundations of Information Systems Engineering

IEE 605

Practice Midterm Exam – Mar 16th, 2022

1. One page of notes (double sided) is allowed. Calculators without internet capability are
allowed.

2. Write your answer on this exam paper. DO NOT use your own paper. If you need more
space, write on the back side of this exam paper.

3. The test is worth 100 points. The assigned points for each question are given beside it.
There is an extra 10 points that you can earn.

4. When time is called, DO NOT enter any more answers on your exam. You must have
a photo ID to take an exam. Please place your student ID card on the desk. Failure to
present a photo ID for an exam results in an automatic 10% deduction. In addition, until
a valid photo ID is presented to the instructor, your exam score is “0”

5. Cell phones, electronic readers, computers, smart watches, and other internet-capable de-
vices are not allowed. If your cell phone or other internet-capable device is seen out in any
way during an exam, you will be given a zero for the exam and reported for a violation of
ASU’s Academic Integrity Policy.

6. Sign the Honor Code below:


I commit to uphold the ideals of honor and integrity by refusing to betray the trust bestowed
upon me and by abiding by the ASU Academic Integrity Policy.
Name:
Student ID:
Signature: Date:

Page 1 of 11 – Foundations of Information Systems Engineering (IEE 605)


Part 3 Computational Problems (34 points) Please solve the following problems.
Show the necessary computation steps; writing down the equation you use for each
subproblem will help you earn partial credit on each question.
However, please try to be concise in your answers.

1. Logistic Regression

$$L(\theta) = \sum_i -y_i \log(\sigma(\theta^T x_i)) - (1 - y_i) \log(1 - \sigma(\theta^T x_i))$$

Here, $\sigma(x) = \frac{1}{1+\exp(-x)}$ and $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$.

(a) (10 points) Please first derive the gradient of $L(\theta)$.

$$L(\theta) = \sum_i -y_i \log(\sigma(\theta^T x_i)) - (1 - y_i) \log(1 - \sigma(\theta^T x_i))$$

First we know that

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

\begin{align*}
\frac{\partial L(\theta)}{\partial \theta}
&= \frac{\partial}{\partial\theta}\left(\sum_i -y_i \log(\sigma(\theta^T x_i)) - (1 - y_i)\log(1 - \sigma(\theta^T x_i))\right) \\
&= \sum_i -y_i \frac{\partial}{\partial\theta}\log(\sigma(\theta^T x_i)) - (1 - y_i)\frac{\partial}{\partial\theta}\log(1 - \sigma(\theta^T x_i)) \\
&= \sum_i \left(-\frac{y_i}{\sigma(\theta^T x_i)}\frac{\partial}{\partial\theta}\sigma(\theta^T x_i) + \frac{1 - y_i}{1 - \sigma(\theta^T x_i)}\frac{\partial}{\partial\theta}\sigma(\theta^T x_i)\right) \\
&= \sum_i \left(\frac{1 - y_i}{1 - \sigma(\theta^T x_i)} - \frac{y_i}{\sigma(\theta^T x_i)}\right)\frac{\partial}{\partial\theta}\sigma(\theta^T x_i) \\
&= \sum_i \left(\frac{1 - y_i}{1 - \sigma(\theta^T x_i)} - \frac{y_i}{\sigma(\theta^T x_i)}\right)\sigma(\theta^T x_i)(1 - \sigma(\theta^T x_i))\, x_i \\
&= \sum_i \left[(1 - y_i)\sigma(\theta^T x_i) - y_i(1 - \sigma(\theta^T x_i))\right] x_i \\
&= \sum_i \left[\sigma(\theta^T x_i) - y_i\sigma(\theta^T x_i) - y_i + y_i\sigma(\theta^T x_i)\right] x_i \\
&= \sum_i \left(\sigma(\theta^T x_i) - y_i\right) x_i
\end{align*}

Because the derivative of a sum is the sum of derivatives, the gradient is simply
the sum of this term over all training data points.
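The closed-form gradient $\sum_i(\sigma(\theta^T x_i) - y_i)x_i$ can be checked numerically against a finite-difference approximation of the loss. A minimal numpy sketch; the data and dimensions are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    # L(theta) = sum_i -y_i log(sigma_i) - (1 - y_i) log(1 - sigma_i)
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(theta, X, y):
    # sum_i (sigma(theta^T x_i) - y_i) x_i, written as one matrix product
    return X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# central finite differences along each coordinate
eps = 1e-6
fd = np.array([(loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(grad(theta, X, y), fd, atol=1e-4)
```

Writing the sum as `X.T @ (sigmoid(X @ theta) - y)` is the vectorized form of the per-sample sum derived above.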
(b) (10 points) Please compute the Hessian matrix and answer whether the problem is convex.



\begin{align*}
\frac{\partial^2 L(\theta)}{\partial\theta^2}
&= \frac{\partial}{\partial\theta}\left(\frac{\partial L(\theta)}{\partial\theta}\right) \\
&= \frac{\partial}{\partial\theta}\sum_i \left(\sigma(\theta^T x_i) - y_i\right) x_i \\
&= \sum_i \frac{\partial}{\partial\theta}\left(\sigma(\theta^T x_i) - y_i\right) x_i \\
&= \sum_i x_i \left(\frac{\partial}{\partial\theta}\sigma(\theta^T x_i)\right)^T \\
&= \sum_i \sigma(\theta^T x_i)(1 - \sigma(\theta^T x_i))\, x_i x_i^T
\end{align*}

Given that each term satisfies $x_i x_i^T \succeq 0$ and $\sigma(\theta^T x_i)(1 - \sigma(\theta^T x_i)) > 0$, we have $\frac{\partial^2 L(\theta)}{\partial\theta^2} \succeq 0$, so the problem is convex.
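The positive-semidefiniteness claim is easy to verify numerically: the Hessian equals $X^T W X$ with $W = \operatorname{diag}(\sigma_i(1 - \sigma_i))$, so its eigenvalues should all be nonnegative. A small sketch with made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(theta, X):
    # sum_i sigma_i (1 - sigma_i) x_i x_i^T  ==  X^T diag(w) X
    s = sigmoid(X @ theta)
    w = s * (1 - s)
    return X.T @ (w[:, None] * X)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
theta = rng.normal(size=4)

H = hessian(theta, X)
assert np.allclose(H, H.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)  # PSD, hence the loss is convex
```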
(c) (7 points) Is the problem strongly convex? If not, what method would you use to
make it strongly convex?
The problem is not strongly convex. To make it strongly convex, an L2 regularization
term can be added.
(d) (7 points) What is the computational complexity of the gradient method?
To compute the gradient, we need to compute
i. $\hat{y}_i = \sigma(\theta^T x_i)$, $i = 1, \cdots, n$, which takes an inner product of two length-$p$ vectors: $O(p)$ for each $i$ and $O(np)$ for all $i$.
ii. $r_i = \hat{y}_i - y_i$, which is $O(1)$ for each $i$ and $O(n)$ for all $i$.
iii. $\frac{\partial L(\theta)}{\partial\theta} = \sum_i r_i x_i$, which sums $n$ length-$p$ vectors: $O(np)$.
iv. The gradient update $\theta^{(k+1)} = \theta^{(k)} - c\,\frac{\partial L(\theta)}{\partial\theta}$, which takes $O(p)$ once the gradient is known.
The total complexity is $O(np)$.
(e) Suppose that we would like to add an L1 penalty to the logistic regression as
$$L(\theta) = \sum_i -y_i \log(\sigma(\theta^T x_i)) - (1 - y_i) \log(1 - \sigma(\theta^T x_i)) + \lambda\|\theta\|_1$$
Please write down the proximal gradient algorithm to solve this problem.
The update is given by $\theta^{(k+1)} = S_{c\lambda}\!\left(\theta^{(k)} - c\,\frac{\partial L(\theta)}{\partial\theta}\right)$, where $S_\lambda(x)$ is the soft-thresholding operator and $c$ is the step size.
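A minimal proximal-gradient (ISTA) sketch for the L1-penalized loss. The step size `c`, penalty `lam`, iteration count, and synthetic data below are illustrative assumptions, not values from the exam:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(x, t):
    # S_t(x) = sign(x) * max(|x| - t, 0)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_grad_logistic(X, y, lam, c, iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (sigmoid(X @ theta) - y)              # gradient of the smooth part
        theta = soft_threshold(theta - c * g, c * lam)  # prox of c*lam*||.||_1
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
true_theta = np.array([2.0, 0.0, 0.0, -2.0, 0.0])
y = (sigmoid(X @ true_theta) > 0.5).astype(float)
theta_hat = prox_grad_logistic(X, y, lam=5.0, c=0.01)
assert np.all(np.isfinite(theta_hat))
```

Each iteration takes a gradient step on the smooth loss and then applies soft thresholding at level $c\lambda$, which is exactly the prox of the scaled L1 term.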

2. Quadratic Optimization

(a) First, we need to compute the Hessian.

i. Compute Hessian (3 points): The gradient of $f(x) = \frac{1}{2}x^\top Qx + b^\top x + c$ is
$$\nabla f(x) = Qx + b$$
and the Hessian is $\nabla^2 f(x) = Q$.

ii. Condition for a unique solution and strong convexity (3 points): $\nabla^2 f(x) \succ 0$, which implies $Q \succ 0$, i.e., $Q$ is a positive definite matrix (all eigenvalues are positive).
(b) Please derive a closed-form solution for Problem (1). What are the space and time complexity?



i. Closed-form equations (2pts): The closed-form solution solves
$$\nabla f(x) = Qx + b = 0,$$
i.e., $Qx = -b$ or $x = -Q^{-1}b$.
ii. (2pts) Space complexity: we need to store $Q$ and $b$, which is $O(p^2)$.
iii. (2pts) Time complexity: computing $Q^{-1}$ is of $O(p^3)$ complexity.
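A sketch of the closed form with an arbitrary positive definite Q (illustrative values). A linear solve is preferred over forming $Q^{-1}$ explicitly; both cost $O(p^3)$, but the solve is numerically better behaved:

```python
import numpy as np

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # positive definite (illustrative)
b = np.array([1.0, 2.0])

# x* solves Qx = -b; avoid forming Q^{-1} explicitly
x_star = np.linalg.solve(Q, -b)

# the gradient Qx + b vanishes at the minimizer
assert np.allclose(Q @ x_star + b, 0.0)
```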
(c) Gradient descent algorithm
i. (5pts) The update is given by
\begin{align*}
x^{(k+1)} &= x^{(k)} - \alpha\nabla f(x^{(k)}) \\
&= x^{(k)} - \alpha(Qx^{(k)} + b)
\end{align*}
ii. (3pts) The time complexity is $O(p^2)$ per iteration, where the most expensive step is computing $Qx^{(k)}$.
iii. (2pts) The space complexity is dominated by storing the matrix $Q$, which is still $O(p^2)$.
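The update above can be sketched in a few lines; Q, b, the step size, and the iteration count here are illustrative:

```python
import numpy as np

def gd_quadratic(Q, b, alpha, iters=2000):
    x = np.zeros(len(b))
    for _ in range(iters):
        x = x - alpha * (Q @ x + b)   # O(p^2) per step, dominated by Q @ x
    return x

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_gd = gd_quadratic(Q, b, alpha=0.1)   # alpha chosen below 2 / lambda_max(Q)
assert np.allclose(x_gd, np.linalg.solve(Q, -b), atol=1e-8)
```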
(d) (7 pts) Simple linear regression: the problem is to optimize
$$\sum_{i=1}^{3}(y_i - x_i\theta_1 - \theta_0)^2$$
It can be written in matrix form as
$$\|y - X\theta\|^2$$
where $X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 4 \end{pmatrix}$ and the optimization variable is the 2D vector $\theta = \begin{pmatrix}\theta_0 \\ \theta_1\end{pmatrix}$. (2pts)
The quadratic program is given by
$$\min_\theta\ \theta^\top X^\top X\theta - 2\theta^\top X^\top y + y^\top y.$$
Therefore, $Q = 2X^\top X$, $b = -2X^\top y$, and $c = y^\top y$ (note the minus sign in $b$, so that the problem matches the form $\frac{1}{2}\theta^\top Q\theta + b^\top\theta + c$, whose minimizer is $-Q^{-1}b$).
Here $Q = 2X^\top X = 2\begin{pmatrix}1 & 1 & 1 \\ 1 & 2 & 4\end{pmatrix}\begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 4\end{pmatrix} = 2\begin{pmatrix}3 & 7 \\ 7 & 21\end{pmatrix} = \begin{pmatrix}6 & 14 \\ 14 & 42\end{pmatrix}$. (3pts)
$b = -2X^\top y = -2\begin{pmatrix}1 & 1 & 1 \\ 1 & 2 & 4\end{pmatrix}\begin{pmatrix}0 \\ 1 \\ 2\end{pmatrix} = -2\begin{pmatrix}3 \\ 10\end{pmatrix} = \begin{pmatrix}-6 \\ -20\end{pmatrix}$. (3pts)
$c = \begin{pmatrix}0 & 1 & 2\end{pmatrix}\begin{pmatrix}0 \\ 1 \\ 2\end{pmatrix} = 5$. (1 pt)
THE SOLUTION IS NOT UNIQUE. As long as your $Q$, $b$, $c$ are proportional to these within a constant factor, the solution is still correct. For example,
$$Q = \begin{pmatrix}3 & 7 \\ 7 & 21\end{pmatrix}, \quad b = -\begin{pmatrix}3 \\ 10\end{pmatrix}, \quad c = 2.5$$
is also correct. Finally, $c$ does not actually change the minimizer, so only
1 pt is given for $c$.
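These numbers can be checked mechanically. Note the sign convention being verified here: matching $f(\theta) = \frac{1}{2}\theta^\top Q\theta + b^\top\theta + c$ (minimizer $-Q^{-1}b$) to $\|y - X\theta\|^2$ requires $b = -2X^\top y$:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 4.0]])
y = np.array([0.0, 1.0, 2.0])

Q = 2 * X.T @ X
b = -2 * X.T @ y
c = y @ y

assert np.allclose(Q, [[6.0, 14.0], [14.0, 42.0]])
assert np.allclose(b, [-6.0, -20.0])
assert c == 5.0

# the QP minimizer -Q^{-1} b matches the least-squares solution
theta_qp = np.linalg.solve(Q, -b)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(theta_qp, theta_ls)
```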

Page 4 of 11 – Foundations of Information Systems Engineering (IEE 605)


(e) (7 pts)
Learning Rate 1 is the largest, given that it is the only one for which the algorithm diverges.
Learning Rate 2 (the same as Learning Rate 4) is the smallest learning rate.
Learning Rate 5 is the proper learning rate and reaches the best solution. (The one
to choose.)
Learning Rate 3 is also fairly large, given that it converges faster than 5 in the beginning.
We rank the learning rates as follows:
Learning Rate 1 > Learning Rate 3 > Learning Rate 5 > Learning Rate 2
(f) (Bonus 5 pts)
The update equation is given by
\begin{align*}
x^{(k+1)} &= x^{(k)} - \alpha\nabla f(x^{(k)}) = x^{(k)} - \alpha(Qx^{(k)} + b) \\
&= (I - \alpha Q)x^{(k)} - \alpha b
\end{align*}
Here, we use the notation $A \prec B$ if all eigenvalues of $A - B$ are smaller than 0, i.e., $B - A$ is positive definite.
The condition for this iteration to converge is $-I \prec (I - \alpha Q) \prec I$, which means all eigenvalues of $(I - \alpha Q)$ lie in $(-1, 1)$. Given $Q \succ 0$ and $\alpha > 0$, $(I - \alpha Q) \prec I$ always holds, i.e., the eigenvalues of $I - \alpha Q$ are always smaller than 1.
Therefore, we only require $-I \prec (I - \alpha Q)$, which implies $\alpha Q \prec 2I$. This means $\alpha\lambda(Q) < 2$ for every eigenvalue $\lambda(Q)$ of $Q$. In other words, we need $\alpha < \frac{2}{\lambda_{\max}(Q)}$,
where $\lambda_{\max}(Q)$ is the largest eigenvalue of $Q$.
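The threshold $2/\lambda_{\max}(Q)$ can be observed empirically. A diagonal Q makes it explicit; the matrix, vector, and step sizes here are illustrative:

```python
import numpy as np

def gd_final_norm(Q, b, alpha, iters=200):
    x = np.ones(len(b))
    for _ in range(iters):
        # update in the form x <- (I - alpha Q) x - alpha b
        x = (np.eye(len(b)) - alpha * Q) @ x - alpha * b
    return np.linalg.norm(x)

Q = np.diag([3.0, 1.0])   # lambda_max = 3, so the threshold is 2/3
b = np.array([1.0, 1.0])

assert gd_final_norm(Q, b, alpha=0.5) < 10.0   # alpha < 2/3: iterates stay bounded
assert gd_final_norm(Q, b, alpha=0.8) > 1e6    # alpha > 2/3: iterates diverge
```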
(g) (Bonus 7 pts)
Formulation: The step size is chosen by exact line search,
$$\alpha^{(k)} = \arg\min_\alpha f(x^{(k)} - \alpha\nabla f(x^{(k)})).$$
The solution satisfies
$$\frac{d f(x^{(k)} - \alpha\nabla f(x^{(k)}))}{d\alpha} = -\nabla f(x^{(k)})^\top \nabla f(x^{(k)} - \alpha\nabla f(x^{(k)})) = 0.$$
We need to plug in the gradient:
\begin{align*}
\nabla f(x^{(k)} - \alpha\nabla f(x^{(k)})) &= Q(x^{(k)} - \alpha\nabla f(x^{(k)})) + b \\
&= Qx^{(k)} + b - \alpha Q\nabla f(x^{(k)}) \\
&= \nabla f(x^{(k)}) - \alpha Q\nabla f(x^{(k)}) \\
&= (I - \alpha Q)\nabla f(x^{(k)})
\end{align*}
Therefore,
$$0 = \nabla f(x^{(k)})^\top \nabla f(x^{(k)} - \alpha\nabla f(x^{(k)})) = \nabla f(x^{(k)})^\top (I - \alpha Q)\nabla f(x^{(k)}),$$
which gives
$$\nabla f(x^{(k)})^\top \nabla f(x^{(k)}) = \alpha\,\nabla f(x^{(k)})^\top Q\nabla f(x^{(k)})$$



Therefore, the solution for $\alpha$ is given by
$$\alpha = \frac{\nabla f(x^{(k)})^\top \nabla f(x^{(k)})}{\nabla f(x^{(k)})^\top Q\nabla f(x^{(k)})},$$
where $\nabla f(x^{(k)}) = Qx^{(k)} + b$.
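A sketch of steepest descent with this exact line search, on an illustrative quadratic:

```python
import numpy as np

def exact_line_search_gd(Q, b, iters=100):
    x = np.zeros(len(b))
    for _ in range(iters):
        g = Q @ x + b                      # gradient at the current iterate
        if np.linalg.norm(g) < 1e-12:
            break
        alpha = (g @ g) / (g @ Q @ g)      # optimal step for a quadratic
        x = x - alpha * g
    return x

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_opt = exact_line_search_gd(Q, b)
assert np.allclose(x_opt, np.linalg.solve(Q, -b), atol=1e-8)
```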


3. Solution
(a) Gradient of the Distance Function
Given the distance function dist(x, C) = miny∈C ∥y − x∥2 , where ∥y − x∥2 is the
Euclidean distance between x and y, the projection of x onto C, denoted as PC (x),
is the point in C that is closest to x. Mathematically, PC (x) = arg miny∈C ∥y − x∥2 .
The gradient of the distance function can be derived from the definition of the pro-
jection. The distance function can be rewritten using the projection as dist(x, C) =
∥PC (x) − x∥2 .
The Euclidean norm of a vector $v$ is $\|v\|_2 = \sqrt{\sum_i v_i^2}$, and its gradient with respect to $v$ is $\nabla_v\|v\|_2 = \frac{v}{\|v\|_2}$.
To find $\nabla_x \operatorname{dist}(x, C)$, we use the chain rule. The difficulty here is that $P_C(x)$ depends
on $x$, but for the distance function we may treat $P_C(x)$ as a constant with respect to $x$,
because to first order the movement of $P_C(x)$ within $C$ does not change the distance as we
infinitesimally vary $x$ around its current value. Thus, the change in $\operatorname{dist}(x, C)$ with respect
to $x$ is due solely to the direct dependence of the norm on $x$, not to how $P_C(x)$ might move
within $C$.
Taking the gradient of $\operatorname{dist}(x, C)$ with respect to $x$, we get:
$$\nabla \operatorname{dist}(x, C) = \nabla\left(\|P_C(x) - x\|_2\right) = \frac{x - P_C(x)}{\|P_C(x) - x\|_2}.$$

(b) Also recall the subgradient rule: if $f(x) = \max_{i=1,\dots,m} f_i(x)$, then
$$\partial f(x) = \operatorname{conv}\left(\bigcup_{i:\, f_i(x) = f(x)} \partial f_i(x)\right)$$
So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$.

Put these two facts together for the intersection-of-sets problem, with $f_i(x) = \operatorname{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2},$$
then $g_i \in \partial f(x)$.
(c) Now apply the subgradient method with a certain step size $t_k$. At iteration $k$, with $C_i$
farthest from $x^{(k-1)}$, we perform the update
$$x^{(k)} = x^{(k-1)} - t_k\,\frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\left\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\right\|_2}$$
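A small sketch of this method for two balls in the plane (the sets, radii, and starting point are made up). With the Polyak step size $t_k = f(x^{(k-1)})$, which is valid here since $f^\star = 0$ when the intersection is nonempty, the update reduces to projecting onto the farthest set:

```python
import numpy as np

def proj_ball(x, center, r):
    # Euclidean projection onto the ball {z : ||z - center||_2 <= r}
    d = x - center
    n = np.linalg.norm(d)
    return x.copy() if n <= r else center + r * d / n

centers = [np.array([0.0, 0.0]), np.array([1.5, 0.0])]
radius = 1.0                     # the two balls overlap, so f* = 0

x = np.array([5.0, 4.0])
for _ in range(1000):
    dists = [np.linalg.norm(x - proj_ball(x, c, radius)) for c in centers]
    i = int(np.argmax(dists))    # the farthest set supplies the subgradient
    if dists[i] < 1e-9:
        break                    # x is (numerically) in every set
    # Polyak step t_k = dist(x, C_i) turns the update into x <- P_{C_i}(x)
    x = proj_ball(x, centers[i], radius)

assert max(np.linalg.norm(x - proj_ball(x, c, radius)) for c in centers) < 1e-6
```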

4. The given problem is:
$$\min_x\ \frac{1}{2}x^T Qx + b^T x \quad \text{subject to} \quad l \le x \le u$$



(a) Solution
To minimize over $x_i$ while keeping all other variables $x_j$ ($j \ne i$) fixed, we first consider the objective function's derivative with respect to $x_i$. The objective function is quadratic, and its derivative with respect to $x_i$ can be expressed as:
$$\frac{\partial}{\partial x_i}\left(\frac{1}{2}x^T Qx + b^T x\right) = \frac{\partial}{\partial x_i}\left(\frac{1}{2}\sum_{k=1}^{n}\sum_{j=1}^{n} x_k Q_{kj} x_j + \sum_{j=1}^{n} b_j x_j\right).$$
Since $Q$ is symmetric ($Q_{kj} = Q_{jk}$), collecting the terms involving $x_i$ gives:
$$\sum_{j=1}^{n} Q_{ij} x_j + b_i = Q_{ii} x_i + \sum_{j \ne i} Q_{ij} x_j + b_i$$
When solving for $x_i$, we want to set this derivative equal to zero to find the minimum.
Thus, we solve
$$Q_{ii} x_i + \sum_{j \ne i} Q_{ij} x_j + b_i = 0$$
Rearranging terms to solve for $x_i$:
$$x_i = -\frac{\sum_{j \ne i} Q_{ij} x_j + b_i}{Q_{ii}}.$$
However, we must also consider the box constraints $l \le x \le u$. This is where the
truncation (projection) operator $T_{[l_i, u_i]}$ comes into play, ensuring that $x_i$ stays within
its bounds:
$$x_i = T_{[l_i, u_i]}\left(-\frac{\sum_{j \ne i} Q_{ij} x_j + b_i}{Q_{ii}}\right).$$

(b) Deriving the Coordinate Descent Algorithm

i. Initialization: Start with an initial guess for $x$, denoted $x^{(0)}$, ensuring $l \le x^{(0)} \le u$.
ii. Iteration: For each iteration $k$, update each coordinate $x_i$ in sequence while keeping all other coordinates fixed. Specifically, for each $i$ from 1 to $n$, update $x_i$ using the formula derived in part (a):
$$x_i^{(k+1)} = T_{[l_i, u_i]}\left(-\frac{b_i + \sum_{j \ne i} Q_{ij} x_j}{Q_{ii}}\right),$$
where each $x_j$ takes its most recently updated value (i.e., $x_j^{(k+1)}$ for $j < i$ and $x_j^{(k)}$ for $j > i$).
iii. Convergence Check: After each full cycle of updates (i.e., after updating all $n$ coordinates), check for convergence. This can be done by evaluating whether the change in the objective function or the change in the variable values is below a predetermined threshold. If the convergence criteria are met, stop the algorithm.
iv. Repeat: If the convergence criteria are not met, repeat step ii with the updated
values of $x$.
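The cyclic updates above can be sketched as follows. Q, b, and the box bounds are illustrative; the solution of this particular instance happens to be interior, so it matches the unconstrained minimizer:

```python
import numpy as np

def box_qp_cd(Q, b, l, u, sweeps=200):
    x = np.clip(np.zeros(len(b)), l, u)
    for _ in range(sweeps):
        for i in range(len(b)):
            # b_i + sum_{j != i} Q_ij x_j, using the latest values of x_j
            r = b[i] + Q[i] @ x - Q[i, i] * x[i]
            x[i] = np.clip(-r / Q[i, i], l[i], u[i])   # truncation T_[l_i, u_i]
    return x

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([-1.0, -2.0])
l = np.array([0.0, 0.0])
u = np.array([1.0, 1.0])
x_cd = box_qp_cd(Q, b, l, u)
assert np.allclose(x_cd, np.linalg.solve(Q, -b), atol=1e-8)
```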



5. The proximal operator (proximal map)

(a) To keep the general $\lambda$ (rather than setting $\lambda = 1$ for convenience) and stay consistent with the problem notation, let us derive the proximal map of the L2 norm, $h(z) = \|z\|_2$. The optimization problem to solve is:
$$\operatorname{prox}_{\lambda h}(v) = \operatorname{argmin}_{x \in \mathbb{R}^n}\left\{\lambda\|x\|_2 + \frac{1}{2}\|x - v\|_2^2\right\}.$$
i. Step 1: Optimization Problem Formulation. Given that $\lambda > 0$, the problem can be written as:
$$\operatorname{argmin}_{x \in \mathbb{R}^n}\left\{\lambda\|x\|_2 + \frac{1}{2}\|x - v\|_2^2\right\}.$$
ii. Step 2: Optimality Condition. The first-order optimality condition for this problem, considering that $x$ and $v$ are vectors in $\mathbb{R}^n$ and assuming $x \ne 0$, is
$$\lambda\frac{x}{\|x\|_2} + x - v = 0.$$

Therefore, we know that for $x \ne 0$, the vector $x$ must be parallel to $v$. This implies that there exists a scalar $\sigma$ such that $x = \sigma v$. Substituting $x = \sigma v$ into the optimality condition, the equation becomes:
$$\lambda\frac{\sigma v}{|\sigma|\|v\|_2} + \sigma v - v = 0$$
Simplifying (taking $\sigma > 0$, since the minimizer cannot point opposite to $v$), we find:
$$\lambda\frac{v}{\|v\|_2} + (\sigma - 1)v = 0$$
iii. Step 3: Solving for $\sigma$. Rearranging for $\sigma$, we get:
$$\sigma v = v - \lambda\frac{v}{\|v\|_2}, \quad \text{i.e.,} \quad \sigma = 1 - \frac{\lambda}{\|v\|_2}$$

iv. Step 4: Applying the Constraint. The value of $\sigma$ indicates how much $v$ is scaled to get $x$. However, we must ensure $x$ is not scaled negatively, which corresponds to the condition $\sigma \ge 0$, or:
$$1 - \frac{\lambda}{\|v\|_2} \ge 0 \;\Rightarrow\; \|v\|_2 \ge \lambda.$$
Hence, the proximal map is:
$$\operatorname{prox}_{\lambda h}(v) = \begin{cases}\left(1 - \dfrac{\lambda}{\|v\|_2}\right)v & \text{if } \|v\|_2 > \lambda, \\[4pt] 0 & \text{otherwise.}\end{cases}$$
This formula scales $v$ towards the origin by a factor dependent on $\lambda$
and $\|v\|_2$, and sets $x = 0$ whenever $\|v\|_2 \le \lambda$, which is consistent with the notion of a proximal operator applying a "soft
threshold".
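This block soft-thresholding formula is short to implement and can be verified against the optimality condition $\lambda x/\|x\|_2 + x - v = 0$ (example values are illustrative):

```python
import numpy as np

def prox_l2(v, lam):
    # prox of lam * ||.||_2: shrink v toward 0, or return 0 if ||v||_2 <= lam
    n = np.linalg.norm(v)
    if n <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / n) * v

v = np.array([3.0, 4.0])                                  # ||v||_2 = 5
x = prox_l2(v, 1.0)
assert np.allclose(x, [2.4, 3.2])                         # (1 - 1/5) * v
assert np.allclose(x + 1.0 * x / np.linalg.norm(x), v)    # optimality condition holds
assert np.allclose(prox_l2(v, 6.0), 0.0)                  # ||v||_2 <= lambda gives 0
```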



(b) Defining:
$$\hat{x} = \operatorname{prox}_{\lambda_1\|\cdot\|_1 + \lambda_2\|\cdot\|_2}(b) = \arg\min_x\left\{\frac{1}{2}\|x - b\|_2^2 + \lambda_1\|x\|_1 + \lambda_2\|x\|_2\right\}$$
The optimality condition implies:
$$0 \in \hat{x} - b + \lambda_1\partial\|\hat{x}\|_1 + \lambda_2\partial\|\hat{x}\|_2$$
where
$$u_i \in \partial\|\cdot\|_1(\hat{x})_i = \begin{cases}[-1, 1] & \text{if } \hat{x}_i = 0 \\ \operatorname{sgn}(\hat{x}_i) & \text{if } \hat{x}_i \ne 0\end{cases}, \qquad v \in \partial\|\cdot\|_2(\hat{x}) = \begin{cases}\{z \mid \|z\|_2 \le 1\} & \text{if } \hat{x} = 0 \\ \dfrac{\hat{x}}{\|\hat{x}\|_2} & \text{if } \hat{x} \ne 0\end{cases}$$
Notes:
- The optimization problem tries to minimize the norms of $\hat{x}$ while keeping it close
to $b$.
- Any element of $\hat{x}$ which is not zero has a sign identical to the corresponding element of $b$; namely, $\forall i \in \{j \mid \hat{x}_j \ne 0\}$, $\operatorname{sgn}(\hat{x}_i) = \operatorname{sgn}(b_i)$. The reason is
simple: if $\operatorname{sgn}(\hat{x}_i) \ne \operatorname{sgn}(b_i)$, then setting $\hat{x}_i \leftarrow -\hat{x}_i$ would decrease the distance
to $b$ while keeping the norms the same, which contradicts $\hat{x}$ being optimal.
i. Case $\hat{x} = 0$. In this case the optimality condition gives:
$$b = \lambda_1 u + \lambda_2 v \iff b - \lambda_1 u = \lambda_2 v$$
Since $u_i \in [-1, 1]$ and $\|v\|_2 \le 1$, one can see that as long as $\|b - \lambda_1 u\|_2 \le \lambda_2$ for some admissible $u$, one
can set $\hat{x} = 0$ while the conditions hold. Finding the best case (with regard to $b$) is simple, since it can be done element-wise between $b$ and
$u$: the minimum of $\|b - \lambda_1 u\|_2$ is attained with $u_i = \operatorname{sign}(b_i)$ where $|b_i| > \lambda_1$ (and $u_i = b_i/\lambda_1$ otherwise), which yields:
$$\hat{x} = 0 \iff \|b - \lambda_1 u\|_2 \le \lambda_2 \iff \|S_{\lambda_1}(b)\|_2 \le \lambda_2$$
where $S_\lambda(\cdot)$ is the soft-thresholding function with parameter $\lambda$.

ii. Case $\hat{x} \ne 0$. In this case the optimality condition gives:
$$0 = \hat{x} - b + \lambda_1 u + \lambda_2\frac{\hat{x}}{\|\hat{x}\|_2} \iff b - \lambda_1 u = \left(1 + \frac{\lambda_2}{\|\hat{x}\|_2}\right)\hat{x}$$
For elements where $\hat{x}_i = 0$, this means $|b_i| \le \lambda_1$; namely, $\forall i \in \{j \mid \hat{x}_j = 0\}$, $b_i - \lambda_1 u_i = 0 \iff |b_i| \le \lambda_1$. This comes from the fact that $u_i \in [-1, 1]$.
This makes the left-hand side of the equation a soft-thresholding operation:
as written in the notes, under the property $\forall i$, $\operatorname{sign}(\hat{x}_i) = \operatorname{sign}(b_i)$, the
above becomes:
$$S_{\lambda_1}(b) = \left(1 + \frac{\lambda_2}{\|\hat{x}\|_2}\right)\hat{x}$$
Taking the L2 norm of both sides of the equation yields:
$$\|S_{\lambda_1}(b)\|_2 = \left(1 + \frac{\lambda_2}{\|\hat{x}\|_2}\right)\|\hat{x}\|_2 \;\Rightarrow\; \|\hat{x}\|_2 = \|S_{\lambda_1}(b)\|_2 - \lambda_2$$



Plugging this into the above yields:
$$\hat{x} = \frac{S_{\lambda_1}(b)}{1 + \frac{\lambda_2}{\|S_{\lambda_1}(b)\|_2 - \lambda_2}} = \left(1 - \frac{\lambda_2}{\|S_{\lambda_1}(b)\|_2}\right)S_{\lambda_1}(b)$$
Remembering that in this case it is guaranteed that $\lambda_2 < \|S_{\lambda_1}(b)\|_2$, the
term in the parentheses is positive, as needed.
Summary: The solution is given by:
\begin{align*}
\hat{x} = \operatorname{prox}_{\lambda_1\|\cdot\|_1 + \lambda_2\|\cdot\|_2}(b)
&= \begin{cases}0 & \text{if } \|S_{\lambda_1}(b)\|_2 \le \lambda_2 \\ \left(1 - \dfrac{\lambda_2}{\|S_{\lambda_1}(b)\|_2}\right)S_{\lambda_1}(b) & \text{if } \|S_{\lambda_1}(b)\|_2 > \lambda_2\end{cases} \\
&= \left(1 - \frac{\lambda_2}{\max\left(\|S_{\lambda_1}(b)\|_2, \lambda_2\right)}\right)S_{\lambda_1}(b) \\
&= \operatorname{prox}_{\lambda_2\|\cdot\|_2}\left(\operatorname{prox}_{\lambda_1\|\cdot\|_1}(b)\right)
\end{align*}
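The composition result can be checked numerically: since the objective is convex, the formula's output must achieve an objective value no larger than nearby points. The example values below are illustrative:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l2(v, lam):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= lam else (1.0 - lam / n) * v

def prox_l1_l2(b, lam1, lam2):
    # prox of lam1*||.||_1 + lam2*||.||_2 composes the two individual proxes
    return prox_l2(soft_threshold(b, lam1), lam2)

b = np.array([3.0, -4.0, 0.5])
lam1, lam2 = 1.0, 1.0
xh = prox_l1_l2(b, lam1, lam2)

def F(x):
    return 0.5 * np.sum((x - b) ** 2) + lam1 * np.sum(np.abs(x)) + lam2 * np.linalg.norm(x)

# the global minimizer of a convex objective beats random perturbations
rng = np.random.default_rng(4)
for _ in range(100):
    assert F(xh) <= F(xh + 0.01 * rng.normal(size=3)) + 1e-12

# and the large-lambda2 branch returns exactly 0
assert np.allclose(prox_l1_l2(b, 1.0, 4.0), 0.0)
```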

6. The proximal minimization algorithm is a method used to solve optimization problems,
particularly useful for non-differentiable or complex functions. The algorithm updates
the solution iteratively by $x^{(k+1)} = \operatorname{prox}_{h,t}(x^{(k)})$, where $\operatorname{prox}_{h,t}(v)$ is the proximal operator defined for a function $h$ and a parameter $t > 0$:
$$\operatorname{prox}_{h,t}(v) = \arg\min_x\left\{h(x) + \frac{1}{2t}\|x - v\|^2\right\}$$
Given $h(x) = \frac{1}{2}x^T Ax - b^T x$, where $A \in S^n$ (symmetric matrices), we want to show that
applying the proximal minimization algorithm to this $h(x)$ results in the iterative refinement
algorithm:
$$x^{(k+1)} = x^{(k)} + (A + \epsilon I)^{-1}\left(b - Ax^{(k)}\right)$$
where $\epsilon > 0$ is a constant. To do this, we first compute the proximal operator for $h(x)$.
where ϵ > 0 is a constant. To do this, we first compute the proximal operator for h(x).
Step 1: Compute the proximal operator for $h(x)$. The objective of the proximal operator
for $h(x)$ is:
$$\arg\min_x\left\{\frac{1}{2}x^T Ax - b^T x + \frac{1}{2t}\left\|x - x^{(k)}\right\|^2\right\}$$
Expanding the quadratic term, we get:
\begin{align*}
&\arg\min_x\left\{\frac{1}{2}x^T Ax - b^T x + \frac{1}{2t}\left(x - x^{(k)}\right)^T\left(x - x^{(k)}\right)\right\} \\
&= \arg\min_x\left\{\frac{1}{2}x^T Ax - b^T x + \frac{1}{2t}x^T x - \frac{1}{t}x^{(k)T}x + \frac{1}{2t}x^{(k)T}x^{(k)}\right\}
\end{align*}
Step 2: Differentiate and set to zero to solve for $x$. Taking the derivative with respect to
$x$ and setting it to zero gives:
$$Ax - b + \frac{1}{t}x - \frac{1}{t}x^{(k)} = 0$$
Rearrange and solve for $x$:
$$\left(A + \frac{1}{t}I\right)x = b + \frac{1}{t}x^{(k)}$$
$$x = \left(A + \frac{1}{t}I\right)^{-1}\left(b + \frac{1}{t}x^{(k)}\right)$$



Step 3: Show equivalence to the iterative refinement algorithm. Let $\epsilon = \frac{1}{t}$; then we get:
$$x = (A + \epsilon I)^{-1}\left(b + \epsilon x^{(k)}\right)$$
To see that this matches the iterative refinement form, write $b + \epsilon x^{(k)} = (A + \epsilon I)x^{(k)} + b - Ax^{(k)}$, so that
$$x^{(k+1)} = (A + \epsilon I)^{-1}\left((A + \epsilon I)x^{(k)} + b - Ax^{(k)}\right) = x^{(k)} + (A + \epsilon I)^{-1}\left(b - Ax^{(k)}\right)$$
This demonstrates the equivalence between the proximal minimization algorithm applied to $h(x)$ and
the iterative refinement algorithm.
Comment on the usefulness for rank-deficient $A$: The iterative refinement algorithm, and
hence the proximal minimization algorithm, is particularly useful for solving problems
where $A$ is rank-deficient (i.e., does not have full rank). In such cases, directly solving
$Ax = b$ may be problematic due to the lack of a unique solution or numerical instability.
The addition of $\epsilon I$ (where $I$ is the identity matrix) to $A$ in the term $(A + \epsilon I)^{-1}$ ensures
that the matrix to be inverted is always full rank and thus invertible. This regularization
technique stabilizes the solution process and allows a solution to be computed even
when $A$ is rank-deficient, by effectively adding a small amount to the diagonal elements
of $A$ and improving its condition number.
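A sketch with a deliberately rank-deficient A (the values are illustrative): even though A alone is singular, the iteration converges to a solution of $Ax = b$ when $b$ lies in the range of $A$:

```python
import numpy as np

def prox_min(A, b, eps, iters=500):
    x = np.zeros(len(b))
    M = A + eps * np.eye(len(b))                # always invertible for eps > 0
    for _ in range(iters):
        x = x + np.linalg.solve(M, b - A @ x)   # iterative refinement step
    return x

A = np.array([[2.0, 0.0],
              [0.0, 0.0]])     # rank-deficient: A alone is singular
b = np.array([4.0, 0.0])       # b is in the range of A
x = prox_min(A, b, eps=0.5)
assert np.allclose(A @ x, b, atol=1e-8)
assert np.allclose(x, [2.0, 0.0], atol=1e-8)
```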

——— End of Examination ———



