IEE 605
1. One note sheet (one page, double sided) is allowed. Calculators without internet capability are allowed.
2. Write your answer on this exam paper. DO NOT use your own paper. If you need more
space, write on the back side of this exam paper.
3. The test is worth 100 points. The assigned points for each question are given beside it.
There is an extra 10 points that you can earn.
4. When time is called, DO NOT enter any more answers on your exam. You must have
a photo ID to take an exam. Please place your student ID card on the desk. Failure to
present a photo ID for an exam results in an automatic 10% deduction. In addition, until
a valid photo ID is presented to the instructor, your exam score is “0”.
5. Cell phones, electronic readers, computers, smart watches, and other internet-capable de-
vices are not allowed. If your cell phone or other internet-capable device is seen out in any
way during an exam, you will be given a zero for the exam and reported for a violation of
ASU’s Academic Integrity Policy.
1. Logistic Regression
   The loss is
   \[
   L(\theta) = \sum_i -y_i \log\big(\sigma(\theta^T x_i)\big) - (1 - y_i)\log\big(1 - \sigma(\theta^T x_i)\big).
   \]
   Here, \(\sigma(x) = \frac{1}{1+\exp(-x)}\) and \(\frac{d}{dx}\sigma(x) = \sigma(x)\big(1 - \sigma(x)\big)\). Therefore,
   \[
   \begin{aligned}
   \frac{\partial L(\theta)}{\partial \theta}
   &= \frac{\partial}{\partial \theta}\left(\sum_i -y_i \log\big(\sigma(\theta^T x_i)\big) - (1 - y_i)\log\big(1 - \sigma(\theta^T x_i)\big)\right) \\
   &= \sum_i -y_i \frac{\partial}{\partial \theta}\log\big(\sigma(\theta^T x_i)\big) - (1 - y_i)\frac{\partial}{\partial \theta}\log\big(1 - \sigma(\theta^T x_i)\big) \\
   &= \sum_i -\frac{y_i}{\sigma(\theta^T x_i)}\frac{\partial}{\partial \theta}\sigma(\theta^T x_i) + \frac{1 - y_i}{1 - \sigma(\theta^T x_i)}\frac{\partial}{\partial \theta}\sigma(\theta^T x_i) \\
   &= \sum_i \left(\frac{1 - y_i}{1 - \sigma(\theta^T x_i)} - \frac{y_i}{\sigma(\theta^T x_i)}\right)\frac{\partial}{\partial \theta}\sigma(\theta^T x_i) \\
   &= \sum_i \left(\frac{1 - y_i}{1 - \sigma(\theta^T x_i)} - \frac{y_i}{\sigma(\theta^T x_i)}\right)\sigma(\theta^T x_i)\big(1 - \sigma(\theta^T x_i)\big)x_i \\
   &= \sum_i \Big((1 - y_i)\sigma(\theta^T x_i) - y_i\big(1 - \sigma(\theta^T x_i)\big)\Big)x_i \\
   &= \sum_i \Big(\sigma(\theta^T x_i) - y_i\sigma(\theta^T x_i) - y_i + y_i\sigma(\theta^T x_i)\Big)x_i \\
   &= \sum_i \big(\sigma(\theta^T x_i) - y_i\big)x_i.
   \end{aligned}
   \]
Because the derivative of a sum is the sum of the derivatives, the gradient with respect to θ is simply the sum of this term over all training data points.
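As a quick numerical sanity check (not part of the original solution), here is a minimal NumPy sketch that compares the derived gradient Σ_i (σ(θ^T x_i) − y_i) x_i against a finite-difference approximation of the loss; the random data, function names, and tolerance are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_grad(theta, X, y):
    # Gradient derived above: sum_i (sigma(theta^T x_i) - y_i) x_i
    return X.T @ (sigmoid(X @ theta) - y)

# Finite-difference check on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
theta = rng.normal(size=3)

eps = 1e-6
num = np.zeros_like(theta)
for j in range(theta.size):
    e = np.zeros_like(theta); e[j] = eps
    num[j] = (logistic_loss(theta + e, X, y) - logistic_loss(theta - e, X, y)) / (2 * eps)

print(np.allclose(num, logistic_grad(theta, X, y), atol=1e-5))  # True
```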
(b) (10 points) Please compute the Hessian matrix and answer whether the problem is convex.
    Differentiating the gradient once more gives
    \[
    \frac{\partial^2 L(\theta)}{\partial \theta^2} = \sum_i \sigma(\theta^T x_i)\big(1 - \sigma(\theta^T x_i)\big)x_i x_i^T.
    \]
    Given that for each term \(x_i x_i^T \succeq 0\) and \(\sigma(\theta^T x_i)(1 - \sigma(\theta^T x_i)) > 0\), we have \(\frac{\partial^2 L(\theta)}{\partial \theta^2} \succeq 0\), so the problem is convex.
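A minimal sketch of this Hessian, X^T diag(s_i(1 − s_i)) X, together with a numerical positive-semidefiniteness check on assumed random data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(theta, X):
    # Hessian derived above: X^T diag(s_i * (1 - s_i)) X with s_i = sigma(theta^T x_i)
    s = sigmoid(X @ theta)
    return X.T @ (X * (s * (1 - s))[:, None])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = rng.normal(size=3)
H = logistic_hessian(theta, X)
print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True: H is positive semidefinite
```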
(c) (7 points) Please answer whether the problem is strongly convex. If not, what method would you use to make it strongly convex?
    The problem is not strongly convex: the smallest eigenvalue of the Hessian can be arbitrarily close to zero (for example, when the x_i do not span R^p, or because σ(θ^T x_i)(1 − σ(θ^T x_i)) → 0 for large ∥θ∥). To make it strongly convex, an L2 regularization term such as (μ/2)∥θ∥_2^2 with μ > 0 can be added.
(d) (7 points) What is the computational complexity of the gradient method?
    To compute the gradient, we need to compute
    i. ŷ_i = σ(θ^T x_i), i = 1, …, n, which takes an inner product of two length-p vectors: O(p) for each i and O(np) for all i.
    ii. r_i = ŷ_i − y_i, which is O(1) for each i and O(n) for all i.
    iii. ∂L(θ)/∂θ = Σ_i r_i x_i, which sums n length-p vectors: O(np).
    iv. The gradient update θ^(k+1) = θ^(k) − c ∂L(θ)/∂θ, which takes O(p) once the gradient is known.
    The total complexity per iteration is O(np); see the sketch below.
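A minimal sketch of these four steps as a gradient-descent loop, with the per-step cost noted in comments; the step size c and iteration count are assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, c=0.1, iters=100):
    """Plain gradient descent for logistic regression; each iteration costs O(np)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        y_hat = sigmoid(X @ theta)      # step i:   n inner products, O(np)
        r = y_hat - y                   # step ii:  O(n)
        grad = X.T @ r                  # step iii: sum of n length-p vectors, O(np)
        theta = theta - c * grad        # step iv:  O(p)
    return theta
```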
(e) Suppose that we would like to add an L1 penalty to the logistic regression as
    \[
    L(\theta) = \sum_i -y_i \log\big(\sigma(\theta^T x_i)\big) - (1 - y_i)\log\big(1 - \sigma(\theta^T x_i)\big) + \lambda\|\theta\|_1.
    \]
    Please write down the proximal gradient algorithm to solve this problem.
    The update is θ^(k+1) = S_{cλ}(θ^(k) − c ∂L₀(θ^(k))/∂θ), where L₀ is the smooth (logistic) part of the loss, c is the step size, and S_t(·) is the elementwise soft-thresholding operator, S_t(x)_j = sgn(x_j) max(|x_j| − t, 0).
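A minimal sketch of this proximal gradient (ISTA-style) update for the L1-penalized logistic loss; the soft-thresholding helper, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(x, t):
    # S_t(x): elementwise soft-thresholding, prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient_l1(X, y, lam, c=0.1, iters=200):
    """Proximal gradient for L1-penalized logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y)        # gradient of the smooth part
        theta = soft_threshold(theta - c * grad, c * lam)
    return theta
```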
2. Quadratic Optimization
   \(\nabla f(x) = Qx + b\). Therefore, \(Q = 2X^\top X\), \(b = 2X^\top y\), and \(c = y^\top y\). Here,
   \[
   Q = 2X^\top X = 2\begin{bmatrix}1 & 1 & 1\\ 1 & 2 & 4\end{bmatrix}\begin{bmatrix}1 & 1\\ 1 & 2\\ 1 & 4\end{bmatrix}
     = 2\begin{bmatrix}3 & 7\\ 7 & 21\end{bmatrix} = \begin{bmatrix}6 & 14\\ 14 & 42\end{bmatrix}. \quad (3\ \text{pts})
   \]
   \[
   b = 2X^\top y = 2\begin{bmatrix}1 & 1 & 1\\ 1 & 2 & 4\end{bmatrix}\begin{bmatrix}0\\ 1\\ 2\end{bmatrix}
     = 2\begin{bmatrix}3\\ 10\end{bmatrix} = \begin{bmatrix}6\\ 20\end{bmatrix}. \quad (3\ \text{pts})
   \]
   \[
   c = y^\top y = \begin{bmatrix}0 & 1 & 2\end{bmatrix}\begin{bmatrix}0\\ 1\\ 2\end{bmatrix} = 5. \quad (1\ \text{pt})
   \]
THE SOLUTION IS NOT UNIQUE. As long as your Q, b, c are proportional to this solution by a constant, the solution is still correct. For example,
\[
Q = \begin{bmatrix}3 & 7\\ 7 & 21\end{bmatrix}, \quad b = \begin{bmatrix}3\\ 10\end{bmatrix}, \quad c = 2.5
\]
is also correct. Finally, c does not actually change the minimizer, so only 1 pt is given for c.
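A quick NumPy verification of these values, using the data matrix and response implied by the computation above (assumed to be the problem data):

```python
import numpy as np

# Data implied by the computation above: rows of X are (1, 1), (1, 2), (1, 4); y = (0, 1, 2)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0]])
y = np.array([0.0, 1.0, 2.0])

Q = 2 * X.T @ X   # [[ 6, 14], [14, 42]]
b = 2 * X.T @ y   # [ 6, 20]
c = y @ y         # 5.0
print(Q, b, c)
```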
The solution is as follows.
When solving for x_i, we want to set the derivative with respect to x_i equal to zero to find the minimum. Thus, we solve for x_i:
\[
Q_{ii} x_i + \sum_{j \neq i} Q_{ij} x_j + b_i = 0.
\]
However, we must also consider the box constraints l ≤ x ≤ u. This is where the truncation (projection) operator T_{[l_i, u_i]} comes into play, ensuring that x_i stays within its bounds:
\[
x_i = T_{[l_i, u_i]}\!\left(-\frac{\sum_{j \neq i} Q_{ij} x_j^{(k)} + b_i}{Q_{ii}}\right),
\]
where x_j^{(k)} denotes the value of x_j at iteration k, except for the current x_i, which is being updated.
iii. Convergence Check: After each full cycle of updates (i.e., after updating all n
coordinates), check for convergence. This can be done by evaluating whether the
change in the objective function or the change in the variable values is below a
predetermined threshold. If the convergence criteria are met, stop the algorithm.
iv. Repeat: If the convergence criteria are not met, repeat step 2 with the updated values of x. A sketch of the full procedure is given below.
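A minimal sketch of this projected (box-constrained) cyclic coordinate descent, assuming the objective has the form ½xᵀQx + bᵀx with symmetric Q, which is consistent with the stationarity condition above; the default starting point, cycle limit, and stopping tolerance are illustrative.

```python
import numpy as np

def box_coordinate_descent(Q, b, l, u, x0=None, max_cycles=100, tol=1e-8):
    """Cyclic coordinate descent for min 0.5 x^T Q x + b^T x subject to l <= x <= u."""
    n = Q.shape[0]
    x = np.clip(np.zeros(n) if x0 is None else x0.astype(float).copy(), l, u)
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(n):
            # Minimize over x_i with the other coordinates fixed, then project onto [l_i, u_i].
            resid = Q[i] @ x - Q[i, i] * x[i] + b[i]   # sum_{j != i} Q_ij x_j + b_i
            x[i] = np.clip(-resid / Q[i, i], l[i], u[i])
        if np.linalg.norm(x - x_old) < tol:            # convergence check after a full cycle
            break
    return x
```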
(a) To address the optimization problem without setting λ = 1 for convenience, and to maintain consistency with the provided problem notation, let's derive the proximal map for the L2 norm, h(z) = ∥z∥₂, with a general λ. The optimization problem to solve is
\[
\operatorname{prox}_{\lambda h}(v) = \arg\min_{x \in \mathbb{R}^n} \ \lambda\|x\|_2 + \frac{1}{2}\|x - v\|_2^2.
\]
   i. Step 1: Optimization Problem Formulation. Given that λ > 0, the problem can be written as
\[
\arg\min_{x \in \mathbb{R}^n} \ \lambda\|x\|_2 + \frac{1}{2}\|x - v\|_2^2.
\]
   ii. Step 2: Optimality Condition. The first-order optimality condition, considering that x and v are vectors in R^n (and assuming x ≠ 0 so the norm is differentiable), is
\[
\lambda\frac{x}{\|x\|_2} + x - v = 0.
\]
   iii. Step 3: Solve for the Scaling. The condition forces x to be a nonnegative multiple of v, so write x = σv with σ ≥ 0. Substituting and simplifying, we find
\[
\lambda\frac{v}{\|v\|_2} + (\sigma - 1)v = 0 \;\Rightarrow\; \sigma = 1 - \frac{\lambda}{\|v\|_2}.
\]
   iv. Step 4: Applying the Constraint. The value of σ indicates how much v is scaled to obtain x. However, we must ensure x is not scaled negatively, which corresponds to the condition σ ≥ 0, i.e.,
\[
1 - \frac{\lambda}{\|v\|_2} \ge 0 \;\Rightarrow\; \|v\|_2 \ge \lambda.
\]
Therefore,
\[
\operatorname{prox}_{\lambda\|\cdot\|_2}(v) =
\begin{cases}
\left(1 - \dfrac{\lambda}{\|v\|_2}\right)v & \text{if } \|v\|_2 \ge \lambda,\\[4pt]
0 & \text{otherwise.}
\end{cases}
\]
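A minimal sketch of this block soft-thresholding operator:

```python
import numpy as np

def prox_l2(v, lam):
    """Proximal operator of lam * ||.||_2 (block soft-thresholding), as derived above."""
    norm_v = np.linalg.norm(v)
    if norm_v <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm_v) * v
```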
The optimality (subgradient) condition is
\[
0 \in \hat{x} - b + \lambda_1 \partial\|\hat{x}\|_1 + \lambda_2 \partial\|\hat{x}\|_2,
\]
where
\[
u \in \partial\|\cdot\|_1(\hat{x}):\ u_i \in
\begin{cases}
[-1, 1] & \text{if } \hat{x}_i = 0,\\
\{\operatorname{sgn}(\hat{x}_i)\} & \text{if } \hat{x}_i \neq 0,
\end{cases}
\qquad
v \in \partial\|\cdot\|_2(\hat{x}) =
\begin{cases}
\{z \mid \|z\|_2 \le 1\} & \text{if } \hat{x} = 0,\\
\{\hat{x}/\|\hat{x}\|_2\} & \text{if } \hat{x} \neq 0.
\end{cases}
\]
Notes:
- The optimization problem tries to minimize the norms of x̂ while keeping it close to b.
- For any element of x̂ that is not zero, its sign is identical to the sign of the corresponding element of b; namely, for all i ∈ {j | x̂_j ≠ 0}, sgn(x̂_i) = sgn(b_i). The reason is simple: if sgn(x̂_i) ≠ sgn(b_i), then by setting x̂_i ← −x̂_i one could decrease the distance to b while keeping both norms the same, contradicting the optimality of x̂.
   i. Case x̂ = 0. In this case the condition above gives
\[
b = \lambda_1 u + \lambda_2 v \iff b - \lambda_1 u = \lambda_2 v.
\]
Since u_i ∈ [−1, 1] and ∥v∥₂ ≤ 1, as long as ∥b − λ₁u∥₂ ≤ λ₂ one can set x̂ = 0 while the condition holds. Finding the edge case (with respect to b) is simple, since it can be done elementwise between b and u: the norm ∥b − λ₁u∥₂ is smallest when λ₁u_i matches b_i wherever |b_i| ≤ λ₁ and u_i = sgn(b_i) otherwise, which yields the condition
\[
\|S_{\lambda_1}(b)\|_2 \le \lambda_2.
\]
In the remaining case, x̂ ≠ 0, it is guaranteed that λ₂ < ∥S_{λ₁}(b)∥₂, hence the scaling factor (1 − λ₂/∥S_{λ₁}(b)∥₂) in the solution below is positive, as needed.
Summary: the solution is given by
\[
\hat{x} = \operatorname{prox}_{\lambda_1\|\cdot\|_1 + \lambda_2\|\cdot\|_2}(b) =
\begin{cases}
0 & \text{if } \|S_{\lambda_1}(b)\|_2 \le \lambda_2,\\[4pt]
\left(1 - \dfrac{\lambda_2}{\|S_{\lambda_1}(b)\|_2}\right) S_{\lambda_1}(b) & \text{if } \|S_{\lambda_1}(b)\|_2 > \lambda_2
\end{cases}
\]
\[
= \left(1 - \frac{\lambda_2}{\max\big(\|S_{\lambda_1}(b)\|_2,\ \lambda_2\big)}\right) S_{\lambda_1}(b)
= \operatorname{prox}_{\lambda_2\|\cdot\|_2}\!\big(\operatorname{prox}_{\lambda_1\|\cdot\|_1}(b)\big).
\]
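A minimal sketch of this composed prox (soft-threshold, then block soft-threshold), assuming λ₂ > 0 so the max in the denominator is safe:

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t * ||.||_1, elementwise
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l1_plus_l2(b, lam1, lam2):
    """prox of lam1*||.||_1 + lam2*||.||_2, computed as the composition derived above."""
    s = soft_threshold(b, lam1)
    scale = 1.0 - lam2 / max(np.linalg.norm(s), lam2)
    return scale * s
```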
where ϵ > 0 is a constant. To do this, we first compute the proximal operator for h(x).
Step 1: Compute the proximal operator for h(x). The objective of the proximal operator for h(x) is
\[
\arg\min_x\ \frac{1}{2}x^T A x - b^T x + \frac{1}{2t}\left\|x - x^{(k)}\right\|^2.
\]
Expanding the quadratic term, we get
\[
\arg\min_x\ \frac{1}{2}x^T A x - b^T x + \frac{1}{2t}\left(x - x^{(k)}\right)^T\left(x - x^{(k)}\right)
= \arg\min_x\ \frac{1}{2}x^T A x - b^T x + \frac{1}{2t}x^T x - \frac{1}{t}x^{(k)T} x + \frac{1}{2t}x^{(k)T} x^{(k)}.
\]
Step 2: Differentiate and set to zero to solve for x. Taking the derivative with respect to x and setting it to zero gives
\[
Ax - b + \frac{1}{t}x - \frac{1}{t}x^{(k)} = 0.
\]
Rearranging and solving for x:
\[
\left(A + \frac{1}{t}I\right)x = b + \frac{1}{t}x^{(k)}
\quad\Longrightarrow\quad
x = \left(A + \frac{1}{t}I\right)^{-1}\left(b + \frac{1}{t}x^{(k)}\right).
\]
This is achieved by rearranging the equation to match the iterative refinement form, demon-
strating the equivalence between the proximal minimization algorithm applied to h(x) and
the iterative refinement algorithm.
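A minimal sketch of this proximal minimization iteration, x^{(k+1)} = (A + I/t)^{-1}(b + x^{(k)}/t); the rank-deficient test matrix, step size t, and iteration count are illustrative assumptions.

```python
import numpy as np

def proximal_minimization(A, b, t=1.0, iters=200, x0=None):
    """Proximal-point iteration for h(x) = 0.5 x^T A x - b^T x:
    x^{(k+1)} = (A + I/t)^{-1} (b + x^{(k)}/t).
    The linear system stays well posed even when A is rank deficient,
    because A + I/t (with A positive semidefinite) is always invertible."""
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    M = A + np.eye(n) / t
    for _ in range(iters):
        x = np.linalg.solve(M, b + x / t)
    return x

# Example with a rank-deficient A (assumed data); the iterates converge to a solution of Ax = b.
A = np.array([[1.0, 1.0], [1.0, 1.0]])   # rank 1
b = np.array([2.0, 2.0])                 # consistent right-hand side
print(proximal_minimization(A, b))       # approximately [1., 1.]
```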
Comment on the usefulness for rank-deficient A The iterative refinement algorithm, and
hence the proximal minimization algorithm, is particularly useful for solving problems
where A is rank-deficient (i.e., does not have full rank). In such cases, directly solving
Ax = b may be problematic due to the lack of a unique solution or numerical instability.
The addition of ϵI (where I is the identity matrix) to A in the term (A + ϵI)^{-1} ensures that the matrix to be inverted is always full rank and thus invertible. This regularization
technique stabilizes the solution process and allows for the computation of a solution even
when A is rank-deficient, by effectively adding a small amount to the diagonal elements of A, improving its condition number.