
1 Task 1

To show the equivalence in (1), we start from the left-hand side and manipulate
it to arrive at the right-hand side.

\[
\begin{aligned}
\operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{p(x,y)}\left[\log p_\theta(y|x)\right]
&= \operatorname*{arg\,max}_{\theta} \sum_{y \in \mathcal{Y}} \int p(x, y) \log p_\theta(y|x) \, dx \\
&= \operatorname*{arg\,max}_{\theta} \sum_{y \in \mathcal{Y}} \int p(x)\, p(y|x) \log p_\theta(y|x) \, dx \\
&= \operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{p(x)}\Big[\textstyle\sum_{y \in \mathcal{Y}} p(y|x) \log p_\theta(y|x)\Big] \\
&= \operatorname*{arg\,min}_{\theta} \; \mathbb{E}_{p(x)}\Big[\textstyle\sum_{y \in \mathcal{Y}} -p(y|x) \log p_\theta(y|x)\Big] \\
&= \operatorname*{arg\,min}_{\theta} \; \mathbb{E}_{p(x)}\left[D_{\mathrm{KL}}\big(p(y|x)\,\|\,p_\theta(y|x)\big)\right]
\end{aligned}
\]
The key steps are:

1. Expand the expectation using the definition p(x, y) = p(x)p(y|x).
2. Rewrite the integral over x as an expectation over p(x), keeping the sum over y inside.
3. Flip the sign of the objective, which turns the max into a min.
4. Recognize the definition of the KL divergence given in (2): $D_{\mathrm{KL}}(p(y|x)\,\|\,p_\theta(y|x)) = \sum_y p(y|x)\log p(y|x) - \sum_y p(y|x)\log p_\theta(y|x)$, and note that the first (entropy) term does not depend on θ, so adding it does not change the arg min.

Thus, we have shown the equivalence stated in (1): maximizing the expected log-likelihood is equivalent to minimizing the expected KL divergence between the true conditional distribution p(y|x) and the model's conditional distribution pθ(y|x).
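As a quick numerical illustration of this argument (not part of the original derivation), the sketch below compares the expected negative log-likelihood and the expected KL objective over a hypothetical one-parameter model family. The two curves differ only by the conditional entropy of p(y|x), which is constant in θ, so they share the same minimizer.

```python
import numpy as np

# Toy check: expected negative log-likelihood and expected KL differ by a
# constant (the conditional entropy of p(y|x)), so they have the same argmin.
p_x = np.array([0.3, 0.7])                       # p(x) over two values of x
p_y_given_x = np.array([[0.9, 0.1],              # p(y|x) for each x (binary y)
                        [0.2, 0.8]])

def p_model(theta):
    """Hypothetical model p_theta(y|x) = [theta, 1 - theta] for every x."""
    return np.array([[theta, 1 - theta],
                     [theta, 1 - theta]])

thetas = np.linspace(0.01, 0.99, 99)
neg_ll = np.array([-(p_x[:, None] * p_y_given_x * np.log(p_model(t))).sum()
                   for t in thetas])
kl = np.array([(p_x[:, None] * p_y_given_x
                * np.log(p_y_given_x / p_model(t))).sum() for t in thetas])

assert np.argmin(neg_ll) == np.argmin(kl)   # same minimizer
print(np.ptp(neg_ll - kl))                  # ~0: the gap is constant in theta
```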

2 Task 2
To show that for any choice of θ, there exists γ such that pθ (y|x) = pγ (y|x):
First, note that from equation (5), the logistic regression model is

\[
p_\gamma(y|x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)},
\]

and from equation (7), the naive Bayes model's conditional distribution is

\[
p_\theta(y|x) = \frac{\pi_y \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y)\right)}{\sum_{i=1}^{k} \pi_i \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i)\right)},
\]

where the Gaussian normalizing constant Z(σ) is the same for every class and cancels between the numerator and the denominator.


Now, for any given θ = {π1, π2, ..., πk, µ1, µ2, ..., µk, σ}, define γ = {w1, w2, ..., wk, b1, b2, ..., bk} where:

\[
w_i = \frac{1}{\sigma^2}\mu_i, \qquad
b_i = \log \pi_i - \frac{1}{2\sigma^2}\mu_i^\top \mu_i .
\]

With this choice,

\[
x^\top w_i + b_i = -\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i) + \frac{1}{2\sigma^2}x^\top x + \log \pi_i .
\]

Substituting these into the expression for pγ(y|x):

\[
\begin{aligned}
p_\gamma(y|x)
&= \frac{\exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y) + \frac{1}{2\sigma^2} x^\top x + \log \pi_y\right)}
        {\sum_{i=1}^{k} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i) + \frac{1}{2\sigma^2} x^\top x + \log \pi_i\right)} \\
&= \frac{\pi_y \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y)\right)}
        {\sum_{i=1}^{k} \pi_i \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i)\right)} \\
&= p_\theta(y|x),
\end{aligned}
\]

where the second equality follows because the common factor $\exp\!\left(\frac{1}{2\sigma^2} x^\top x\right)$ cancels between the numerator and the denominator.

Therefore, for any θ we can find a γ such that pθ(y|x) = pγ(y|x), showing that every conditional distribution realizable by the naive Bayes model is also realizable by the logistic regression model.
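As an optional sanity check (not required by the assignment), the following sketch draws a random naive Bayes model θ, maps it to γ with the formulas above, and verifies numerically that the two conditionals agree. The class count k, dimension d, and variance σ² are made-up values.

```python
import numpy as np

# Numerical check of the Task 2 construction: theta -> gamma gives the same conditional.
rng = np.random.default_rng(0)
k, d, sigma2 = 4, 3, 0.7
pi = rng.dirichlet(np.ones(k))          # class priors pi_i
mu = rng.normal(size=(k, d))            # class means mu_i

# gamma from theta: w_i = mu_i / sigma^2,  b_i = log pi_i - mu_i^T mu_i / (2 sigma^2)
W = mu / sigma2
b = np.log(pi) - (mu * mu).sum(axis=1) / (2 * sigma2)

x = rng.normal(size=d)

# Logistic regression conditional p_gamma(y|x)
logits = x @ W.T + b
p_gamma = np.exp(logits - logits.max()); p_gamma /= p_gamma.sum()

# Naive Bayes conditional p_theta(y|x) (Gaussian normalizer cancels)
log_joint = np.log(pi) - ((x - mu) ** 2).sum(axis=1) / (2 * sigma2)
p_theta = np.exp(log_joint - log_joint.max()); p_theta /= p_theta.sum()

print(np.allclose(p_gamma, p_theta))    # True
```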

3 Task 3
(a) Without any conditional independence assumptions, the total number of independent parameters needed to describe the joint distribution over (X1, ..., Xn) is $\prod_{i=1}^{n} k_i - 1$, where ki = |val(Xi)| is the number of possible values of variable Xi; the −1 comes from the constraint that the probabilities sum to 1.

(b) Under the assumption that (X1, ..., Xn) are independent random variables, the total number of independent parameters needed is $\sum_{i=1}^{n}(k_i - 1)$: for each variable Xi, we need ki − 1 parameters to specify its marginal distribution, since the probabilities must sum to 1.
(c) Let 1, 2, ..., n denote a topological ordering for a Bayesian network over the random variables X1, X2, ..., Xn. Mathematically, we impose the independence assumptions

\[
p(X_i \mid X_{i-1}, X_{i-2}, \ldots, X_2, X_1) = p(X_i \mid X_{i-1}, X_{i-2}, \ldots, X_{i-m})
\]

for i > m; for i ≤ m, no conditional independence is imposed on Xi with respect to its ancestors.
To derive the total number of independent parameters, we use the chain rule of probability together with the given independence assumptions:

\[
p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{i-1}, \ldots, X_1)
= \prod_{i=1}^{n} p(X_i \mid \mathrm{Pa}(X_i)),
\]

where Pa(Xi) denotes the parents of Xi in the Bayesian network.


For each variable Xi, we need to specify the conditional probability distribution p(Xi | Pa(Xi)). The number of independent parameters for this distribution is $(k_i - 1)\prod_{X_j \in \mathrm{Pa}(X_i)} k_j$.

Therefore, the total number of independent parameters is:

\[
\sum_{i=1}^{n} (k_i - 1) \prod_{X_j \in \mathrm{Pa}(X_i)} k_j .
\]

This can be computed by summing the number of parameters for each condi-
tional probability distribution in the Bayesian network.
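To make the counting concrete, here is a small illustrative sketch (the network and the cardinalities are made up, not from the assignment) that computes the three counts from parts (a)-(c).

```python
from math import prod

# Example network: cardinalities k and a parent set for each variable.
k = {"X1": 2, "X2": 3, "X3": 2, "X4": 4}
parents = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X3"]}

full_joint = prod(k.values()) - 1                        # part (a)
fully_independent = sum(ki - 1 for ki in k.values())     # part (b)
bayes_net = sum((k[v] - 1) * prod(k[p] for p in parents[v])  # part (c)
                for v in k)

print(full_joint, fully_independent, bayes_net)          # 47 7 17
```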

4 Task 4
For the case where n = 2, the reverse model gives:

\[
\begin{aligned}
p_r(x_1 \mid x_2) &= \frac{p_r(x_1, x_2)}{p_r(x_2)} \\
&= \frac{\mathcal{N}(x_1 \mid \tilde{\mu}_1(x_2), \tilde{\sigma}_1^2(x_2)) \cdot \mathcal{N}(x_2 \mid \tilde{\mu}_2, \tilde{\sigma}_2^2)}{\mathcal{N}(x_2 \mid \tilde{\mu}_2, \tilde{\sigma}_2^2)} \\
&= \mathcal{N}(x_1 \mid \tilde{\mu}_1(x_2), \tilde{\sigma}_1^2(x_2)),
\end{aligned}
\]

so the reverse model's conditional pr(x1|x2) is a single Gaussian for every x2, and its marginal pr(x2) is a single Gaussian as well. In contrast, for the forward construction below, pf(x1|x2) is a mixture of truncated Gaussians whose mixture weights depend on ε.

To prove the (*) or (**) statements, consider the forward factorization:

\[
p_f(x_1) = \mathcal{N}(x_1 \mid 0, 1), \qquad
p_f(x_2 \mid x_1) = \mathcal{N}(x_2 \mid \mu_2(x_1), \epsilon),
\]

where

\[
\mu_2(x_1) =
\begin{cases}
0 & \text{if } x_1 \leq 0 \\
1 & \text{otherwise.}
\end{cases}
\]

This construction makes pf(x2) a mixture of two distinct Gaussians, which pr(x2) cannot match since pr(x2) is constrained to be a single Gaussian. Any counterexample of this form that makes pf(x2) non-Gaussian suffices for full credit.
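A quick simulation (illustrative only; the step location and the value of ε are arbitrary choices) shows that samples of x2 from this forward model form a two-component Gaussian mixture, which a single Gaussian pr(x2) cannot represent.

```python
import numpy as np

# Sample from the forward model p_f(x1) = N(0, 1), p_f(x2 | x1) = N(mu2(x1), eps),
# with mu2(x1) = 0 if x1 <= 0 else 1, and inspect the marginal of x2.
rng = np.random.default_rng(0)
eps = 0.01                                   # small variance, as in the argument above
x1 = rng.normal(0.0, 1.0, size=100_000)
mu2 = np.where(x1 <= 0.0, 0.0, 1.0)
x2 = rng.normal(mu2, np.sqrt(eps))

# Roughly half the mass sits near 0 and half near 1, with almost nothing in between:
# a two-component mixture, not a single Gaussian.
print(np.mean(np.abs(x2 - 0.0) < 0.3),
      np.mean(np.abs(x2 - 1.0) < 0.3),
      np.mean(np.abs(x2 - 0.5) < 0.1))       # ~0.5, ~0.5, ~0
```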
Interestingly, we can also reason intuitively about the distribution pf(x1|x2). If one chooses a very small positive ε, then the corresponding pf(x1|x2) approaches a truncated Gaussian distribution, which cannot be approximated by the Gaussian pr(x1|x2).
Optionally, we can prove (*) and a variant of (**) which states that for any ε > 0, the distribution

\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}
\]

is a mixture of truncated Gaussians whose mixture weights depend on ε.

5 Task 5
a) For the case n = 2, we have:

\[
\begin{aligned}
p_f(x_1, x_2) &= \prod_{i=1}^{2} p_f(x_i \mid x_{<i}) \\
&= \prod_{i=1}^{2} \mathcal{N}(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})) \\
&= p_f(x_1) \cdot p_f(x_2 \mid x_1) \\
&= \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \cdot \mathcal{N}(x_2 \mid \mu_2(x_1), \sigma_2^2(x_1)),
\end{aligned}
\]

where µ1 and σ1² are constants, and µ2(x1) and σ2²(x1) are functions of x1 represented by the neural networks.

With the construction from Task 4, the induced conditional pf(x1|x2) = pf(x1, x2)/pf(x2) is a mixture of truncated Gaussians whose mixture weights depend on ε.
b) To show the models cover the same hypothesis space of distributions, we need to prove that given any choice of {µi, σi}, i = 1, ..., n, in the forward model, there always exists a choice of {µ̃i, σ̃i}, i = 1, ..., n, in the reverse model such that pf = pr. The key is to show that any conditional distribution pf(xi|x<i) in the forward model can be matched by a corresponding conditional pr(xi|x>i) in the reverse model.
In the forward model, we have:

\[
p_f(x_i \mid x_{<i}) = \mathcal{N}(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})).
\]

In the reverse model, we need to choose µ̃i and σ̃i as functions of x>i such that:

\[
p_r(x_i \mid x_{>i}) = \mathcal{N}(x_i \mid \tilde{\mu}_i(x_{>i}), \tilde{\sigma}_i^2(x_{>i}))
= \mathcal{N}(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})).
\]

This can be achieved by setting:

\[
\tilde{\mu}_i(x_{>i}) = \mu_i(x_{<i}), \qquad
\tilde{\sigma}_i^2(x_{>i}) = \sigma_i^2(x_{<i}).
\]

Since the neural networks are universal function approximators, they can represent these mappings from x>i to µi(x<i) and σi²(x<i).
Therefore, the forward and reverse models cover the same hypothesis space.
A concrete counterexample or mathematical definitions are not needed - the
universality of the neural networks is sufficient to prove they can match any
distribution in the hypothesis space.

6 Task 6
6.1 a)
To find an efficient bit representation for the 50257 tokens, we want to represent
each token as (a1 , a2 , ..., an ), where ai ∈ {0, 1}, ∀i = 1, 2, ..., n. The minimal n
we can use is: n = ⌈log2 (50257)⌉ = 16
So the minimal number of bits needed to represent each token is 16.
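A one-line check of this count (just arithmetic, included for completeness):

```python
import math

print(math.ceil(math.log2(50257)))   # 16, since 2**15 = 32768 < 50257 <= 65536 = 2**16
```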

6.2 b)
If the number of possible tokens increases from 50257 to 60000, the increase in the number of parameters is computed as follows.

The number of parameters in the GPT-2 model is 4 × 768 × 50257 = 154,389,504. If the number of tokens increases to 60000, the new number of parameters would be 4 × 768 × 60000 = 184,320,000.

The increase is: 184,320,000 − 154,389,504 = 29,930,496 parameters.
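A quick check of the arithmetic above (the 4 × 768 factor is taken as given from the problem setup):

```python
factor = 4 * 768                         # per-token parameter count assumed in the text
old_vocab, new_vocab = 50257, 60000

print(factor * old_vocab)                # 154389504
print(factor * new_vocab)                # 184320000
print(factor * (new_vocab - old_vocab))  # 29930496
```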

6.3 c)
To generate paper abstracts using the GPT-2 model:

1. Implement the sampling procedure for GPT-2 as described in the image.
2. Choose 5 sentences from the abstracts of some NeurIPS papers.
3. For each chosen sentence (a minimal code sketch of this loop follows the list):
   (a) Tokenize the sentence into a sequence of token IDs.
   (b) Feed this sequence into the trained GPT-2 model.
   (c) Sample the next token from the softmax probability distribution over the vocabulary output by GPT-2.
   (d) Append the sampled token to the sequence and repeat from (b) until an end-of-sequence token is generated or a maximum length is reached.
4. Detokenize the generated token sequences back into text to obtain the generated abstract continuations.
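Below is a minimal sketch of steps 3(a)-(d), assuming the publicly available HuggingFace transformers GPT-2 checkpoint; the prompt string is a made-up example, and the assignment's own model and sampling utilities may differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "We propose a new variational method for training deep generative models that"
input_ids = tokenizer.encode(prompt, return_tensors="pt")   # (a) tokenize

max_new_tokens = 100
with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                 # (b) feed the sequence to GPT-2
        probs = torch.softmax(logits[0, -1], dim=-1)     # (c) next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)  # (d) append
        if next_id.item() == tokenizer.eos_token_id:     # stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0]))                    # detokenize the continuation
```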

6.4 f)
To perform temperature scaling, we pick a scalar temperature T > 0 and divide
the next token logits by T :

\[
p_T(x_i \mid x_{<i}) \propto e^{\log p(x_i \mid x_{<i})/T} = p(x_i \mid x_{<i})^{1/T}.
\]


The model p is the GPT-2 model and pT is the temperature-scaled model.
For T < 1, pT induces a sharper distribution than p, making likely tokens even
more likely.
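Concretely, temperature scaling is a one-line change to the sampling step; a minimal illustrative helper (the name and signature are ours, not from the assignment):

```python
import torch

def sample_with_temperature(next_token_logits: torch.Tensor, T: float) -> int:
    """Sample a token id from temperature-scaled next-token logits of shape (vocab_size,)."""
    probs = torch.softmax(next_token_logits / T, dim=-1)   # T < 1 sharpens, T > 1 flattens
    return torch.multinomial(probs, num_samples=1).item()
```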

6.5 g)
In the previous question, temperature scaling was only performed over the next
token, i.e. with T < 1 we made likely next-tokens even more likely. To make
likely sentences even more likely, consider joint temperature scaling:

\[
p^{\text{joint}}_T(x_0, x_1, \ldots, x_M) \propto e^{\log p(x_0, x_1, \ldots, x_M)/T}
\]

Applying the chain rule with single-token temperature scaling does not recover joint temperature scaling. In other words, the following equation does not hold for arbitrary T:

\[
\prod_{i=0}^{M} p_T(x_i \mid x_{<i}) \overset{?}{=} p^{\text{joint}}_T(x_0, x_1, \ldots, x_M)
\]

Hint: Suppose we have an autoregressive model p(x, y) parameterized as:

\[
p(x) \propto f(x), \qquad p(y \mid x) \propto g(x, y)
\]

with non-negative functions f, g. We cannot assume p(x, y) ∝ f(x)g(x, y), because we must write

\[
p(y \mid x) = \frac{g(x, y)}{\int g(x, y)\, dy},
\]

where the normalizing constant depends on x. If we write the joint probability as f(x)g(x, y) up to a single global constant, we incorrectly assume the normalizing constant is the same for all x's.

Detailed

The equation does not hold for arbitrary T because applying the chain rule
with single-token temperature scaling does not recover joint temperature scaling.
To see this, let’s expand both sides:

\[
\prod_{i=0}^{M} p_T(x_i \mid x_{<i}) = \prod_{i=0}^{M} \frac{p(x_i \mid x_{<i})^{1/T}}{\sum_{x_i'} p(x_i' \mid x_{<i})^{1/T}}
\]

\[
p^{\text{joint}}_T(x_0, x_1, \ldots, x_M) = \frac{p(x_0, x_1, \ldots, x_M)^{1/T}}{\sum_{x_0', \ldots, x_M'} p(x_0', x_1', \ldots, x_M')^{1/T}}
\]

The denominators are different: in the chain rule, each denominator is a sum
over a single variable x′i , while in joint temperature scaling, the denominator is
a sum over all possible sequences. The normalizing constants are not the same,
so the equation does not hold in general.
Intuitively, joint temperature scaling directly modifies the joint probability,
while applying single-token temperature scaling at each step does not necessarily
have the same effect on the joint probability.
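A tiny numerical example makes the mismatch concrete. The joint distribution below over two binary tokens is made up for illustration; per-token temperature scaling through the chain rule and joint temperature scaling give different distributions because the per-step normalizers depend on x0.

```python
import numpy as np

# Made-up joint over two binary tokens; rows index x0, columns index x1.
p = np.array([[0.5, 0.2],
              [0.2, 0.1]])
T = 0.5

# Joint temperature scaling: renormalize p^(1/T) over all sequences.
joint = p ** (1 / T)
joint /= joint.sum()

# Per-token temperature scaling applied through the chain rule.
p_x0 = p.sum(axis=1)                                  # p(x0)
p_x1_given_x0 = p / p_x0[:, None]                     # p(x1 | x0)
q_x0 = p_x0 ** (1 / T); q_x0 /= q_x0.sum()
q_x1_given_x0 = p_x1_given_x0 ** (1 / T)
q_x1_given_x0 /= q_x1_given_x0.sum(axis=1, keepdims=True)
chained = q_x0[:, None] * q_x1_given_x0

print(joint)     # [[0.735, 0.118], [0.118, 0.029]]
print(chained)   # [[0.728, 0.117], [0.124, 0.031]]  -- not the same distribution
```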

6.6 h)
Next, we will implement temperature scaling over more than one token (for
simplicity, we will do temperature scaling over two tokens):

pjoint-2
T (xi , xi+1 |x<i ) ∝ elog p(xi ,xi+1 |x<i )/T
X joint-2
pjoint-2
T (xi |x<i ) = pT (xi , xi+1 = aj |x<i )
j

Detailed

To implement temperature scaling over two tokens, we first compute the joint probability of the next two tokens given the context:

\[
p^{\text{joint-2}}_T(x_i, x_{i+1} \mid x_{<i}) \propto p(x_i, x_{i+1} \mid x_{<i})^{1/T}.
\]

This can be done by forming the matrix of joint log-probabilities log p(xi, xi+1|x<i) = log p(xi|x<i) + log p(xi+1|x<i, xi) over all candidate pairs, dividing by T, and applying a softmax over all entries of the matrix.

To compute the marginal probability of the next token xi, we sum over all possible values of xi+1:

\[
p^{\text{joint-2}}_T(x_i \mid x_{<i}) = \sum_j p^{\text{joint-2}}_T(x_i, x_{i+1} = a_j \mid x_{<i}),
\]

where aj ranges over the vocabulary. This can be implemented by summing over the columns of the probability matrix (i.e., summing out xi+1 within each row).

Finally, to sample xi, we use the marginal probabilities p^joint-2_T(xi|x<i). After sampling xi, we condition on it and use the conditional probabilities p^joint-2_T(xi+1|xi, x<i) to sample xi+1.
