1 Task 1
To show the equivalence in (1), we start from the left-hand side and manipulate
it to arrive at the right-hand side.
\[
\arg\max_{y \in \mathcal{Y}} \mathbb{E}_{p(x,y)}\left[\log p_\theta(y \mid x)\right]
= \arg\max_{y \in \mathcal{Y}} \int p(x, y) \log p_\theta(y \mid x)\, dx
= \arg\max_{y \in \mathcal{Y}} \int p(x)\, p(y \mid x) \log p_\theta(y \mid x)\, dx
\]
where the last step uses the chain rule \(p(x, y) = p(x)\, p(y \mid x)\).
2 Task 2
To show that for any choice of θ, there exists γ such that \(p_\theta(y \mid x) = p_\gamma(y \mid x)\):
First, note that from equation (5), we have:
\[
p_\gamma(y \mid x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}
\]
Given θ, choose
\[
w_y = \frac{\mu_y}{\sigma^2}, \qquad b_y = -\frac{1}{2\sigma^2}\,\mu_y^\top \mu_y + Z^{-1}(\sigma).
\]
Substituting these into the expression for \(p_\gamma(y \mid x)\), and multiplying the numerator and denominator by the common factor \(\exp(-\frac{1}{2\sigma^2} x^\top x)\), gives
\[
p_\gamma(y \mid x) = \frac{\exp\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y) + Z^{-1}(\sigma)\right)}{\sum_{i=1}^{k} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i) + Z^{-1}(\sigma)\right)} = p_\theta(y \mid x).
\]
Therefore, for any θ, we can find a γ such that \(p_\theta(y \mid x) = p_\gamma(y \mid x)\), showing the equivalence between the logistic regression and naive Bayes models.
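As a sanity check, this identification can be verified numerically. The NumPy sketch below uses assumed values for the number of classes, the feature dimension, and the shared variance, and (as an added assumption not in the derivation above) includes explicit class priors \(p(y)\) via a \(\log p(y)\) term in \(b_y\); the common normalizer cancels inside the softmax either way.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, sigma2 = 3, 4, 0.7           # classes, feature dim, shared variance (assumed)
mus = rng.normal(size=(k, d))      # class means mu_y
priors = np.full(k, 1.0 / k)       # assumed uniform class priors p(y)

# Logistic-regression parameters induced by the Gaussian naive Bayes model:
# w_y = mu_y / sigma^2,  b_y = -mu_y^T mu_y / (2 sigma^2) + log p(y)
W = mus / sigma2
b = -np.sum(mus**2, axis=1) / (2 * sigma2) + np.log(priors)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_gamma(x):
    # Softmax posterior (logistic regression form)
    return softmax(x @ W.T + b)

def p_theta(x):
    # Bayes posterior from Gaussian class conditionals; the common
    # -x^T x / (2 sigma^2) term cancels inside the softmax
    logliks = -np.sum((x - mus)**2, axis=1) / (2 * sigma2) + np.log(priors)
    return softmax(logliks)

x = rng.normal(size=d)
assert np.allclose(p_gamma(x), p_theta(x))
```

The cancellation of the shared quadratic term is exactly why the two models agree pointwise.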
3 Task 3
(a) Without any conditional independence assumptions, the total number of independent parameters needed to describe the joint distribution over \((X_1, \ldots, X_n)\) is \(\prod_{i=1}^{n} k_i - 1\), where \(k_i = |\mathrm{val}(X_i)|\) is the number of possible values for variable \(X_i\); the \(-1\) comes from the constraint that the probabilities sum to 1.
(b) Under the assumption that \((X_1, \ldots, X_n)\) are independent random variables, the total number of independent parameters needed is \(\sum_{i=1}^{n} (k_i - 1)\). For each variable \(X_i\), we need \(k_i - 1\) parameters to specify its marginal distribution, since the probabilities must sum to 1.
(c) Let \(1, 2, \ldots, n\) denote the topological sort for a Bayesian network over the random variables \(X_1, X_2, \ldots, X_n\). Mathematically, we impose the independence assumptions:
\[
p(X_i \mid X_{i-1}, X_{i-2}, \ldots, X_2, X_1) = p(X_i \mid X_{i-1}, X_{i-2}, \ldots, X_{i-m}) \tag{1}
\]
Therefore, the total number of independent parameters is:
\[
\sum_{i=1}^{n} (k_i - 1) \prod_{X_j \in \mathrm{Pa}(X_i)} k_j
\]
This can be computed by summing the number of parameters for each conditional probability distribution in the Bayesian network.
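The three counting formulas in this task can be checked with a short script; the cardinalities and the chain-structured network below are hypothetical examples chosen for illustration.

```python
from math import prod

def joint_params(ks):
    # Full joint over discrete variables: prod(k_i) outcomes, minus 1
    # because the probabilities must sum to 1.
    return prod(ks) - 1

def independent_params(ks):
    # Fully independent variables: k_i - 1 parameters per marginal.
    return sum(k - 1 for k in ks)

def bayes_net_params(ks, parents):
    # parents[i] = list of parent indices of X_i.
    # Each CPD p(X_i | Pa(X_i)) needs (k_i - 1) * prod_{j in Pa(i)} k_j params.
    return sum((ks[i] - 1) * prod(ks[j] for j in parents[i])
               for i in range(len(ks)))

ks = [2, 3, 2]                       # hypothetical cardinalities
chain = {0: [], 1: [0], 2: [1]}      # chain network X1 -> X2 -> X3
print(joint_params(ks))              # 11
print(independent_params(ks))        # 4
print(bayes_net_params(ks, chain))   # 1 + 2*2 + 1*3 = 8
```

As expected, the Bayesian network count sits between the fully independent and fully general cases.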
4 Task 4
For the case where n = 2, we have:
\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}
= \frac{\mathcal{N}(x_1 \mid \mu_1(x_2), \sigma_1^2(x_2)) \cdot \mathcal{N}(x_2 \mid \mu_2, \sigma_2^2)}{\mathcal{N}(x_2 \mid \mu_2, \sigma_2^2)}
= \mathcal{N}(x_1 \mid \mu_1(x_2), \sigma_1^2(x_2))
\]
where
\[
\mu_2(x_1) = \begin{cases} 0 & \text{if } x_1 \le 0 \\ 1 & \text{otherwise} \end{cases}
\]
The construction makes \(p_f(x_2)\) a mixture of two distinct Gaussians, which \(p_r(x_2)\) cannot match, since \(p_r(x_2)\) is strictly Gaussian. Any counterexample of this form that makes \(p_f(x_2)\) non-Gaussian suffices for full credit.
Interestingly, we can also build intuition about the distribution \(p_f(x_1 \mid x_2)\). If one chooses a very small positive ϵ, then the corresponding \(p_f(x_1 \mid x_2)\) approaches a truncated Gaussian distribution, which cannot be approximated by the Gaussian \(p_r(x_1 \mid x_2)\).
Optionally, we can prove (*) and a variant of (**) which states that for any ϵ > 0, the distribution
\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}
\]
is a mixture of truncated Gaussians whose mixture weights depend on ϵ.
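A quick simulation illustrates why \(p_f(x_2)\) cannot be Gaussian. The prior \(x_1 \sim \mathcal{N}(0, 1)\) and the conditional standard deviation ϵ = 0.1 are assumed values for illustration; with the piecewise mean above, the marginal of \(x_2\) is an even mixture of two narrow Gaussians at 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1                          # assumed small conditional std
n = 100_000

x1 = rng.normal(0.0, 1.0, size=n)          # assumed prior: x1 ~ N(0, 1)
mu2 = np.where(x1 <= 0, 0.0, 1.0)          # piecewise mean mu_2(x1)
x2 = rng.normal(mu2, eps)                  # x2 | x1 ~ N(mu_2(x1), eps^2)

# For the mixture 0.5*N(0, eps^2) + 0.5*N(1, eps^2):
# mean = 0.5, var = eps^2 + 0.25 (between-component spread dominates).
print(x2.mean())
print(x2.var())

# A single Gaussian with this mean/variance puts substantial mass near 0.5;
# the mixture has almost none there when eps is small.
near_midpoint = np.mean(np.abs(x2 - 0.5) < 2 * eps)
print(near_midpoint)   # close to 0 -> clearly non-Gaussian
```

No single-Gaussian \(p_r(x_2)\) can reproduce the empty gap between the two modes.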
5 Task 5
a) For the case n = 2, we have:
\[
p_f(x_1, x_2) = \prod_{i=1}^{2} p_f(x_i \mid x_{<i})
= \prod_{i=1}^{2} \mathcal{N}(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i}))
= p_f(x_1) \cdot p_f(x_2 \mid x_1)
= \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \cdot \mathcal{N}(x_2 \mid \mu_2(x_1), \sigma_2^2(x_1))
\]
where \(\mu_1\) and \(\sigma_1^2\) are constants, and \(\mu_2(x_1)\) and \(\sigma_2^2(x_1)\) are functions of \(x_1\) represented by the neural networks.
So
\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}
\]
is a mixture of truncated Gaussians whose mixture weights depend on ϵ.
b) To show the models cover the same hypothesis space of distributions, we need to prove that given any choice of \(\{\mu_i, \sigma_i\}_{i=1}^{n}\) in the forward model, there always exists a choice of \(\{\tilde{\mu}_i, \tilde{\sigma}_i\}_{i=1}^{n}\) in the reverse model such that \(p_f = p_r\). The key is to show that any conditional distribution \(p_f(x_i \mid x_{<i})\) in the forward model can be matched by a corresponding conditional \(p_r(x_i \mid x_{>i})\) in the reverse model.
In the forward model, we have:
\[
p_f(x_i \mid x_{<i}) = \mathcal{N}(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})).
\]
In the reverse model, we need to choose \(\tilde{\mu}_i\) and \(\tilde{\sigma}_i\) as functions of \(x_{>i}\) such that:
\[
p_r(x_i \mid x_{>i}) = \mathcal{N}(x_i \mid \tilde{\mu}_i(x_{>i}), \tilde{\sigma}_i^2(x_{>i})) = p_f(x_i \mid x_{<i}).
\]
Since the neural networks are universal function approximators, they can represent these mappings from \(x_{>i}\) to the required mean and variance.
Therefore, the forward and reverse models cover the same hypothesis space. A concrete counterexample or additional mathematical definitions are not needed: the universality of the neural networks is sufficient to prove they can match any distribution in the hypothesis space.
6 Task 6
6.1 a)
To find an efficient bit representation for the 50257 tokens, we want to represent each token as \((a_1, a_2, \ldots, a_n)\), where \(a_i \in \{0, 1\}\) for all \(i = 1, 2, \ldots, n\). The minimal \(n\) we can use is \(n = \lceil \log_2(50257) \rceil = 16\).
So the minimal number of bits needed to represent each token is 16.
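The count can be confirmed in a couple of lines:

```python
import math

vocab = 50257
n_bits = math.ceil(math.log2(vocab))
print(n_bits)   # 16

# 16 bits suffice (2^16 = 65536 >= 50257) and 15 do not (2^15 = 32768 < 50257).
assert 2 ** (n_bits - 1) < vocab <= 2 ** n_bits
```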
6.2 b)
If the number of possible tokens increases from 50257 to 60000, the increase in the number of parameters is computed as follows.
The number of parameters in the GPT-2 model is 4 × 768 × 50257 = 154,389,504.
If the number of tokens increases to 60000, the new number of parameters would be: 4 × 768 × 60000 = 184,320,000.
The increase is: 184,320,000 − 154,389,504 = 29,930,496 parameters.
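The arithmetic can be reproduced directly, keeping the 4 × 768 per-token factor assumed in this solution:

```python
d_model = 768
factor = 4                      # per-token parameter multiple assumed in this solution

def n_params(vocab_size):
    # Total parameter count attributed to a vocabulary of the given size.
    return factor * d_model * vocab_size

old, new = n_params(50257), n_params(60000)
print(old, new, new - old)   # 154389504 184320000 29930496
```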
6.3 c)
To generate paper abstracts using the GPT-2 model:
1. Implement the sampling procedure for GPT-2 as described in the image.
2. Choose 5 sentences from the abstracts of some NeurIPS papers.
3. For each chosen sentence:
(a) Tokenize the sentence into a sequence of token IDs.
(b) Feed this sequence into the trained GPT-2 model.
(c) Sample the next token from the softmax probability distribution over the vocabulary outputted by GPT-2.
(d) Append the sampled token to the sequence and repeat from (b) until an end-of-sequence token is generated or a maximum length is reached.
4. Detokenize the generated token sequences back into text to obtain the generated abstract continuations.
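The loop in steps (b)–(d) can be sketched as follows. This is a minimal sketch, not the actual GPT-2 pipeline: `next_token_probs` is a hypothetical stand-in for a model forward pass (any function mapping a token prefix to a distribution over the vocabulary), and the vocabulary size, EOS token ID, and maximum length are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS, MAX_LEN = 50, 0, 20      # toy stand-ins; real GPT-2 has 50257 tokens

def next_token_probs(tokens):
    # Hypothetical stand-in for the model: returns a softmax distribution
    # over the vocabulary given the current token prefix.
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_continuation(prompt_tokens):
    tokens = list(prompt_tokens)
    while len(tokens) < MAX_LEN:
        probs = next_token_probs(tokens)      # steps (b)/(c): forward pass + softmax
        tok = rng.choice(VOCAB, p=probs)      # sample the next token
        tokens.append(tok)                    # step (d): append and repeat
        if tok == EOS:
            break
    return tokens

out = sample_continuation([3, 1, 4])
print(out)
```

Swapping in a real tokenizer and model forward pass for the stand-ins recovers the full procedure.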
6.4 f)
To perform temperature scaling, we pick a scalar temperature T > 0 and divide the next-token logits \(z\) by T before the softmax:
\[
p_T(x_i = a_j \mid x_{<i}) = \frac{\exp(z_j / T)}{\sum_{l} \exp(z_l / T)} \propto p(x_i = a_j \mid x_{<i})^{1/T}.
\]
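The effect of T is easy to see on a toy logit vector: T < 1 sharpens the distribution toward the argmax, T > 1 flattens it toward uniform.

```python
import numpy as np

def temperature_softmax(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
base = temperature_softmax(logits, 1.0)    # ordinary softmax
sharp = temperature_softmax(logits, 0.5)   # more peaked on the argmax
flat = temperature_softmax(logits, 2.0)    # closer to uniform
print(base, sharp, flat)
```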
6.5 g)
In the previous question, temperature scaling was only performed over the next token, i.e. with T < 1 we made likely next tokens even more likely. To make likely sentences even more likely, consider joint temperature scaling:
\[
p_T^{\text{joint}}(x_0, x_1, \ldots, x_M) \propto e^{\log p(x_0, x_1, \ldots, x_M)/T}
\]
Applying the chain rule with single-token temperature scaling does not recover joint temperature scaling. In other words, the following equation does not hold for arbitrary T:
\[
\prod_{i=0}^{M} p_T(x_i \mid x_{<i}) \stackrel{?}{=} p_T^{\text{joint}}(x_0, x_1, \ldots, x_M)
\]
Detailed solution:
The equation does not hold for arbitrary T because applying the chain rule
with single-token temperature scaling does not recover joint temperature scaling.
To see this, let’s expand both sides:
\[
\prod_{i=0}^{M} p_T(x_i \mid x_{<i}) = \prod_{i=0}^{M} \frac{p(x_i \mid x_{<i})^{1/T}}{\sum_{x_i'} p(x_i' \mid x_{<i})^{1/T}}
\]
The denominators are different: in the chain rule, each denominator is a sum
over a single variable x′i , while in joint temperature scaling, the denominator is
a sum over all possible sequences. The normalizing constants are not the same,
so the equation does not hold in general.
Intuitively, joint temperature scaling directly modifies the joint probability,
while applying single-token temperature scaling at each step does not necessarily
have the same effect on the joint probability.
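A small two-token example makes the mismatch concrete; the distributions below are arbitrary toy tables chosen for illustration.

```python
import numpy as np

T = 0.5
p0 = np.array([0.7, 0.3])                  # p(x0)
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x1 | x0), rows indexed by x0

joint = p0[:, None] * p1                   # p(x0, x1), shape (2, 2)

# Joint temperature scaling: normalize p^(1/T) over ALL sequences at once.
joint_T = joint ** (1 / T)
joint_T /= joint_T.sum()

# Per-token temperature scaling: normalize each conditional separately.
p0_T = p0 ** (1 / T); p0_T /= p0_T.sum()
p1_T = p1 ** (1 / T); p1_T /= p1_T.sum(axis=1, keepdims=True)
chain_T = p0_T[:, None] * p1_T

print(joint_T)
print(chain_T)
print(np.allclose(joint_T, chain_T))   # False: the normalizers differ
```

Both results are valid distributions over the four sequences, but they are not the same distribution, which is exactly the claim above.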
6.6 h)
Next, we will implement temperature scaling over more than one token (for
simplicity, we will do temperature scaling over two tokens):
\[
p_T^{\text{joint-2}}(x_i, x_{i+1} \mid x_{<i}) \propto e^{\log p(x_i, x_{i+1} \mid x_{<i})/T}
\]
\[
p_T^{\text{joint-2}}(x_i \mid x_{<i}) = \sum_{j} p_T^{\text{joint-2}}(x_i, x_{i+1} = a_j \mid x_{<i})
\]
Detailed solution:
\[
p_T^{\text{joint-2}}(x_i, x_{i+1} \mid x_{<i}) \propto p(x_i, x_{i+1} \mid x_{<i})^{1/T}
\]
This can be done by taking the logits for the next two tokens, applying
softmax with temperature T , and reshaping the result into a matrix.
To compute the marginal probability of the next token xi , we sum over all
possible values of xi+1 :
\[
p_T^{\text{joint-2}}(x_i \mid x_{<i}) = \sum_{j} p_T^{\text{joint-2}}(x_i, x_{i+1} = a_j \mid x_{<i})
\]