
COMP4332 and RMBI4310 Final Exam

Date: May 21, 2021


Time: 12:30 PM - 3:30 PM
Instructor: Yangqiu Song

Name: Student ID:

Question   Score    Question   Score
1          / 10     6          / 13
2          / 10     7          / 11
3          / 8      8          / 10
4          / 9      9          / 9
5          / 10     10         / 10
Total:     / 100

1 Yes/No Questions (10 Points)

1. Any continuous function can be approximated with arbitrarily small error by a two-layer
network.
2. Using dropout can improve generalization. For an MLP binary classifier, increasing the
dropout rate from 0 to 1 will increase both the training F1 and the test F1.
3. Increasing the depth and width of neural networks will increase expressiveness as well
as the time complexity. When we tune the neural network hyperparameters, we always
prefer deeper and wider models within the time & space budget.
4. CNNs can be better than MLPs when the data contains certain important inductive
biases; for example, objects are represented by contiguous blocks of pixels in image
data.
5. Suppose we have Classifier-1 to conduct sentiment classification on long reviews
containing more than 500 words. Classifier-1 has a CNN layer to encode the word
embeddings. Now Alex constructs Classifier-2 by using an RNN layer to replace the
CNN layer. If Alex carefully keeps the number of parameters the same, then
Classifier-1 and Classifier-2 will take the same time to make a prediction on the same long
review text.
6. Negative sampling is an important optimization technique for training Node2Vec. Without
negative sampling, the training time will increase significantly, but the overall performance
will remain the same.
7. In DeepWalk, using a longer walk length will include more information in each walk, so
the performance will be improved.
8. The DeepWalk model is a special case of the node2vec model.
9. Though user-based CF and item-based CF are two different approaches to recommendation,
the two approaches will recommend the same items to a specific user if and only if we input the same data.
10. Models with stronger expressive power can always have better performance on specific
tasks.

2 N-Gram Feature and TF-IDF (10 Points)

Consider three documents (all words are in lower case):


Document 1: natural language processing is a subfield of linguistics, computer science, and
artificial intelligence concerned with the interactions between computers and human language
Document 2: the goal of natural language processing is to understand the contents of documents
Document 3: the tools can accurately extract information and insights from natural language
as well as categorize and organize the documents.

1. Write down all bi-gram features of the second document.


2. Calculate the TF-IDF scores for the following uni-grams and bi-grams: “documents”,
“is”, “natural language”, and “processing” in Documents 1, 2, and 3.
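A minimal Python sketch of how such features could be computed, assuming whitespace tokenization with punctuation stripped, tf taken as the raw term count in a document, and idf = log(N / df) with N the number of documents and df the number of documents containing the term; other TF-IDF conventions exist, so the exact weighting (and the helper names `ngrams`, `tf_idf`) are assumptions for illustration only:

```python
import math

docs = {
    1: "natural language processing is a subfield of linguistics computer science and "
       "artificial intelligence concerned with the interactions between computers and human language",
    2: "the goal of natural language processing is to understand the contents of documents",
    3: "the tools can accurately extract information and insights from natural language "
       "as well as categorize and organize the documents",
}

def ngrams(text, n):
    """Return the list of n-gram features of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc_id, docs):
    """TF-IDF with raw-count tf and idf = log(N / df); one common convention among several."""
    n = len(term.split())
    tf = ngrams(docs[doc_id], n).count(term)
    df = sum(1 for d in docs.values() if term in ngrams(d, n))
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

print(ngrams(docs[2], 2))                      # bi-gram features of Document 2
for term in ["documents", "is", "natural language", "processing"]:
    print(term, [round(tf_idf(term, d, docs), 3) for d in (1, 2, 3)])
```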

3 CNN (8 Points)

1. 1D convolution: consider a 1D input sequence with token representation
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
and a kernel with weights [1, 2, 3] and bias 1. Calculate the output of the sequence passing
through this kernel.
a. With stride two and ‘valid’ padding (2 Points).

2. 2D convolution: We have a batch of 8 input images, which forms an 8 × 128 × 128 × 3 tensor,
and convolve it with 20 filters of size 5 × 5, using a stride of 3 and ‘valid’ padding. Note
that each filter in the convolution layer produces one and only one output channel.
a. Compute the output size (2 Points).
b. Compute the number of all parameters (2 Points).
c. Briefly describe the difference between mini-batch gradient descent and stochastic gradient
descent (2 Points).
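A minimal numpy sketch of the convolution conventions assumed above (cross-correlation as implemented in deep-learning frameworks, ‘valid’ padding, output length ⌊(n − k)/stride⌋ + 1); it is only a sanity check of the formulas under those assumptions, not the official solution:

```python
import numpy as np

# 1D convolution (cross-correlation), 'valid' padding, stride 2
x = np.arange(1, 15)          # [1, 2, ..., 14]
w = np.array([1, 2, 3])
bias = 1
stride = 2
out = [int(x[i:i + 3] @ w + bias) for i in range(0, len(x) - 3 + 1, stride)]
print(out)                    # one value per valid window position

# 2D convolution output size and parameter count, 'valid' padding
n, k, s = 128, 5, 3
out_hw = (n - k) // s + 1     # spatial size: floor((128 - 5) / 3) + 1
filters, in_ch = 20, 3
params = filters * (k * k * in_ch) + filters   # weights plus one bias per filter
print((8, out_hw, out_hw, filters), params)
```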

4 GRU (9 Points)

Figure 1: GRU structure

1. For a single unit in a GRU network, we have the following equations:

$Z_t = \sigma(W_Z x_t + U_Z h_{t-1} + b_Z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$\tilde{h}_t = \sigma(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
$h_t = (1 - Z_t) \odot h_{t-1} + Z_t \odot \tilde{h}_t$

$x_t \in \mathbb{R}^D$ ($\mathbb{R}^D$ means a $D$-dimensional vector) is the input of step $t$, and $h_t \in \mathbb{R}^d$ is the
hidden state of step $t$. Calculate the total number of parameters to be estimated in the
GRU unit (based on the dimensionalities of all vectors and matrices). Here we assume the
input $x_t$ is a fixed embedding vector, so please do not count it among the parameters.
2. Then we use the GRU unit to construct a two-layer bi-directional network whose structure
is shown in Figure 1. For this stacked bi-directional network, we add one more
equation to calculate each layer's output $O_t^{(k)} \in \mathbb{R}^m$:

$O_t^{(k)} = \mathrm{softmax}\left(\overrightarrow{W}^{(k)} \overrightarrow{h}_t^{(k)} + \overleftarrow{W}^{(k)} \overleftarrow{h}_t^{(k)}\right)$

Calculate the total number of parameters to be estimated in this stacked bi-directional
network.
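A minimal sketch of one way to count the single-unit parameters implied by the equations above, assuming each of $W_Z, W_r, W_h \in \mathbb{R}^{d \times D}$, each of $U_Z, U_r, U_h \in \mathbb{R}^{d \times d}$, and each bias in $\mathbb{R}^d$; the count for the stacked bi-directional network additionally depends on the layer wiring in Figure 1, which is not reproduced here:

```python
def gru_unit_params(D, d):
    """Parameters of one GRU unit with input size D and hidden size d,
    counting the three (W, U, b) groups for Z_t, r_t and the candidate state."""
    per_gate = d * D + d * d + d   # W in R^{d x D}, U in R^{d x d}, b in R^d
    return 3 * per_gate

# Example with hypothetical sizes D = 300, d = 128
print(gru_unit_params(300, 128))
```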

5 Sentiment Analysis (10 Points)

Figure 2: One example of sentiment analysis for term “sushi rolls”

Consider a novel gated CNN model structure that performs sentiment analysis for a term
given in the input sentence. Sometimes an input sentence may be composed of multiple
terms, and the gated CNN model is able to analyze the sentiment of each term. The model
architecture with a one-term example is shown in Figure 2. For simplicity, we only consider
the 3-gram case here, where a 1-D convolution with kernel size 3 is used for processing input
sentences.
The gated CNN model consists of an embedding layer for both context and target term, a
one-dimensional convolutional layer, a GTRU (Gated Tanh-ReLU Units) layer, a global max-
pooling layer (global pooling means taking the maximum value over a given input), and finally
a fully connected layer before softmax with cross-entropy loss.
For context embedding, the model can accept sentences of length n; here, no input has
length ≥ n, and padding will be used for inputs with length < n. For target embeddings, we
assume the max length of a term is 4.
For context convolution, we only consider 1 convolution kernel with 2 filters (1 filter for tanh
and 1 filter for ReLU) and kernel size 3, which is exactly the same as shown in Figure 2. For
target convolution, we only consider 1 convolution kernel with 1 filter and kernel size 3. The
stride is one for both context convolution and target convolution with ‘valid’ padding.
For the GTRU layer, it can be formulated as:

$a_i = \mathrm{ReLU}(X_{i:i+k} \cdot W_a + \mathrm{MaxPooling}(v_a \cdot W_v))$
$s_i = \tanh(X_{i:i+k} \cdot W_s)$
$c_i = a_i \times s_i$

where $\cdot$ refers to matrix multiplication, $i$ refers to the $i$-th location of a sequence, $X_{i:i+k}$ refers
to the context embeddings $(X_i, X_{i+1}, \ldots, X_{i+k-1})$ being convolved with stride $k$, $W_a$ and $W_s$
refer to the convolution kernels for the context, $v_a$ refers to the target term embeddings associated with
the context (for example, the embeddings of [<PAD>, sushi, rolls] and [sushi, rolls, <PAD>]), $W_v$
refers to the convolution kernel for $v_a$, and $c_i$ is the output of the GTRU.
Both the Max Pooling in the GTRU and the Max Pooling layer refer to global max pooling, for example:

$\mathrm{MaxPooling}([1\ 2\ 3\ 4\ 5\ 6]) = 6$

All bias terms can be ignored.


For the input word embeddings, we assume the embedding dimension to be d, and we want to
predict 3 labels: positive, neutral, and negative.
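A minimal numpy sketch of the GTRU computation described above, using stride one and ‘valid’ padding as stated; the sizes (n = 6, term length 4, d = 8) are hypothetical and chosen only to make the shapes concrete, since the actual shapes are what Question 1 asks you to work out:

```python
import numpy as np

n, t_len, d = 6, 4, 8            # hypothetical: context length, max term length, embedding dim
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))      # context embeddings
v = rng.normal(size=(t_len, d))  # target term embeddings

W_a = rng.normal(size=(3 * d,))  # context kernel for the ReLU branch (kernel size 3, 1 filter)
W_s = rng.normal(size=(3 * d,))  # context kernel for the tanh branch
W_v = rng.normal(size=(3 * d,))  # target kernel (kernel size 3, 1 filter)

# Target convolution followed by global max pooling -> a single scalar
target_conv = np.array([v[j:j + 3].ravel() @ W_v for j in range(t_len - 3 + 1)])
target_feat = target_conv.max()

# GTRU over the context, stride 1, 'valid' padding
a = np.maximum(0.0, np.array([X[i:i + 3].ravel() @ W_a for i in range(n - 3 + 1)]) + target_feat)
s = np.tanh(np.array([X[i:i + 3].ravel() @ W_s for i in range(n - 3 + 1)]))
c = a * s                        # GTRU output, one value per window
print(c.shape, c.max())          # global max pooling before the fully connected layer
```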
Questions:
1. For a single input sentence, write the shape of

• intermediate values after Embedding for Context and Target (1 Point)


• intermediate values after Convolution for Context and Target (1 Point)
• intermediate values after GTRU, the final Max Pooling before Fully Connected Layer,
and Fully Connected Layer (4 Points)

2. Calculate the number of trainable parameters. Here you can assume that all embed-
dings are pre-trained and frozen, and you don’t have to consider embeddings as trainable
parameters. (2 Points)
3. Previous works mostly use RNN-based models such as LSTM and GRU to perform sentiment
analysis. What is the advantage of the proposed gated CNN compared with RNN-based models?
(2 Points)

6 Another Type of CNN (13 Points)

A new type of CNN has recently been proposed to tackle binary classification of sentences. To begin
with, we consider a single sentence input of length $n \geq 3$: $s = [w_1, \ldots, w_n]$.
This CNN contains 7 layers:
L1 An embedding layer that maps each word $w_i$ into the embedding $h_i^{(1)} \in \mathbb{R}^d$. Let the
vocabulary size be $V$. The output of this layer is a matrix $H^{(1)} = [h_1^{(1)}, \ldots, h_n^{(1)}]^\top \in \mathbb{R}^{n \times d}$,
i.e. $n$ rows and $d$ columns.
L2 A convolution layer with kernel size 3, stride size 1 and $c_1$ output channels, with
valid padding and element-wise ReLU activation, is applied to $H^{(1)}$. The output of this
layer is a matrix $H^{(2)}$.
L3 A transpose layer that takes the transpose of the input matrix, so we have the output
$H^{(3)} = H^{(2)\top}$.
L4 Another convolution layer with kernel size 1, stride size 1 and $c_2$ output channels,
with valid padding and element-wise ReLU activation, is applied to $H^{(3)}$. The output
of this layer is a matrix $H^{(4)}$.
L5 Another transpose layer outputs $H^{(5)} = H^{(4)\top}$.
L6 A max pooling layer takes the maximum value of each column, and we get the representation
vector $H^{(6)}$.
L7 A classifier layer that uses an MLP with a single hidden layer of $s$ neurons to conduct the
binary classification. We only need to predict the probability of the true label $p_t$; the
probability of the false label can then be computed as $1 - p_t$.
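A minimal Keras sketch of the seven layers, useful only for sanity-checking shapes and per-layer parameter counts; the concrete sizes below are hypothetical placeholders, and the bias handling is whatever Keras applies by default, so check it against your own derivation:

```python
import tensorflow as tf
from tensorflow.keras import layers

n, d, V = 50, 128, 10000    # hypothetical sentence length, embedding dim, vocab size
c1, c2, s = 64, 32, 16      # hypothetical channel and hidden sizes

inp = layers.Input(shape=(n,), dtype="int32")
h1 = layers.Embedding(V, d)(inp)                                     # L1: (n, d)
h2 = layers.Conv1D(c1, 3, padding="valid", activation="relu")(h1)    # L2: (n - 2, c1)
h3 = layers.Permute((2, 1))(h2)                                      # L3: transpose -> (c1, n - 2)
h4 = layers.Conv1D(c2, 1, padding="valid", activation="relu")(h3)    # L4: (c1, c2)
h5 = layers.Permute((2, 1))(h4)                                      # L5: transpose -> (c2, c1)
h6 = layers.GlobalMaxPooling1D()(h5)                                 # L6: column-wise max -> (c1,)
hid = layers.Dense(s, activation="relu")(h6)                         # L7: hidden layer of s neurons
p_t = layers.Dense(1, activation="sigmoid")(hid)                     # L7: probability of the true label

model = tf.keras.Model(inp, p_t)
model.summary()   # shows each intermediate shape and per-layer parameter count
```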

Based on your understanding of MLP, CNN, and max pooling, answer the following questions.

1. Write down the number of parameters at layers L1 to L7, respectively. If some layer
does not require any parameter, write 0. (Hint: you should also consider the bias terms
where applicable.)
2. Now consider the batched version of the implementation. That is, at each time, k sentences
of length n are fed into the model. For convenience, we require that the first dimension
of all intermediate outputs indicates the batch size k. In this way, the shape of $H^{(1)}$ is
(k, n, d). Write down the size of each intermediate output from $H^{(2)}$ to $H^{(6)}$. (Hint: you
can consider the dimensions of the intermediate outputs given a single sample input first.)

7 GNN Computation (11 Points)

Figure 3: HAG

Graph Neural Networks are based on repeated aggregation of information from nodes' neighbors.
Hierarchically Aggregated computation Graphs (HAGs) were proposed as a new GNN representation
technique that explicitly avoids redundancy by managing intermediate aggregation results
hierarchically, eliminating repeated computation and unnecessary data transfer in GNN
training and inference.
HAGs eliminate redundancy in the GNN-graph representation by hierarchically managing
and reusing intermediate aggregation results. A HAG $\hat{G} = (\hat{V}, \hat{E})$ has nodes $\hat{V} = V \cup V_A$ and
edges $\hat{E}$, where $V$ is the set of nodes in the original graph and $V_A$ is a new set of aggregation
nodes. Each aggregation node in $V_A$ represents the intermediate aggregation result for a
subset of nodes (i.e., an aggregation over a subset of the $h_v^{(k-1)}$). For the example in Figure 3(c), the
new nodes AB and CD denote the aggregation results of A,B and C,D, respectively. A HAG
can contain a multi-level aggregation hierarchy. For example, Figure 3(c) can also have a
third aggregation node BCD that depends on AB and CD. Similar to edges in GNN-graphs,
an edge (u, v) in a HAG denotes an aggregation relation: computing v's activations requires
aggregating u's activations.
Note: Ignore the sequential aggregate and the activation function in Aggregation in the following
discussion.
HAG search algorithm: Before computing the final representation, we need to build a
HAG for the given graph.
Algorithm 1 shows the pseudocode of the HAG search algorithm. We start with an input
GNN-graph and iteratively insert aggregation nodes into the current HAG to merge highly
redundant aggregations and remove unnecessary computation and data transfer.

1. Consider the graph in Figure 4. Please draw the HAG according to Algorithm 1 for the
1-layer GNN graph.
2. Let's consider a 1-layer GraphSAGE network and the representation function:
$h_v^1 = W_1 \cdot \mathrm{CONCAT}\left(\mathrm{AGGREGATE}_1\left(\{h_u^0, \forall u \in N(v)\}\right), h_v^0\right)$

Algorithm 1 A HAG search algorithm
Input: A GNN-graph G and a GNN model M
Output: An equivalent HAG with optimized performance
1: function REDUNDANCY(v1, v2, Ê)
2:     R = {u | (v1, u) ∈ Ê ∧ (v2, u) ∈ Ê}
3:     return |R|
4: end function
5:
6: V_A ← Ø, Ê ← E
7: while |V_A| < capacity do
8:     (v1, v2) = arg max_{v1,v2} REDUNDANCY(v1, v2, Ê)
9:     if REDUNDANCY(v1, v2, Ê) > 1 then
10:        V_A ← V_A + {w} where w is a new node
11:        Ê ← Ê + (v1, w) + (v2, w)
12:        for u ∈ V do
13:            if (v1, u) ∈ Ê ∧ (v2, u) ∈ Ê then
14:                Ê ← Ê − (v1, u) − (v2, u) + (w, u)
15:            end if
16:        end for
17:    end if
18: end while
19: return V_A ∪ V, Ê
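A minimal Python sketch of the greedy search loop above, run on a toy directed-edge set rather than the actual Figure 4 graph (which is not reproduced here); the `capacity` budget and the early break when no pair shares more than one target are the same behavior the pseudocode assumes:

```python
from itertools import combinations

def redundancy(v1, v2, edges):
    """Number of common successors of v1 and v2, i.e. |{u : (v1,u) and (v2,u) in edges}|."""
    return len({u for (a, u) in edges if a == v1} & {u for (b, u) in edges if b == v2})

def hag_search(nodes, edges, capacity):
    """Greedy HAG construction following Algorithm 1 (toy version, unordered pair search)."""
    e_hat = set(edges)
    v_a = []
    while len(v_a) < capacity:
        v1, v2 = max(combinations(list(nodes) + v_a, 2),
                     key=lambda p: redundancy(p[0], p[1], e_hat))
        if redundancy(v1, v2, e_hat) <= 1:
            break                              # no pair is shared by more than one target
        w = f"agg{len(v_a)}"                   # new aggregation node
        v_a.append(w)
        e_hat |= {(v1, w), (v2, w)}
        for u in list(nodes):                  # rewire only the original nodes, as in line 12
            if (v1, u) in e_hat and (v2, u) in e_hat:
                e_hat -= {(v1, u), (v2, u)}
                e_hat.add((w, u))
    return v_a, e_hat

# Toy example: A and B both feed into C and D (hypothetical graph, not Figure 4)
nodes = {"A", "B", "C", "D"}
edges = {("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")}
print(hag_search(nodes, edges, capacity=2))
```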

Figure 4: Input graph for Q7

Use MAX as the aggregation function. Please demonstrate what intermediate representations
should be saved for all elements in $V_A$.
3. We have three different aggregators to compute the node representations:

$\mathrm{MEANAGGREGATE}_k = \mathrm{MEAN}(\{h_u^{k-1}, \forall u \in N(v)\})$

$\mathrm{MAXAGGREGATE}_k = \mathrm{MAX}(\{h_u^{k-1}, \forall u \in N(v)\})$

$\mathrm{SUMAGGREGATE}_k = \mathrm{SUM}(\{h_u^{k-1}, \forall u \in N(v)\})$

The HAG technique can speed up the computation of all three aggregators. Please compare
the acceleration rate $A_r$ in these settings on the same graph.
Hint: $A_r = T(\mathrm{HAG})/T(\mathrm{GNN})$, where $T$ is the computation time. You can assume all
operations' running times are similar.

8 Node2Vec (10 Points)

Figure 5: Input graph

In this question, you are required to list all the paths generated from the biased random walk
process. Assume that F is the second-to-last node, which is indicated with a yellow line, and D is
the last node, which is indicated with a blue line. Assume p is 2, q is 0.5, and the remaining
walk length is 2. You should list all possible paths and their associated probabilities.
Please write down all the necessary steps for computing the probabilities. Each probability
should be given as a fraction rather than a decimal.
Hint: your path should start with node D and contain 3 nodes.
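A minimal sketch of the node2vec transition bias used in such a walk, with a hypothetical adjacency standing in for Figure 5 (which is not reproduced here); exact fractions are kept via `fractions.Fraction` to match the required answer format:

```python
from fractions import Fraction

# Hypothetical undirected graph; replace with the actual adjacency of Figure 5.
adj = {"D": {"F", "A", "B"}, "F": {"D", "A"}, "A": {"D", "F", "B"}, "B": {"D", "A"}}

def step_probs(prev, cur, adj, p=Fraction(2), q=Fraction(1, 2)):
    """Unnormalized node2vec bias: 1/p to return to prev, 1 for common neighbors
    of prev and cur (distance 1 from prev), 1/q for nodes at distance 2 from prev."""
    weights = {}
    for x in adj[cur]:
        if x == prev:
            weights[x] = 1 / p
        elif x in adj[prev]:
            weights[x] = Fraction(1)
        else:
            weights[x] = 1 / q
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# One step of the walk ... F -> D -> ? with p = 2, q = 1/2
print(step_probs("F", "D", adj))
```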

9 NGCF (9 Points)

Consider the following Neural Graph Collaborative Filtering (NGCF) model:

Figure 6: NGCF

Here are some explanations of the structure:

1. Embedding layer: we describe each user and item with a $d$-dimensional vector. Both embeddings
are trainable.
2. Embedding propagation layer: we build upon the message-passing architecture of GNNs
in order to capture the CF signal along the graph structure and refine the embeddings of
users and items.
First-order propagation. We build upon the preference basis to perform embedding
propagation between the connected users and items, formulating the process with two
major operations: message construction and message aggregation.
Message construction. For a connected user-item pair $(u, i)$, we define the message
from $i$ to $u$ as:

$m_{u \leftarrow i} = \frac{1}{\sqrt{|N_u|\,|N_i|}}\left(W_1 e_i + W_2 (e_i \odot e_u)\right),$

where $W_1, W_2 \in \mathbb{R}^{d' \times d}$ are the trainable weight matrices to distill useful information for
propagation, $d'$ is the transformation size, $\odot$ denotes the element-wise product, and $N_u$
and $N_i$ denote the first-hop neighbors of user $u$ and item $i$.
Message aggregation. In this stage, we aggregate the messages propagated from
$u$'s neighborhood to refine $u$'s representation. Specifically, we define the aggregation
function as:

$e_u^{(1)} = \mathrm{LeakyReLU}\left(W_1 e_u + \sum_{i \in N_u} m_{u \leftarrow i}\right),$

where $e_u^{(1)}$ denotes the representation of user $u$ obtained after the first embedding propagation
layer, and $W_1$ is the weight matrix shared with the one used in message construction.
Analogously, we can obtain the representation $e_i^{(1)}$ for item $i$ by propagating information
from its connected users using the shared weight matrices.
High-order propagation. With the representations augmented by first-order connectivity
modeling, we can stack more embedding propagation layers to explore the
high-order connectivity information. In the $l$-th step, the representation of user $u$ is
recursively formulated as:

$e_u^{(l)} = \mathrm{LeakyReLU}\left(m_{u \leftarrow u}^{(l)} + \sum_{i \in N_u} m_{u \leftarrow i}^{(l)}\right),$

wherein the messages being propagated are defined as:

$m_{u \leftarrow i}^{(l)} = \frac{1}{\sqrt{|N_u|\,|N_i|}}\left(W_1^{(l)} e_i^{(l-1)} + W_2^{(l)} \left(e_i^{(l-1)} \odot e_u^{(l-1)}\right)\right),$
$m_{u \leftarrow u}^{(l)} = W_1^{(l)} e_u^{(l-1)},$

where the superscript $l$ denotes the variables in layer $l$.
Model prediction: After propagating with $L$ layers, we obtain multiple representations
for the user and the item. We concatenate the representations learned by different layers
to get the final embeddings: $e_u^* = e_u^{(0)} \| \cdots \| e_u^{(L)}$, $e_i^* = e_i^{(0)} \| \cdots \| e_i^{(L)}$, where $\|$ is the
concatenation operation. Finally, we conduct the inner product to estimate the user's
preference towards the target item: $\hat{y}_{NGCF}(u, i) = e_u^{*\top} e_i^*$.

Here are some assumptions:

1. There are 1000 unique users and 2000 unique items in the training set.
2. The dimensions of user and item embeddings are both 100.
3. The number of embedding propagation layers is 3.
4. The dimensions of the intermediate outputs of the three embedding propagation layers are
[80, 60, 40], respectively.

Compute the number of all trainable parameters of this model.
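A minimal sketch of one way to organize the count under the assumptions above, assuming each propagation layer $l$ carries the two weight matrices $W_1^{(l)}, W_2^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ and no bias terms; whether biases should be counted is an assumption you should settle against the question's intent:

```python
# Embedding table sizes and propagation-layer dimensions from the assumptions above
n_users, n_items, d0 = 1000, 2000, 100
layer_dims = [80, 60, 40]

embedding_params = (n_users + n_items) * d0

prop_params = 0
d_in = d0
for d_out in layer_dims:
    prop_params += 2 * d_out * d_in   # W1 and W2 of this layer, no bias assumed
    d_in = d_out

print(embedding_params, prop_params, embedding_params + prop_params)
```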

10 NMF (10 Points)

Consider the Neural Matrix Factorization (NMF) model:

Assume that:

• For both user (u) and item (i) feature vectors at the bottom, we only use one-hot
encoding
• We have 1,000 users and 500 items
• Both MF User Vector and MF Item Vector have embedding size 100
• Both MLP User Vector and MLP Item Vector have embedding size 200
• X = 3. That is, we use 3 fully connected layers with ReLU activation: MLP Layer 1
(Dense 512), MLP Layer 2 (Dense 256), MLP Layer 3 (Dense 128).
• For the element-wise product between the MF User Vector $p_u$ and the MF Item Vector $q_i$, we assign
learnable edge weights $h$ to the element-wise product and an activation function $\alpha$ to make
GMF generalized:

$\mathrm{out}_{GMF} = \alpha\left(h^\top (p_u \odot q_i)\right)$

• All the bias terms can be ignored.

1. Compute the number of all trainable parameters of this network. (8 Points)


2. For one-hot encoding, does the NMF model have a cold-start problem? If it suffers from the
cold-start problem, can you propose any method to address it? Explain your answer
briefly. (Hint: you may consider other encoding methods.) (2 Points)
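A minimal sketch of how the listed pieces contribute to the parameter count, with biases ignored as stated; how the GMF and MLP branches are fused at the top depends on the model figure (not reproduced here), so that final layer is intentionally left out of this sketch rather than guessed:

```python
n_users, n_items = 1000, 500
mf_dim, mlp_dim = 100, 200

# Embedding tables for the two branches (one-hot input -> embedding lookup)
mf_embeddings = (n_users + n_items) * mf_dim
mlp_embeddings = (n_users + n_items) * mlp_dim

# MLP tower on the concatenated user/item MLP vectors, no biases
mlp_layers = [512, 256, 128]
mlp_params, d_in = 0, 2 * mlp_dim
for d_out in mlp_layers:
    mlp_params += d_in * d_out
    d_in = d_out

# GMF edge weights h over the element-wise product p_u * q_i
gmf_params = mf_dim

# The fusion of the GMF and MLP outputs into the final score is figure-dependent
# and omitted here on purpose.
print(mf_embeddings + mlp_embeddings + mlp_params + gmf_params)
```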

