Natural Language Processing
CPSC 436N, Term 1
Lecture 6: Text Classification 2 - Neural Models

Instructor: Vered Shwartz
https://www.cs.ubc.ca/~vshwartz

[Figure: example documents sorted into categories such as Technology, Sports, and Politics]
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with the MLP; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Today
Text Classification
• MLP classifier
• CNN classifier
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in the layers!)
Output layer (σ node): y = σ(w ⋅ x + b), where y is a scalar
Weights: w = [w1, …, wn] (a vector), bias b (a scalar)
Input layer: x1, …, xn, +1 (the vector x)
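As a minimal sketch (not from the slides), the same 1-layer view in PyTorch; the input dimension n and the use of nn.Linear are illustrative assumptions:

import torch
import torch.nn as nn

# Minimal sketch: binary logistic regression as a 1-layer network.
# n is an assumed number of input features (not specified on the slide).
n = 4
layer = nn.Linear(n, 1)              # computes w ⋅ x + b (weight vector w, scalar bias b)

x = torch.randn(n)                   # an example input vector x
y = torch.sigmoid(layer(x))          # y = σ(w ⋅ x + b), a scalar in (0, 1)
print(y.item())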
Two-Layer Network with Scalar Output
Output layer: y = σ(W2h + b2), where y is a scalar
Hidden units: h = tanh(W1x + b1)   (the non-linearity could also be ReLU)
Input layer (vector x): x1, …, xn, +1
Multinomial Logistic Regression as a 1-layer Network
Output layer (softmax nodes): y = softmax(Wx + b), where y is a vector (y1, …, yn)
W is a matrix, b is a vector
Input layer (scalars): x1, …, xn, +1
Multinomial Two-Layer Network
Output layer: y = softmax(W2h + b2)
Hidden units: h = tanh(W1x + b1)   (the non-linearity could also be ReLU)
Input layer (vector x): x1, …, xn, +1
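A minimal PyTorch sketch of this multinomial two-layer network; the input, hidden, and output sizes are assumptions, not values from the slides:

import torch
import torch.nn as nn

# Sketch of the multinomial two-layer network above (assumed sizes).
n_in, n_hidden, n_classes = 6, 8, 3

W1 = nn.Linear(n_in, n_hidden)       # W1 and b1
W2 = nn.Linear(n_hidden, n_classes)  # W2 and b2

x = torch.randn(n_in)                # input vector x
h = torch.tanh(W1(x))                # h = tanh(W1x + b1); could also be ReLU
y = torch.softmax(W2(h), dim=-1)     # y = softmax(W2h + b2), a distribution over classes
print(y)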
How to represent a document in a neural network for sentiment classification?
Neural Language Model (Lecture 4) (J&M Chapter 7)
[Figure: the 4-gram neural language model from Lecture 4]
Neural Net Classification
with embeddings as input features!
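To make this concrete, a hypothetical sketch (not from the slides) of feeding the concatenated embeddings of a fixed 3-word window into a two-layer classifier; the vocabulary size, embedding dimension, and word indices are all made up:

import torch
import torch.nn as nn

# Hypothetical sketch: embeddings as input features to an MLP classifier.
vocab_size, d_e, n_words, n_hidden, n_classes = 1000, 5, 3, 8, 3

emb = nn.Embedding(vocab_size, d_e)          # lookup table of word embeddings
mlp = nn.Sequential(
    nn.Linear(n_words * d_e, n_hidden),      # W1, b1
    nn.Tanh(),
    nn.Linear(n_hidden, n_classes),          # W2, b2
)

word_ids = torch.tensor([12, 47, 305])       # three made-up word ids
x = emb(word_ids).flatten()                  # concatenate the 3 embeddings: 3 * d_e values
y = torch.softmax(mlp(x), dim=-1)            # class probabilities
print(y)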
Issue: Texts Come in Different Lengths
• This assumes a fixed input size (3 words)!
• Kind of unrealistic.
• Some simple solutions (more sophisticated solutions later):

A. 'dessert' is always positive
B. Make the input the length of the longest review
C. Create a single embedding for the whole document
D. Both B and C are possible and reasonable
Issue: Texts Come in Different Lengths
B. Make the input the length of the longest review
   • If shorter then pad with zero embeddings
   • Truncate if you get longer reviews at test time
C. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words
   • Take the mean of all the word embeddings, or
   • Take the element-wise max of all the word embeddings (for each dimension, pick the max value from all words)

These are simple solutions (more sophisticated solutions next week)
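For example, options B and C could be implemented roughly as follows (a sketch with assumed dimensions and made-up data, not code from the course):

import torch

# Sketch of the two simple solutions above (assumed dimensions, random data).
d_e, max_len = 4, 6                        # embedding dim and longest-review length (assumptions)
words = torch.randn(3, d_e)                # a 3-word review, one embedding per word

# Option B: pad with zero embeddings up to max_len (truncate if longer at test time)
padded = torch.zeros(max_len, d_e)
padded[: min(len(words), max_len)] = words[:max_len]

# Option C: a single "sentence embedding" with the same dimensionality as a word
mean_emb = words.mean(dim=0)               # average of all word embeddings
max_emb = words.max(dim=0).values          # element-wise max: per dimension, max over words

print(padded.shape, mean_emb.shape, max_emb.shape)  # (6, 4), (4,), (4,)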


Today
Text Classification
• MLP classifier
• CNN classifier
CNN Example: Predicting Sentiment of a Sentence
[Figure: the word embeddings w1, …, wn (each of dimension de) are fed into a CNN sentiment classifier, which compresses them into a vector of dimension d ≪ de ⋅ n and predicts Negative / Neutral / Positive]
(Task Specific) N-gram Detectors: Convolutional Neural Networks (CNNs)
• Identifies indicative local predictors in a large structure
   • Images: regions
   • Text: n-grams that are predictive for the task (e.g. sentiment analysis)
• Combines them to produce a fixed-size vector representation of the structure, capturing the local aspects that are most informative for the prediction task at hand.
CNN: feature-extracting architecture
Not a standalone, useful network on its own,
• but meant to be integrated into a larger network, and to be trained to work in tandem with it in order to produce an end result.
• The CNN layer's responsibility is to extract meaningful sub-structures that are useful for the overall prediction task at hand.
• When applied to images, the architecture uses 2D (grid) convolutions.
• When applied to text, we are mainly concerned with 1D (sequence) convolutions.
1D Convolution over Sentence
k = 3; each word is represented by its embedding (de = 2 here for simplicity)
[Figure: each window of k = 3 consecutive word embeddings is concatenated into a vector of size de ⋅ k = 6, multiplied by a filter u ∈ R^(6×1), and passed through a non-linear function, yielding one scalar value per window]
• A "filter" is applied to each k-word sliding window:
   • the input is multiplied by a vector u and a non-linear function is applied to the result.
   • The output is a scalar value for each window.
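Spelled out for a single window and a single filter (a sketch; the numbers are made up, only the shapes follow the slide):

import torch

# One filter applied to one k-word window (de = 2, k = 3, as on the slide).
de, k = 2, 3
window = torch.randn(k, de)                # embeddings of 3 consecutive words
u = torch.randn(de * k)                    # filter u ∈ R^6

x = window.flatten()                       # concatenate the window into a 6-dim vector
value = torch.tanh(x @ u)                  # x ⋅ u followed by a non-linearity -> one scalar
print(value.item())                        # sliding the window gives one scalar per position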
CNN: Convolution and Pooling in NLP
[Figure: each 6-dimensional window vector is multiplied by a filter matrix W; each column of W is a filter, so each window yields a 3-dimensional output]
• l such filters can be applied (l = 3 here), resulting in an l-dimensional vector (each dimension corresponding to a filter).
Multiple Filters
[Figure: three filters u1, u2, u3 applied to the window vector xi built from wi−1, wi, wi+1; e.g. the second output is tanh(xi ⋅ u2)]
CNN: Convolution and Pooling in NLP
A "pooling" operation combines the vectors resulting from the different windows into a single vector, by taking the max or the average value observed in each of the l dimensions over the different windows.
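A minimal PyTorch sketch of this convolution-and-pooling step; the dimensions and the use of nn.Conv1d are assumptions for illustration:

import torch
import torch.nn as nn

# Sketch of 1D convolution + max pooling over a sentence (assumed dimensions).
d_e, n_words, k, l = 2, 9, 3, 3            # embedding dim, sentence length, window size, #filters

words = torch.randn(1, d_e, n_words)       # (batch, channels=d_e, length=n_words)
conv = nn.Conv1d(d_e, l, kernel_size=k)    # l filters, each looking at k-word windows

windows = torch.tanh(conv(words))          # (1, l, n_words - k + 1): one l-dim vector per window
pooled = windows.max(dim=2).values         # max pooling over windows -> (1, l)
print(pooled.shape)                        # torch.Size([1, 3])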
Intuition
• The goal is to focus on the most important "features" in the sentence, regardless of their location.
• Each filter extracts a different indicator from the window.
• The pooling operation zooms in on the important indicators.
• The resulting l-dimensional vector is then fed further into a network that is used for prediction.
Training
• Gradients are back-propagated according to the loss on the task.
• The weights of the filter function W are trained to highlight the aspects of the data that are important for the task.
• Intuitively, when the sliding window of size k is run over a sequence, the filter function learns to identify informative k-grams.
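For concreteness, a rough sketch of such end-to-end training; the model sizes, data, and optimizer choices here are assumptions, not from the slides:

import torch
import torch.nn as nn

# Rough sketch of training the CNN classifier end-to-end (made-up data and sizes).
d_e, n_words, k, l, n_classes = 2, 9, 3, 3, 3

model = nn.Sequential(
    nn.Conv1d(d_e, l, kernel_size=k),      # filter weights W, learned by back-propagation
    nn.Tanh(),
    nn.AdaptiveMaxPool1d(1),               # max pooling over all windows
    nn.Flatten(),
    nn.Linear(l, n_classes),               # prediction layer on top of the pooled vector
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, d_e, n_words)           # a toy batch of 4 "sentences"
y = torch.randint(0, n_classes, (4,))      # toy sentiment labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)                # loss on the task
loss.backward()                            # gradients flow back into the filters
optimizer.step()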
Capturing k-grams of Varying Length
• Several convolutional layers may be applied in parallel.
• E.g. 4 convolutional layers, each with a different window size in the range 2, 3, 4, 5, capturing k-grams of varying lengths.

Example sentence: The quick brown fox jumped over the lazy dog
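A hypothetical sketch of parallel convolutions with window sizes 2 to 5 whose pooled outputs are concatenated; the dimensions are assumptions:

import torch
import torch.nn as nn

# Sketch: 4 parallel convolutional layers with window sizes 2, 3, 4, 5 (assumed dims).
d_e, n_words, l = 2, 9, 3
convs = nn.ModuleList([nn.Conv1d(d_e, l, kernel_size=k) for k in (2, 3, 4, 5)])

x = torch.randn(1, d_e, n_words)               # embeddings of one 9-word sentence
pooled = [torch.tanh(conv(x)).max(dim=2).values for conv in convs]
features = torch.cat(pooled, dim=1)            # one l-dim vector per window size, concatenated
print(features.shape)                          # torch.Size([1, 12]) -> fed to the prediction layer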
CNN Less Popular Today in NLP, but…
• Component in multimodal applications (vision & language)
• GCN - graph embeddings, useful in enhancing neural networks with structured knowledge from a knowledge base
• Similar ideas adopted in Longformer - Transformers for long documents
Summary
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with the MLP; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
For Next Time
• Optional Reading:
   • J&M: https://web.stanford.edu/~jurafsky/slp3/
   • Sequence labeling - Chapter 8 until section 8.4.3
• Quiz 2 due September 28
• Assignment 1 will be out September 28
