Natural Language Processing
CPSC 436N, Term 1
Lecture 6: Text Classification 2 - Neural Models

Instructor: Vered Shwartz
https://www.cs.ubc.ca/~vshwartz

[Figure: example documents sorted into categories such as Technology, Sports, and Politics]
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with the MLP; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Today
Text Classification
• MLP classifier
• CNN classifier
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in the layers!)
Output layer (σ node): y = σ(w ⋅ x + b), where y is a scalar
Weights: w = [w1, …, wn] (a vector), bias b (a scalar)
Input layer: x1, …, xn, +1 (the vector x)
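As a minimal sketch (not from the slides), the same 1-layer view in PyTorch; the input dimension n and the use of nn.Linear are illustrative assumptions:

import torch
import torch.nn as nn

# Minimal sketch: binary logistic regression as a 1-layer network.
# n is an assumed number of input features (not specified on the slide).
n = 4
layer = nn.Linear(n, 1)              # computes w ⋅ x + b (weight vector w, scalar bias b)

x = torch.randn(n)                   # an example input vector x
y = torch.sigmoid(layer(x))          # y = σ(w ⋅ x + b), a scalar in (0, 1)
print(y.item())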
Two-Layer Network with Scalar Output
Output layer: y = σ(W2h + b2), where y is a scalar
Hidden units: h = tanh(W1x + b1)   (the non-linearity could also be ReLU)
Input layer (vector x): x1, …, xn, +1
Multinomial Logistic Regression as a 1-layer Network
Output layer (softmax nodes): y = softmax(Wx + b), where y is a vector (y1, …, yn)
W is a matrix, b is a vector
Input layer (scalars): x1, …, xn, +1
Multinomial Two-Layer Network
Output layer: y = softmax(W2h + b2)
Hidden units: h = tanh(W1x + b1)   (the non-linearity could also be ReLU)
Input layer (vector x): x1, …, xn, +1
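A minimal PyTorch sketch of this multinomial two-layer network; the input, hidden, and output sizes are assumptions, not values from the slides:

import torch
import torch.nn as nn

# Sketch of the multinomial two-layer network above (assumed sizes).
n_in, n_hidden, n_classes = 6, 8, 3

W1 = nn.Linear(n_in, n_hidden)       # W1 and b1
W2 = nn.Linear(n_hidden, n_classes)  # W2 and b2

x = torch.randn(n_in)                # input vector x
h = torch.tanh(W1(x))                # h = tanh(W1x + b1); could also be ReLU
y = torch.softmax(W2(h), dim=-1)     # y = softmax(W2h + b2), a distribution over classes
print(y)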
How to represent a document in a neural network for sentiment classification?
Neural Language Model (Lecture 4) (J&M Chapter 7)
[Figure: the 4-gram neural language model from Lecture 4]
Neural Net Classification
with embeddings as input features!
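To make this concrete, a hypothetical sketch (not from the slides) of feeding the concatenated embeddings of a fixed 3-word window into a two-layer classifier; the vocabulary size, embedding dimension, and word indices are all made up:

import torch
import torch.nn as nn

# Hypothetical sketch: embeddings as input features to an MLP classifier.
vocab_size, d_e, n_words, n_hidden, n_classes = 1000, 5, 3, 8, 3

emb = nn.Embedding(vocab_size, d_e)          # lookup table of word embeddings
mlp = nn.Sequential(
    nn.Linear(n_words * d_e, n_hidden),      # W1, b1
    nn.Tanh(),
    nn.Linear(n_hidden, n_classes),          # W2, b2
)

word_ids = torch.tensor([12, 47, 305])       # three made-up word ids
x = emb(word_ids).flatten()                  # concatenate the 3 embeddings: 3 * d_e values
y = torch.softmax(mlp(x), dim=-1)            # class probabilities
print(y)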
Issue: Texts Come in Different Lengths
• This assumes a fixed input size (3 words)!
• Kind of unrealistic.
• Some simple solutions (more sophisticated solutions later):

A. 'dessert' is always positive
B. Make the input the length of the longest review
C. Create a single embedding for the whole document
D. Both B and C are possible and reasonable
Issue: Texts Come in Different Lengths
B. Make the input the length of the longest review
   • If shorter then pad with zero embeddings
   • Truncate if you get longer reviews at test time
C. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words
   • Take the mean of all the word embeddings, or
   • Take the element-wise max of all the word embeddings (for each dimension, pick the max value from all words)

These are simple solutions (more sophisticated solutions next week)
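For example, options B and C could be implemented roughly as follows (a sketch with assumed dimensions and made-up data, not code from the course):

import torch

# Sketch of the two simple solutions above (assumed dimensions, random data).
d_e, max_len = 4, 6                        # embedding dim and longest-review length (assumptions)
words = torch.randn(3, d_e)                # a 3-word review, one embedding per word

# Option B: pad with zero embeddings up to max_len (truncate if longer at test time)
padded = torch.zeros(max_len, d_e)
padded[: min(len(words), max_len)] = words[:max_len]

# Option C: a single "sentence embedding" with the same dimensionality as a word
mean_emb = words.mean(dim=0)               # average of all word embeddings
max_emb = words.max(dim=0).values          # element-wise max: per dimension, max over words

print(padded.shape, mean_emb.shape, max_emb.shape)  # (6, 4), (4,), (4,)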


Today
Text Classification
• MLP classifier
• CNN classifier
CNN Example: Predicting Sentiment of a Sentence
[Figure: the word embeddings w1, …, wn (each of dimension de) are fed into a CNN sentiment classifier, which compresses them into a vector of dimension d ≪ de ⋅ n and predicts Negative / Neutral / Positive]
(Task Specific) N-gram Detectors: Convolutional Neural Networks (CNNs)
• Identifies indicative local predictors in a large structure
   • Images: regions
   • Text: n-grams that are predictive for the task (e.g. sentiment analysis)
• Combines them to produce a fixed-size vector representation of the structure, capturing the local aspects that are most informative for the prediction task at hand.
CNN: feature-extracting architecture
Not a standalone, useful network on its own,
• but meant to be integrated into a larger network, and to be trained to work in tandem with it in order to produce an end result.
• The CNN layer's responsibility is to extract meaningful sub-structures that are useful for the overall prediction task at hand.
• When applied to images, the architecture uses 2D (grid) convolutions.
• When applied to text, we are mainly concerned with 1D (sequence) convolutions.
1D Convolution over Sentence
k = 3; each word is represented by its embedding (de = 2 here for simplicity)
[Figure: each window of k = 3 consecutive word embeddings is concatenated into a vector of size de ⋅ k = 6, multiplied by a filter u ∈ R^(6×1), and passed through a non-linear function, yielding one scalar value per window]
• A "filter" is applied to each k-word sliding window:
   • the input is multiplied by a vector u and a non-linear function is applied to the result.
   • The output is a scalar value for each window.
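Spelled out for a single window and a single filter (a sketch; the numbers are made up, only the shapes follow the slide):

import torch

# One filter applied to one k-word window (de = 2, k = 3, as on the slide).
de, k = 2, 3
window = torch.randn(k, de)                # embeddings of 3 consecutive words
u = torch.randn(de * k)                    # filter u ∈ R^6

x = window.flatten()                       # concatenate the window into a 6-dim vector
value = torch.tanh(x @ u)                  # x ⋅ u followed by a non-linearity -> one scalar
print(value.item())                        # sliding the window gives one scalar per position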
CNN: Convolution and Pooling in NLP
[Figure: each 6-dimensional window vector is multiplied by a filter matrix W; each column of W is a filter, so each window yields a 3-dimensional output]
• l such filters can be applied (l = 3 here), resulting in an l-dimensional vector (each dimension corresponding to a filter).
Multiple Filters
[Figure: three filters u1, u2, u3 applied to the window vector xi built from wi−1, wi, wi+1; e.g. the second output is tanh(xi ⋅ u2)]
CNN: Convolution and Pooling in NLP
A "pooling" operation combines the vectors resulting from the different windows into a single vector, by taking the max or the average value observed in each of the l dimensions over the different windows.
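A minimal PyTorch sketch of this convolution-and-pooling step; the dimensions and the use of nn.Conv1d are assumptions for illustration:

import torch
import torch.nn as nn

# Sketch of 1D convolution + max pooling over a sentence (assumed dimensions).
d_e, n_words, k, l = 2, 9, 3, 3            # embedding dim, sentence length, window size, #filters

words = torch.randn(1, d_e, n_words)       # (batch, channels=d_e, length=n_words)
conv = nn.Conv1d(d_e, l, kernel_size=k)    # l filters, each looking at k-word windows

windows = torch.tanh(conv(words))          # (1, l, n_words - k + 1): one l-dim vector per window
pooled = windows.max(dim=2).values         # max pooling over windows -> (1, l)
print(pooled.shape)                        # torch.Size([1, 3])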
Intuition
• The goal is to focus on the most important "features" in the sentence, regardless of their location.
• Each filter extracts a different indicator from the window.
• The pooling operation zooms in on the important indicators.
• The resulting l-dimensional vector is then fed further into a network that is used for prediction.
Training
• Gradients are back-propagated according to the loss on the task.
• The weights of the filter function W are trained to highlight the aspects of the data that are important for the task.
• Intuitively, when the sliding window of size k is run over a sequence, the filter function learns to identify informative k-grams.
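For concreteness, a rough sketch of such end-to-end training; the model sizes, data, and optimizer choices here are assumptions, not from the slides:

import torch
import torch.nn as nn

# Rough sketch of training the CNN classifier end-to-end (made-up data and sizes).
d_e, n_words, k, l, n_classes = 2, 9, 3, 3, 3

model = nn.Sequential(
    nn.Conv1d(d_e, l, kernel_size=k),      # filter weights W, learned by back-propagation
    nn.Tanh(),
    nn.AdaptiveMaxPool1d(1),               # max pooling over all windows
    nn.Flatten(),
    nn.Linear(l, n_classes),               # prediction layer on top of the pooled vector
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, d_e, n_words)           # a toy batch of 4 "sentences"
y = torch.randint(0, n_classes, (4,))      # toy sentiment labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)                # loss on the task
loss.backward()                            # gradients flow back into the filters
optimizer.step()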
Capturing k-grams of Varying Length
• Several convolutional layers may be applied in parallel.
• E.g. 4 convolutional layers, each with a different window size in the range 2, 3, 4, 5, capturing k-grams of varying lengths.

Example sentence: The quick brown fox jumped over the lazy dog
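A hypothetical sketch of parallel convolutions with window sizes 2 to 5 whose pooled outputs are concatenated; the dimensions are assumptions:

import torch
import torch.nn as nn

# Sketch: 4 parallel convolutional layers with window sizes 2, 3, 4, 5 (assumed dims).
d_e, n_words, l = 2, 9, 3
convs = nn.ModuleList([nn.Conv1d(d_e, l, kernel_size=k) for k in (2, 3, 4, 5)])

x = torch.randn(1, d_e, n_words)               # embeddings of one 9-word sentence
pooled = [torch.tanh(conv(x)).max(dim=2).values for conv in convs]
features = torch.cat(pooled, dim=1)            # one l-dim vector per window size, concatenated
print(features.shape)                          # torch.Size([1, 12]) -> fed to the prediction layer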
CNN Less Popular Today in NLP, but…
• Component in multimodal applications (vision & language)
• GCN - graph embeddings, useful in enhancing neural networks with structured knowledge from a knowledge base
• Similar ideas adopted in Longformer - Transformers for long documents
Summary
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with the MLP; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
For Next Time
• Optional Reading:
   • J&M: https://web.stanford.edu/~jurafsky/slp3/
   • Sequence labeling - Chapter 8 until section 8.4.3
• Quiz 2 due September 28
• Assignment 1 will be out September 28
