

Distributional vs. Distributed Semantics
CITS4012 Natural Language Processing

A/Prof. Wei Liu

wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia

August 11, 2021


What we are going to cover today


1 Why vectorising words?
2 Traditional Sparse Word Vectorisation
[see Embeddings in NLP, Chapter 1&3]
ASCII Character-Based Representation
One-Hot Word Encoding
3 Vector Space Models for Dense Representation
Intuitions behind Vector Space Models
Distributional vs. Distributed Semantics
Overview of Vector Space Modelling Approaches
4 Count-based Methods
Term-Document Matrix
Word-Context Matrix
Pair-Pattern Matrix
5 Take-Aways
[see Embeddings in NLP, Chapter 3]
[see Deep Learning with Pytorch, Page 1036-1042]

Why vectorising words?


What is an Embedding Anyway?


Word Embedding
An embedding is a continuous representation of an entity (a word, in
this case), and each one of its dimensions can be seen as an attribute
or feature.
Let’s forget about words for a moment and talk about restaurants
instead.

Discrete Representation of Restaurant Reviews


Although it’s fairly obvious to spot the similarities and differences
among the restaurants in the table above,
it wouldn’t be so easy if there were dozens of dimensions
to compare; and
it is not objective.

Continuous Representation of Restaurant Reviews



Similarity Results


Calculating Similarities and Distances


import torch
from torch.nn import functional as F

# df holds the restaurant ratings: one row per restaurant, one column per attribute
ratings = torch.as_tensor(df.values).float()

# pairwise Manhattan (p=1) and Euclidean (p=2) distances between restaurants
manhattan_dist = torch.cdist(ratings, ratings, p=1)
euclidean_dist = torch.cdist(ratings, ratings, p=2)

dim = ratings.shape  # torch.Size([4, 3])
nrows = dim[0]       # 4

# pairwise cosine similarities between restaurants
cos_sims = torch.zeros(nrows, nrows)
for i in range(nrows):
    for j in range(nrows):
        cos_sims[i, j] = F.cosine_similarity(ratings[i],
                                             ratings[j],
                                             dim=0)
cos_sims

code/dist.py
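The snippet above assumes df already holds the restaurant ratings from the earlier slides. A minimal sketch of how such a DataFrame might be built (the attribute names and values below are invented for illustration, not the slide data):

import pandas as pd

# hypothetical ratings: one row per restaurant, one column per attribute
df = pd.DataFrame(
    {"price":   [2.0, 4.5, 3.0, 1.5],
     "taste":   [4.0, 5.0, 3.5, 2.0],
     "service": [3.0, 4.0, 4.5, 2.5]},
    index=["Restaurant 1", "Restaurant 2", "Restaurant 3", "Restaurant 4"])

Four restaurants with three attributes each gives the shape torch.Size([4, 3]) noted in the code above.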


Plotting using matplotlib

import matplotlib.pyplot as plt
import numpy as np

labels = ('Restaurant 1', 'Restaurant 2',
          'Restaurant 3', 'Restaurant 4')
plt.figure()
plt.axes([0, 0, 1, 1])
plt.imshow(cos_sims, interpolation='nearest',
           cmap=plt.cm.gnuplot2, vmin=-1)
plt.xticks(range(nrows), labels, rotation=45)
plt.yticks(range(nrows), labels)
plt.colorbar()
plt.tight_layout()
plt.show()

code/plot_sim.py


Embeddings → Similarities

We can compute pairwise distances and cosine similarities
between any two restaurants, as in the code above.
Imagine we can represent each word as a vector, i.e. an
embedding.
The values in the table above are not real embeddings; they are
only an example that illustrates the concept of embedding
dimensions as attributes.
The dimensions of word embeddings learned by models
such as Word2Vec do not have clear-cut meanings like
the restaurant example.
On the bright side, though, it is possible to do arithmetic with
word embeddings!


KING - MAN + WOMAN = ?
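A minimal sketch of this kind of embedding arithmetic, assuming gensim and one of its downloadable pretrained vector sets (the model name below is an assumption; any pretrained KeyedVectors would do):

import gensim.downloader as api

# load small pretrained GloVe vectors (downloads on first use)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman is expected to land near "queen"
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))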


Traditional Sparse Word Vectorisation


ASCII Representation of Words

Computers only understand zeros and ones!

To represent the word “desk” or “table", we need to store each charac-


ter as a pattern of bits.
ASCII encoding for:

desk: 01100100 01100101 01110011 01101011


table: 01110100 01100001 01100010 01101100 01100101
Shortcomings:
1 The ASCII encoding does not represent meaning. Semantically
similar words are encoded very differently.
2 The representation is character-wise, so its size depends on the
length of the word. Variable-sized representations cannot be
readily used for machine learning (see the sketch below).
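The bit patterns above can be reproduced with a short Python sketch, which also makes the variable-length problem visible:

def ascii_bits(word):
    # each character becomes one 8-bit pattern
    return " ".join(format(ord(ch), "08b") for ch in word)

for w in ("desk", "table"):
    print(w, "->", ascii_bits(w), f"({len(w)} bytes)")  # 4 bytes vs 5 bytes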

One-Hot Encoding

One-Hot Encoding
Each unique token is represented by a vector full of zeros except for
one position, the position corresponding to the token’s index.

Useful if we only have a vocabulary of words.
Addresses the problem of variable size.
The vectors are pairwise “orthogonal”, so every pair of words is
equally dissimilar: the vectors do not represent meaning.
The representation is large and sparse. A typical English
vocabulary is about 100,000 words.

Examples (figures): a toy vocabulary of 5 words; a slightly bigger
vocabulary of 3704 words; the same vocabulary with special tokens.
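A minimal sketch of one-hot encoding over a toy vocabulary (the five words below are invented for illustration):

import torch

vocab = ["time", "fruit", "flies", "like", "arrow"]   # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # all zeros except a single 1 at the word's index
    vec = torch.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("flies"))  # tensor([0., 0., 1., 0., 0.])

With a realistic vocabulary of around 100,000 words, each vector would have 100,000 entries with a single 1, which is why the representation is called large and sparse.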

Vector Space Models for Dense Representation


Vector Space Models


Vector Space Model (VSM), first proposed by Salton et al. [1975],
provides a solution to the limitations of one-hot representation.
Vector Space Model
In VSM, objects are represented as vectors in an imaginary
multi-dimensional continuous space. In NLP, the space is usually
referred to as the semantic space and the representations of the
objects are called distributed representations.

Objects can be words, documents, sentences, concepts, entities, or
any other meaning-carrying items between which we can define a
notion of similarity.


One-Hot vs. Vector Space


One-hot representation can be viewed as a degenerate vector space
representation:
each word is represented as a vector along one of the axes
of the semantic space;
the semantic space needs to have n dimensions, where n is the
number of words in the vocabulary.
Moving from the local and discrete nature of one-hot representation
to distributed and continuous vector spaces brings multiple
advantages:
Distributed representation introduces the notion of similarity: the
similarity of two words (vectors) can be measured by their
distance in the space.
Many more words can fit into a low-dimensional space; hence, it
can potentially address the size issue of one-hot encoding: a
large vocabulary of size m can fit in an n-dimensional vector
space, where n ≪ m.


How to construct a word Vector Space Model


The distributional hypothesis [Harris, 1954, Firth, 1957] states that:

Words that occur in the same contexts tend to have similar
meanings.

You shall know a word by the company it keeps.

Here is a word that you may not know: tezgüino

D1) A bottle of ____ is on the table.
D2) Everybody likes ____.
D3) Don’t have ____ before you drive.
D4) We make ____ out of corn.

The distributional hypothesis is the foundation for automatically
constructing word VSMs to obtain distributed representations.


Distributional semantics vs. Distributed Semantics


Distributional semantics are computed from context statistics.
Distributed semantics are a related but distinct idea: that meaning
can be represented by numerical vectors rather than symbolic
structures.
Distributed representations are often estimated from distributional
statistics, as in latent semantic analysis and Word2Vec.

D1 D2 D3 D4 ...
tezgüino 1 1 1 1
loud 0 0 0 0
motor oil 1 0 0 1
tortillas 0 1 0 1
choices 0 1 0 0
wine 1 1 1 0
... ... ... ... ...
Table: Term-Document Matrix


Overview of VSM Approaches- BoW

The interpretation of the distributional hypothesis, and the way
of collecting “similarity” clues and constructing the space, have
undergone enormous changes.

Earlier approaches were count-based techniques that collect word
statistics, in terms of occurrence and co-occurrence frequency.
These representations are often large and need some sort of
dimensionality reduction.
Count-based approaches are commonly known as
“Bag-of-Words” (BoW) models because any information about
the order or structure of words in the document is discarded.
A BoW model is only concerned with whether known words occur
in the document, not where in the document.


Overview of VSM Approaches - Neural Methods

The deep learning tsunami hit the shores of NLP around 2011.
Word Embeddings: Word2Vec was one of the massive waves
from this tsunami and once again accelerated research in
semantic representation.
Despite not being “deep”, the model was a very efficient way of
constructing compact vector representations, by leveraging (shallow)
neural networks.
Since then, the term “embedding” has almost replaced
“representation” and dominated the field of lexical semantics.
But it is static in nature and captures only the most popular meaning
of words, e.g. mouse only has one embedding.
Contextualised representation allows the embedding to adapt
itself to the context.
BERT embeddings and their variants are dominating now.


Count-based Methods


Count-based Methods

The general idea in count-based models is to construct a matrix
based on word frequencies.

Count-based models can be categorised into three general classes
based on the matrices used:
Term-Document matrix;
Word-Context matrix;
Pair-Pattern matrix.


Term Document Matrix


In this matrix, rows correspond to words and columns to documents.
Each cell denotes the frequency of a specific word in a given
document.
Two documents with similar patterns of numbers (similar columns)
are deemed to have similar topics.
The term-document model is document-centric; it is usually used for
document retrieval, classification, or similar document-based
purposes.
The value in each cell can be
binary: 0 for not present, 1 for present;
raw count: the number of times the term occurred in the
document;
frequency: the raw count of the word out of the total number of
words in that document.

BoW Example
Step 1: Collect Data
It was the best of times.
it was the worst of times.
it was the age of wisdom.
it was the age of foolishness.
Step 2: Design the vocabulary, i.e. preprocess
Step 3: Create Term-Document Matrix
            D1  D2  D3  D4
it           1   1   1   1
was          1   1   1   1
the          1   1   1   1
of           1   1   1   1
best         1   0   0   0
worst        0   1   0   0
age          0   0   1   1
times        1   1   0   0
wisdom       0   0   1   0
foolishness  0   0   0   1
Table: Term-Document Raw Count Matrix
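A term-document count matrix like the one above can be reproduced in a few lines, assuming a recent scikit-learn is available (CountVectorizer lower-cases and tokenises by default, so the vocabulary comes out in alphabetical order rather than the order shown in the table):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # documents x terms, sparse
print(vectorizer.get_feature_names_out())     # the vocabulary
print(X.toarray().T)                          # terms x documents, as in the table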

Scoring a document’s relevance


Raw word frequency (a.k.a. term frequency) suffers from a
critical problem in the information retrieval context: all terms
are considered equally important when it comes to assessing
relevance to a query.

For example, a collection of documents in “Computer Science” will
have the word “computing” in almost every document.

Certain terms have little or no discriminating power in
determining relevance.

To discriminate between documents, it is better to use a
document-level statistic (such as the number of documents containing
a term) than to use a collection-wide statistic for the term (CF).

Document Frequency
Document Frequency (df_t) is defined to be the number of documents
in the collection that contain a term t.

IDF - Inverse Document Frequency

We want words that occur frequently in a document, but not across
the entire document collection, to receive a boost in score. In other
words, a higher document frequency counts against the ranking of a
word.

Inverse Document Frequency

Denoting as usual the total number of documents in a collection by N,
we define the inverse document frequency (idf) of a term t as follows:

    idf_t = log2(N / df_t)

Word     CF  DF  IDF  tf_D1  tf-idf_D1
it        4   4    0      1          0
wisdom    1   1    2      1          2
Table: Example of IDF values. CF is the frequency of the word in the
entire collection.


TF-IDF

The tf-idf weighting scheme assigns to term t a weight in document d
given by

    tf-idf_{t,d} = tf_{t,d} × idf_t

In other words, tf-idf_{t,d} assigns to term t a weight in document d
that is
highest when t occurs many times within a small number of
documents (thus lending high discriminating power to those
documents);
lower when the term occurs fewer times in a document, or occurs
in many documents (thus offering a less pronounced relevance
signal);
lowest when the term occurs in virtually all documents.
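A minimal worked sketch of this weighting for the four example documents, following the lecture's definitions of tf, df and idf (raw counts, base-2 log; library implementations such as scikit-learn use smoothed variants):

import math
from collections import Counter

docs = ["it was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]
tokenised = [d.split() for d in docs]
N = len(tokenised)

# document frequency: number of documents containing each term
doc_freq = Counter(term for doc in tokenised for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)            # raw term frequency in the document
    idf = math.log2(N / doc_freq[term])    # idf_t = log2(N / df_t)
    return tf * idf

print(tf_idf("it", tokenised[0]))      # 1 * log2(4/4) = 0.0
print(tf_idf("wisdom", tokenised[2]))  # 1 * log2(4/1) = 2.0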


Word-Context Matrix

Word-context. Unlike the term-document matrix, which focuses on
document representation, the word-context matrix aims at
representing words.
Contexts can span from neighboring words to windows of words,
grammatical dependencies, selectional preferences, or whole
documents.
This enables different tasks, such as word similarity measurement,
word sense disambiguation, semantic role labeling, and query
expansion.


Word Co-occurrence Matrix

             it  was  the  of  best  worst  age  times  wisdom  foolishness
it            0    4    4   4     1      1    1      2       1            1
was           4    0    4   4     1      1    1      2       1            1
the           4    4    0   4     1      1    1      2       1            1
of            4    4    4   0     1      1    1      2       1            1
best          1    1    1   1     0      0    0      1       0            0
worst         1    1    1   1     0      0    0      1       0            0
age           1    1    1   1     0      0    0      0       1            1
times         2    2    2   2     1      1    0      0       0            0
wisdom        1    1    1   1     0      0    1      0       0            0
foolishness   1    1    1   1     0      0    1      0       0            0
Table: Word Co-occurrence Matrix


Code Example: Constructing a Word Co-occurrence Matrix

from collections import defaultdict
import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens in the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i + 1 : i + 1 + window_size]
            for t in next_token:
                key = tuple(sorted([t, token]))
                d[key] += 1

    # formulate the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]
df = co_occurrence(text, 20)

code/word–coocur.py


Similarity from Word Co-occurrence Matrix

Cosine similarity between words
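A minimal sketch of how such pairwise word similarities could be computed from the co-occurrence DataFrame df built in the previous code example (using scikit-learn's cosine_similarity as one convenient option):

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# each row of the co-occurrence matrix is treated as a word vector
sims = pd.DataFrame(cosine_similarity(df.values),
                    index=df.index, columns=df.index)

print(sims.loc["wisdom", "foolishness"])  # words appearing in similar contexts score highly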


Point-wise Mutual Information


Raw frequencies do not provide a reliable measure of association.
A “stop word” such as “the” can frequently co-occur with a given
word, but this co-occurrence does not necessarily reflect a
semantic relationship since it is not discriminative.
It is more desirable to have a measure that can incorporate the
informativeness of a co-occurring words.

Pointwise Mutual Information (PMI) [Church and Hanks, 1990]

PMI normalizes the probability of the co-occurrence of two words by
their individual occurrence probabilities:

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) )

PMI is calculated from probabilities, where
P(x) is an estimate of the probability of word x, which can be
directly computed based on its frequency in a given corpus, and
P(w1, w2) is the estimated probability that w1 and w2
co-occur in a corpus.

Positive Pointwise Mutual Information (PPMI)

PMI checks whether w1 and w2 co-occur more often than they would
if they occurred independently.

A stop word has a high P(w) value, resulting in a reduced overall
PMI value.
PMI values can range from −∞ to +∞.
Negative values indicate a co-occurrence which is less likely to
happen than by chance.
Given that these associations are computed from highly sparse
data, and that negative values are not easily interpretable
(it is hard to define what it means for two words to be very
unrelated),
we usually ignore negative values and replace them with 0, hence
Positive PMI (PPMI), as sketched below.
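A minimal sketch of PPMI computed from a word-by-word co-occurrence count matrix such as the df built earlier (a simple estimate that divides by the total count; corpus-level probability estimates vary across implementations):

import numpy as np

def ppmi(counts):
    # counts: square word-by-word co-occurrence count matrix (numpy array)
    total = counts.sum()
    p_xy = counts / total                              # joint probabilities
    p_x = counts.sum(axis=1, keepdims=True) / total    # row marginals
    p_y = counts.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero-count cells -> 0
    return np.maximum(pmi, 0.0)                        # clip negatives: PPMI

# usage with the earlier co-occurrence DataFrame:
# ppmi_matrix = ppmi(df.values.astype(float))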


Pair-Pattern Matrix

Pair-pattern. Rows correspond to pairs of words and columns to the
patterns in which the two have occurred.
The matrix is suitable for measuring relational similarity: the
similarity of semantic relations between pairs of words, e.g.,
linux:grep and windows:findstr.
Extended distributional hypothesis: patterns that co-occur with
similar pairs (contexts) tend to have similar meanings.


Limitations of BoW
The bag-of-words model is very simple to understand and implement,
and offers a lot of flexibility for customization on your specific text
data. It has been used with great success on prediction problems such
as language modeling and document classification.
Nevertheless, it suffers from some shortcomings, such as:
Vocabulary: The vocabulary requires careful design, most
specifically in order to manage the size, which impacts the
sparsity of the document representations.
Sparsity: Sparse representations are harder to model, both for
computational reasons (space and time complexity) and for
information reasons: the challenge is for the models to
harness so little information in such a large representational
space.
Meaning: Discarding word order ignores the context, and in turn
the meaning of words in the document (semantics). Context and
meaning can offer a lot to the model; if modeled, they could tell the
difference between the same words arranged differently (‘this is
interesting’ vs ‘is this interesting’), synonyms (‘old bike’ vs ‘used
bike’), and much more.

Take-Aways


Take-Aways

one-hot encoding
vectorisation
count-based methods (tf-idf, PMI, PPMI)


References

[1] Daniel Voigt Godoy. Deep Learning with PyTorch Step-by-Step:
A Beginner’s Guide. Published at http://leanpub.com/pytorch, 2021.

[2] Mohammad Taher Pilehvar and José Camacho-Collados. Embeddings
in Natural Language Processing: Theory and Advances in Vector
Representations of Meaning. Synthesis Lectures on Human Language
Technologies. Morgan & Claypool Publishers, 2020.
doi: 10.2200/S01057ED1V01Y202009HLT047.
URL https://doi.org/10.2200/S01057ED1V01Y202009HLT047.
