
Road Map

Part – III
• Support Vector Machine
• Vector Space Model
• Latent Semantic Analysis
SVM (Concept)
Linear Classifiers
f(x, w, b) = sign(w x + b)
(Figure: a scatter of labeled points; "+" denotes the +1 class in the region w x + b > 0, "-" denotes the -1 class in the region w x + b < 0, and the line w x + b = 0 is a candidate decision boundary.)
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w x + b)
(Figure: the same data shown with other candidate separating lines.)
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w x + b)
Any of these separating lines would be fine...
...but which is best?
Linear Classifiers
f(x, w, b) = sign(w x + b)
How would you classify this data?
(Figure: one candidate line misclassifies a point into the +1 class.)
Classifier Margin
f(x, w, b) = sign(w x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w x + b)
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.
Support vectors are those datapoints that the margin pushes up against.
The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a Linear SVM (LSVM).
Linear SVM Mathematically
(Figure: the "Predict Class = +1" zone lies beyond the plus-plane w x + b = +1 (through the point x+), the "Predict Class = -1" zone lies beyond the minus-plane w x + b = -1 (through x-), the decision boundary is w x + b = 0, and M = margin width.)
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• w . (x+ - x-) = 2

    M = (x+ - x-) . w / ||w|| = 2 / ||w||
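A short worked derivation of the margin width, filling in the step between the three facts above (a sketch in standard notation):

```latex
% Subtract the minus-plane condition from the plus-plane condition:
%   (w . x+ + b) - (w . x- + b) = (+1) - (-1)   =>   w . (x+ - x-) = 2
% The margin M is the component of (x+ - x-) along the unit normal w/||w||:
\[
  M \;=\; (x^{+} - x^{-}) \cdot \frac{w}{\lVert w \rVert}
    \;=\; \frac{w \cdot (x^{+} - x^{-})}{\lVert w \rVert}
    \;=\; \frac{2}{\lVert w \rVert}
\]
```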
Linear SVM Mathematically
• Goal: 1) Correctly classify all training data:
    w xi + b ≥ +1 if yi = +1
    w xi + b ≤ -1 if yi = -1
    i.e. yi (w xi + b) ≥ 1 for all i
  2) Maximize the margin M = 2 / ||w||, which is the same as minimizing (1/2) wTw.
• We can formulate a Quadratic Optimization Problem and solve for w and b:
    Minimize Φ(w) = (1/2) wTw
    subject to yi (w xi + b) ≥ 1 for all i
Solving the Optimization Problem
Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
 Need to optimize a quadratic function subject to linear constraints.
 Quadratic optimization problems are a well-known class of
mathematical programming problems, and many (rather intricate)
algorithms exist for solving them.
 The solution involves constructing a dual problem where a Lagrange
multiplier αi is associated with every constraint in the primal
problem:

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
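As an illustration, here is a minimal sketch that hands this dual to a generic quadratic-programming solver (it assumes numpy and cvxopt are installed; the toy data X, y are made up for the example and are not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data: rows of X are points, labels y in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[0]

# Dual: maximize sum(a) - 1/2 a^T P a  with  P_ij = y_i y_j (x_i . x_j).
# cvxopt minimizes 1/2 a^T P a + q^T a  subject to  G a <= h  and  A a = b.
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(n))  # tiny ridge for numerical stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))        # encodes alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))  # encodes sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alphas = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
print("alphas:", np.round(alphas, 4))  # the non-zero entries mark the support vectors
```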
The Optimization Problem Solution
 The solution has the form:
w = Σαiyixi      b = yk - wTxk for any xk such that αk ≠ 0
 Each non-zero αi indicates that corresponding xi is a support
vector.
 Then the classifying function will have the form:
f(x) = ΣαiyixiTx + b
 Notice that it relies on an inner product between the test point
x and the support vectors xi – we will return to this later.
 Also keep in mind that solving the optimization problem
involved computing the inner products xiTxj between all pairs
of training points.
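A minimal sketch (assuming scikit-learn) that checks these formulas on toy data: the weight vector rebuilt from the dual coefficients matches the fitted one, and only the support vectors carry non-zero α:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # a very large C approximates the hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print(w_from_alphas, clf.coef_)               # the two agree up to numerical error
print("b =", clf.intercept_)
print("support vectors:\n", clf.support_vectors_)
```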
Dataset with noise
 Hard margin: so far we have required all data points to be classified correctly - no training error.
 What if the training set is noisy?
- Solution 1: use very powerful kernels ... which leads to OVERFITTING!
Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
What should our quadratic optimization criterion be?

    Minimize (1/2) w.w + C Σk=1..R ξk

(Figure: points with slack ξ2, ξ7, ξ11 lie on the wrong side of their margin planes wx + b = +1, wx + b = 0, wx + b = -1.)
Hard Margin vs. Soft Margin
 The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

 The new formulation incorporating slack variables:

Find w and b such that


Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i
 Parameter C can be viewed as a way to control overfitting.
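A minimal sketch (assuming scikit-learn) of how C trades margin width against slack on noisy, overlapping data; the synthetic data is only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
               rng.normal(+1.0, 1.2, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)       # margin width = 2 / ||w||
    print(f"C={C:>6}: margin width = {margin:.2f}, "
          f"support vectors = {len(clf.support_)}")
```
Smaller C tolerates more slack and typically yields a wider margin with more support vectors; larger C penalizes slack and narrows the margin.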
Linear SVMs: Overview
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they define the
hyperplane.
 Quadratic optimization algorithms can identify which training points
xi are support vectors with non-zero Lagrangian multipliers αi.
 Both in the dual formulation of the problem and in the solution
training points appear only inside dot products:

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b
Non-linear SVMs
 Datasets that are linearly separable with some noise work out great.
 But what are we going to do if the dataset is just too hard?
 How about mapping the data to a higher-dimensional space?
(Figure: points on a 1-D axis x that no single threshold can separate become linearly separable after mapping each point x to (x, x²).)
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be
mapped to some higher-dimensional feature space where
the training set is separable:

Φ: x → φ(x)
The “Kernel Trick”
 The linear classifier relies on dot product between vectors
K(xi,xj)=xiTxj
 If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
 A kernel function is some function that corresponds to an inner
product in some expanded feature space.
 Example:
2-dimensional vectors x=[x1 x2]; let K(xi,xj)=(1 + xiTxj)2,
Need to show that K(xi,xj)= φ(xi) Tφ(xj):
K(xi,xj)=(1 + xiTxj)2,
= 1+ xi12xj12 + 2 xi1xj1 xi2xj2+ xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1 xi12 √2 xi1xi2 xi22 √2xi1 √2xi2]T [1 xj12 √2 xj1xj2 xj22 √2xj1 √2xj2]
= φ(xi) Tφ(xj), where φ(x) = [1 x12 √2 x1x2 x22 √2x1 √2x2]
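A quick numerical check of this identity (a sketch using numpy; the vectors x and z are arbitrary):

```python
import numpy as np

def phi(v):
    # The explicit 6-dimensional feature map from the example above.
    x1, x2 = v
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = (1 + x @ z) ** 2       # kernel evaluated in the original 2-D space
k_mapped = phi(x) @ phi(z)        # inner product in the expanded feature space
print(k_direct, k_mapped)         # both are 4.0 for these vectors
```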
What Functions are Kernels?
 For some functions K(xi,xj) checking that
K(xi,xj)= φ(xi) Tφ(xj) can be cumbersome.
 Mercer’s theorem:
Every positive semi-definite symmetric function is a kernel
 Positive semi-definite symmetric functions correspond to a
positive semi-definite symmetric Gram matrix:

K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xN)


K(x2,x1) K(x2,x2) K(x2,x3) K(x2,xN)
K=
… … … … …
K(xN,x1) K(xN,x2) K(xN,x3) … K(xN,xN)
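In practice Mercer's condition can be sanity-checked numerically: build the Gram matrix for a candidate kernel and confirm its eigenvalues are (essentially) non-negative. A minimal sketch assuming numpy; the RBF kernel and the random data are just for illustration:

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # Gram matrix

eigvals = np.linalg.eigvalsh(K)                # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())   # non-negative up to round-off
```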
Examples of Kernel Functions
• Linear: K(xi, xj) = xiTxj

• Polynomial of power p: K(xi, xj) = (1 + xiTxj)p

• Gaussian (radial-basis function network):

    K(xi, xj) = exp( - ||xi - xj||² / (2σ²) )

• Sigmoid: K(xi, xj) = tanh(β0 xiTxj + β1)


Non-linear SVMs Mathematically
 Dual problem formulation:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

 The solution is:

f(x) = ΣαiyiK(xi, x) + b

 Optimization techniques for finding αi’s remain the same!


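A minimal sketch (assuming scikit-learn) contrasting a linear kernel with an RBF kernel on data that no linear classifier can separate; the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel training accuracy:", linear.score(X, y))   # near chance level
print("RBF kernel training accuracy:   ", rbf.score(X, y))      # close to 1.0
```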
Nonlinear SVM - Overview
 SVM locates a separating hyperplane in the feature
space and classifies points in that space
• It does not need to represent that space explicitly; it
only needs to define a kernel function
 The kernel function plays the role of the dot product
in the feature space.
Properties of SVM
• Flexibility in choosing a similarity function
• Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating hyperplane
• Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the feature space
• Overfitting can be controlled by soft margin approach
• Nice math property: a simple convex optimization problem which is
guaranteed to converge to a single global solution
• Feature Selection
SVM Applications
• SVM has been used successfully in many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Application 1: Cancer Classification
• High dimensional
- p > 1000 genes; n < 100 patients
(Table: patients p-1 ... p-n against genes g-1, g-2, ..., g-p)
• Imbalanced
- fewer positive samples; one remedy is a diagonal kernel correction K[x, x] = k(x, x) + n/N
• Many irrelevant features → FEATURE SELECTION
- in the linear case, wi² gives the ranking of dimension i
• Noisy
- SVM is sensitive to noisy (mis-labeled) data
Weakness of SVM
• It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically
decrease the performance

• It only considers two classes


- how to do multi-class classification with SVM?
- Answer:
1) with output arity m, learn m SVM’s
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
• :
• SVM m learns “Output==m” vs “Output != m”
2) To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive region.
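A minimal sketch of this one-vs-rest scheme (assuming scikit-learn; the iris dataset and the predict helper are only for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary SVM per class: "Output == c" vs "Output != c"
models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}

def predict(x):
    # Pick the class whose SVM pushes x furthest into its positive region.
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print("training accuracy:", (preds == y).mean())
```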
Road Map
Part – III
• Support Vector Machine
• Vector Space Model
• Latent Semantic Analysis
VECTOR SPACE MODEL
Bag of Words (BoW) Model
• Vector representation doesn’t consider the ordering of words
in a document

John is quicker than Mary and Mary is quicker than John


have the same vectors
Term frequency - tf
• tft,d of term t in document d is defined as the number of times that
t occurs in d.

• We want to use tf when computing match scores. But how?

• Raw term frequency is not what we want:


• A document with 10 occurrences of the term is more relevant than a
document with 1 occurrence of the term.
• But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.


Log-frequency weighting
• The log frequency weight of term t in d is

    wt,d = 1 + log10 tft,d   if tft,d > 0
         = 0                 otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

• Score for a query-document pair: sum over terms t in both q and d:

    score(q, d) = Σt∈q∩d (1 + log tft,d)

• The score is 0 if none of the terms is present in the document.
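A tiny sketch reproducing the weight values quoted above:

```python
import math

def log_tf_weight(tf):
    # 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, "->", round(log_tf_weight(tf), 1))   # 0, 1, 1.3, 2, 4
```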


Document frequency-df
• Rare terms are more informative than frequent terms
• Recall stop words

• Consider a term in the query that is rare in the collection (e.g., arachnocentric)

• A document containing such a term is more likely to be relevant than a


document that doesn’t

• But it’s not a sure indicator of relevance.

• For frequent terms, we want high positive weights for words like high, increase,
and line, but lower weights than for rare terms.

• We will use document frequency (df) to capture this.


Inverse Document Frequency - idf
• dft is the document frequency of t: the number of documents that contain t
• dft is an inverse measure of the informativeness of t
• dft ≤ N
• We define the idf (inverse document frequency) of t by

    idft = log10(N/dft)

• We use log(N/dft) instead of N/dft to "dampen" the effect of idf.

The base of the log is immaterial.


idf example, suppose N = 1 million

term        dft          idft
calpurnia   1            6
animal      100          4
sunday      1,000        3
fly         10,000       2
under       100,000      1
the         1,000,000    0

    idft = log10(N/dft)

There is one idf value for each term t in a collection.
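The idf column above follows directly from the formula (a minimal sketch):

```python
import math

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,}  idf={math.log10(N / df):g}")
```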
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf
weight.

w t ,d  log(1  tf t ,d )  log10 ( N / df t )
• Best known weighting scheme in information retrieval

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection


Score for a Match

    Score(q, d) = Σt∈q∩d tf.idft,d

• There are many variants


• How “tf” is computed (with/without logs)
• Whether the terms in the query are also
weighted
•…
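A minimal sketch of this overlap score, using the log-tf variant and base-10 idf; the three tiny documents are illustrative only:

```python
import math
from collections import Counter

docs = {
    "d1": "romeo and juliet".split(),
    "d2": "juliet o happy dagger".split(),
    "d3": "romeo died by dagger".split(),
}
N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))   # document frequency

def score(query, words):
    tf = Counter(words)
    return sum((1 + math.log10(tf[t])) * math.log10(N / df[t])
               for t in set(query) if t in tf)

for name, words in docs.items():
    print(name, round(score(["died", "dagger"], words), 3))   # d3 scores highest
```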
Binary → count → weight matrix

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      5.25                   3.18            0             0        0         0.35
Brutus      1.21                   6.1             0             1        0         0
Caesar      8.59                   2.54            0             1.51     0.25      0
Calpurnia   0                      1.54            0             0        0         0
Cleopatra   2.85                   0               0             0        0         0
mercy       1.51                   0               1.9           0.12     5.25      0.88
worser      1.37                   0               0.11          4.15     0.25      1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|
Documents as vectors, Proximity
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space

• Very high-dimensional: tens of millions of dimensions when you


apply this to a web search engine
• These are very sparse vectors - most entries are zero.

• proximity = similarity of vectors


• proximity ≈ inverse of distance
• First cut: distance between two points
( = distance between the end points of the two vectors)
• Euclidean distance? ... a bad idea, because Euclidean distance is large
for vectors of different lengths.
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Angle instead of distance
• Thought experiment: take a document d and append it to
itself. Call this document d′.
• “Semantically” d and d′ have the same content
• Euclidean distance between the two documents can be quite
large
• Angle between the two documents is 0, corresponding to
maximal similarity.
• Cosine is a monotonically decreasing function for the interval
[0°, 180°]
Length normalization
• A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

    ||x||2 = sqrt( Σi xi² )

• Dividing a vector by its L2 norm makes it a unit (length) vector
(on the surface of the unit hyper-sphere)

• Effect on the two documents d and d′ (d appended to itself) from


earlier slide: they have identical vectors after length-normalization.
• Long and short documents now have comparable weights
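A minimal sketch (assuming numpy) of the earlier thought experiment: after L2 normalization, a document and the same document appended to itself (raw counts doubled) have identical vectors:

```python
import numpy as np

d = np.array([3.0, 2.0, 0.0, 1.0])   # toy term-count vector for document d
d_doubled = 2 * d                     # d appended to itself doubles every count

def unit(v):
    return v / np.linalg.norm(v)      # divide by the L2 norm

print(unit(d))
print(unit(d_doubled))
print(np.allclose(unit(d), unit(d_doubled)))   # True
```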
Cosine Similarity (for length-normalized vectors)

    cos(q, d) = (q · d) / (|q| |d|) = Σi=1..|V| qi di / ( sqrt(Σi=1..|V| qi²) · sqrt(Σi=1..|V| di²) )

(the numerator is the dot product of q and d; dividing by the vector lengths turns q and d into unit vectors)

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q, d) = q · d = Σi=1..|V| qi di

qi is the tf-idf weight of term i in the query q
di is the tf-idf weight of term i in the document d

cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
Cosine Similarity
Cosine similarity amongst 3 documents
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: To simplify this example, we don't do idf weighting.
3 documents example contd.

Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈
0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0
≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
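A minimal sketch (assuming numpy) reproducing this worked example end to end: log-frequency weighting, L2 normalization, then cosine similarity as a plain dot product:

```python
import numpy as np

counts = {                     # term frequencies from the table above
    "SaS": [115, 10, 2, 0],    # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_tf(v):
    return np.array([1 + np.log10(c) if c > 0 else 0.0 for c in v])

vecs = {name: log_tf(v) / np.linalg.norm(log_tf(v)) for name, v in counts.items()}

print("cos(SaS,PaP) =", round(vecs["SaS"] @ vecs["PaP"], 2))   # ~0.94
print("cos(SaS,WH)  =", round(vecs["SaS"] @ vecs["WH"], 2))    # ~0.79
print("cos(PaP,WH)  =", round(vecs["PaP"] @ vecs["WH"], 2))    # ~0.69
```
SaS and PaP come out most similar, reflecting that the two Austen novels share vocabulary that Wuthering Heights does not.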
Problems – High Dimensionality
• Term-document matrices are very large

• But the number of topics that people


talk about is small (in some sense)
• Clothes, movies, politics, …

• Can we represent the term-document


space by a lower dimensional latent
space?
Problems - Lexical Semantics
• Ambiguity and association in natural language
• Polysemy: Words often have a multitude of
meanings and different types of usage (more
severe in very heterogeneous collections).

• The vector space model is unable to discriminate


between different meanings of the same word.
Problems - Lexical Semantics
• Synonymy: Different terms may have
identical or similar meanings (weaker:
words indicating the same topic).

• No associations between words are made


in the vector space representation.
Road Map
Part – III
• Support Vector Machine
• Vector Space Model
• Latent Semantic Analysis
Latent Semantic Analysis (LSA)
LSA
• Perform a low-rank approximation of document-term matrix (typical rank
100-300)

• General idea
• Map documents (and terms) to a low-dimensional representation.
• Design a mapping such that the low-dimensional space reflects semantic
associations (latent semantic space).
• Compute document similarity based on the inner product in this latent
semantic space

• Similar terms map to similar location in low dimensional space

• Noise reduction by dimension reduction

Latent Semantic Indexing was developed at Bellcore (now Telcordia) in the late 1980s (1988). It was patented in 1989.
http://lsi.argreenhouse.com/lsi/LSI.html
LSA
• But first:
• What is the difference between LSI and LSA???
• LSI refers to using it for indexing or information retrieval.
• LSA refers to everything else.

LSA Idea (Deerwester et al):

“We would like a representation in which a set of terms, which by itself is


incomplete and unreliable evidence of the relevance of a given document, is
replaced by some other set of entities which are more reliable indicants. We take
advantage of the implicit higher-order (or latent) structure in the association of
terms and documents to reveal such relationships.”
LSA
• Four basic steps
• Rank-reduced Singular Value Decomposition (SVD) performed on matrix
• all but the k highest singular values are set to 0
• produces k-dimensional approximation of the original matrix (in least-squares
sense)
• this is the “semantic space”
• Compute similarities between entities in semantic space (usually with cosine)
LSA
• SVD
• unique mathematical decomposition of a matrix into the product of
three matrices:
• two with orthonormal columns
• one with singular values on the diagonal
• tool for dimension reduction
• similarity measure based on co-occurrence
• finds optimal projection into low-dimensional space
• can be viewed as a method for rotating the axes in n-dimensional
space, so that the first axis runs along the direction of the largest
variation among the documents
• the second dimension runs along the direction with the second largest
variation
• and so on
• generalized least-squares method
What it is
• From term-doc matrix A, we compute the
approximation Ak.
• There is a row for each term and a column
for each doc in Ak
• Thus docs live in a space of k<<r
dimensions
• These dimensions are not the original
axes
• But why?
Performing the maps
• Each row and column of A gets mapped into the k-dimensional
LSI space, by the SVD.
• Claim – this is not only the mapping with the best (Frobenius
error) approximation to A, but in fact improves retrieval.
• A query q is also mapped into this space, by

    qk = qT Uk Σk^-1

• The mapped query qk is NOT a sparse vector.
Performing the maps
• ATA is the matrix of dot products between pairs of documents:

    ATA ≈ AkTAk = (UkΣkVkT)T (UkΣkVkT)
                = VkΣkUkT UkΣkVkT
                = (VkΣk) (VkΣk)T

• Since Vk = AkT Uk Σk^-1, we should transform a query q to qk in the same way:

    qk = qT Uk Σk^-1
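A minimal sketch (assuming numpy) of these maps: truncate the SVD to k dimensions, place the documents in the latent space, and fold a query in via qk = qT Uk Σk^-1. The toy matrix and query are made up for illustration:

```python
import numpy as np

# Toy 6-term x 4-document count matrix (illustrative only).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 2, 1],
              [0, 0, 1, 2]], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

docs_k = Sk @ Vtk                      # column j: document j in the k-dim LSI space
q = np.zeros(6); q[0] = q[4] = 1.0     # toy query containing terms 0 and 4
q_k = q @ Uk @ np.linalg.inv(Sk)       # fold the query into the same space

# Rank documents by cosine similarity to the query in the latent space.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.round(sims, 3))
```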
Singular Value Decomposition
For an M × N matrix A of rank r there exists a factorization
(Singular Value Decomposition = SVD) as follows:

    A = U Σ VT      (U is M × M, Σ is M × N, V is N × N)

The columns of U are orthogonal eigenvectors of AAT.
The columns of V are orthogonal eigenvectors of ATA.
The eigenvalues λ1 ... λr of AAT are also the eigenvalues of ATA, and

    σi = sqrt(λi),     Σ = diag(σ1 ... σr)      (the singular values)
Singular Value Decomposition
• Illustration of SVD dimensions and sparseness
SVD example
Let

    A = [ 1  1
          0  1
          1  0 ]

Thus M = 3, N = 2. Its SVD is A = U Σ VT with

    U = [ 2/√6    0     1/√3
          1/√6  -1/√2  -1/√3
          1/√6   1/√2  -1/√3 ]

    Σ = [ √3  0
           0  1
           0  0 ]

    VT = [ 1/√2   1/√2
           1/√2  -1/√2 ]

Typically, the singular values are arranged in decreasing order.
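The example can be checked with numpy (a minimal sketch; numpy may pick different but equally valid signs for the singular vectors):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)
print("singular values:", s)                          # [sqrt(3), 1]
print(np.allclose(U[:, :2] @ np.diag(s) @ Vt, A))     # True: U Sigma V^T rebuilds A
```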
Low-rank Approximation
• SVD can be used to compute optimal low-rank approximations.
• Approximation problem: Find Ak of rank k such that

    Ak = argmin_{X: rank(X) = k} ||A - X||_F        (Frobenius norm)

Ak and X are both M × N matrices.
Typically, we want k << r.
Low-rank Approximation
• Solution via SVD:

    Ak = U diag(σ1, ..., σk, 0, ..., 0) VT

  (set the smallest r - k singular values to zero)

    Ak = Σi=1..k σi ui viT

  (column notation: sum of k rank-1 matrices)
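A minimal sketch (assuming numpy) of the rank-k construction and the Frobenius error it achieves:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))      # toy matrix, illustrative only
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k largest singular values

print("rank of A_k:", np.linalg.matrix_rank(Ak))                      # k
print("Frobenius error:", np.linalg.norm(A - Ak, "fro"))
print("sqrt of discarded sigma_i^2:", np.sqrt((s[k:] ** 2).sum()))    # equals the error
```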
Reduced SVD
• If we retain only k singular values, and set the rest to 0, then we
don’t need the matrix parts in red
• Then Σ is k×k, U is M×k, VT is k×N, and Ak is M×N
• This is referred to as the reduced SVD
• It is the convenient (space-saving) and usual form for computational
applications

Computing Similarity in LSI
Approximation error
• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

    min_{X: rank(X) = k} ||A - X||_F = ||A - Ak||_F = sqrt( σk+1² + ... + σr² )

where the σi are ordered such that σi ≥ σi+1.

• This suggests why the Frobenius error drops as k increases.
SVD Low-rank approximation
• Whereas the term-doc matrix A may have M=50000, N=10
million (and rank close to 50000)
• We can construct an approximation A100 with rank 100.
• Of all rank 100 matrices, it would have the lowest Frobenius
error.

• Great … but why would we??


• Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of a matrix by another of lower rank.


Psychometrika, 1, 211-218, 1936.
Example
• d1 : Romeo and Juliet.
• d2 : Juliet: O happy dagger!
• d3 : Romeo died by dagger.
• d4 : “Live free or die”, that’s the New-Hampshire’s motto.
• d5 : Did you know, New-Hampshire is in New-England.

• and a search query: dies, dagger.


Example
• B = ATA is the document-document matrix.
If documents i and j have b words in common then B[i, j] = b.

• C = AAT is the term-term matrix.


If terms i and j occur together in c documents then C[i, j] = c.

Clearly, both B and C are square and symmetric;


• B is an m × m matrix, whereas C is an n × n matrix.
Consider the SVD of A written as

    A = S Σ UT

where:

• S is the matrix of the eigenvectors of B


• U is the matrix of the eigenvectors of C
•  is the diagonal matrix of the singular values obtained as
square roots of the eigenvalues of B
• the singular values along the diagonal are listed in descending
order of their magnitude.

• For small SVD calculations, you can use the BlueBit calculator at
http://www.bluebit.gr/matrix-calculator
• Some of the singular values are "too small" and thus
"negligible"; what counts as "too small" is usually determined
empirically.

• In LSI we ignore these small singular values and replace them


by 0.

• We only keep k singular values. Then Σ will be all zeros except
the first k entries along its diagonal.
• We can reduce matrix Σ into Σk, which is a k × k matrix
containing only the k largest singular values.
• We reduce S and UT into Sk and UTk, to have k columns and k
rows, respectively.
• Of course, all these matrix parts that we throw out would have
been zeroed anyway by the zeros in Σ.
• Ak is again an m × n matrix.
• Intuitively, the k remaining ingredients of the eigenvectors in S
and U correspond to k "hidden concepts" in which the terms
and documents participate. The terms and documents now have
a new representation in terms of these hidden concepts:
the terms are represented by the row vectors of the m × k
matrix SkΣk, and the documents by the column vectors of the
k × n matrix ΣkUTk.
Example (Cont.)

Terms representation

Documents representation
Empirical evidence
• Precision at or above median TREC precision
• Top scorer on almost 20% of TREC topics
• Slightly better on average than straight vector
spaces
• Effect of dimensionality:

Dimensions Precision
250 0.367
300 0.371
346 0.374
But why is this clustering?
• We’ve talked about docs, queries,
retrieval and precision here.
• What does this have to do with
clustering?
• Intuition: Dimension reduction through
LSI brings together “related” axes in the
vector space.
Intuition from block matrices

(Figure: an M-terms × N-documents matrix arranged as k homogeneous non-zero blocks along its diagonal (Block 1, Block 2, ..., Block k), with 0's everywhere else.)

What's the rank of this matrix?
Intuition from block matrices

(Figure: the same M-terms × N-documents block-diagonal matrix.)

Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
Intuition from block matrices

Likely there's a good rank-k approximation to this matrix.

(Figure: the blocks are no longer exactly separated; each has only a few nonzero entries outside it. Block 1 holds terms like wiper, tire, V6; elsewhere, car appears in some documents and automobile in others.)
Simplistic picture
Topic 1

Topic 2

Topic 3
Some wild extrapolation
• The “dimensionality” of a corpus is the
number of distinct topics represented in
it.
• More mathematical wild extrapolation:
• if A has a rank k approximation of low
Frobenius error, then there are no more
than k distinct topics in the corpus.
LSI has many other
applications
• In many settings in pattern recognition and retrieval, we have
a feature-object matrix.
• For text, the terms are features and the docs are objects.
• Could be opinions and users …
• This matrix may be redundant in dimensionality.
• Can work with low-rank approximation.
• If entries are missing (e.g., users’ opinions), can recover if
dimensionality is low.
• Powerful general analytical technique
• Close, principled analog to clustering methods.
