
SEMINAR: GRAPH-BASED METHODS FOR NLP

Organizational notes:

•  The seminar takes place entirely in May
•  Seminar papers are due by July 15 (?)
•  Help with the seminar talk / paper is available on the website
•  TUCaN number for registration: 20-00-0596-se



Schedule

Motivation for graph representation
§  Graphs are an intuitive and natural way to encode entities (e.g. language units) as nodes and their relations (e.g. similarities) as edges (directed / undirected)
§  a feature-based representation can be transformed into a graph via a similarity measure
§  graphs may not necessarily be transformed back into a feature representation (at least not a unique one). Think of e.g. points in n-dimensional space.
 

Graph  isomorphism  



Graph representations

A graph can be stored either as an Adjacency Matrix or as an Adjacency List. Additional information such as edge weights can be stored easily in both representations.
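As an illustration, a minimal Python sketch (using a small made-up graph) of both representations, storing edge weights directly:

    # A small hypothetical weighted, undirected graph stored both ways.
    import numpy as np

    nodes = ["A", "B", "C", "D"]
    edges = [("A", "B", 0.5), ("B", "C", 1.0), ("A", "C", 0.2)]  # (u, v, weight)
    idx = {n: i for i, n in enumerate(nodes)}

    # Adjacency matrix: O(|V|^2) space, O(1) edge lookup
    matrix = np.zeros((len(nodes), len(nodes)))
    for u, v, w in edges:
        matrix[idx[u], idx[v]] = matrix[idx[v], idx[u]] = w  # undirected: symmetric

    # Adjacency list: O(|V| + |E|) space, fast iteration over neighbors
    adj_list = {n: [] for n in nodes}
    for u, v, w in edges:
        adj_list[u].append((v, w))
        adj_list[v].append((u, w))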



Motivation for graph representation
There exist efficient algorithms that directly operate on graphs.



Efficient Algorithms?

P = NP?



Efficient Algorithms!

There are efficient (polynomial) algorithms for the exact solution of many problems on graphs, e.g.
•  Graph Traversal (DFS, Shortest Paths, Max-Capacity Paths, …) (see the sketch below)
•  Optimal Trees and Branchings (MST, MAX-FOREST, MAX-BRANCHING, …)
•  Graph Clustering (Min-Cut, Markov Clustering, Chinese Whispers, …)
•  Graph Ranking (PageRank, Random Walks, Markov Chain Theory)
•  Graph Distances (local: Paths, global: Graph Edit Distance, …)
•  Flows on Graphs (MAX-FLOW, MIN-COST FLOW, …)
•  Matching and Assignment (Hungarian Method, Edmonds' Algorithm)
•  many more
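As a concrete illustration of one of these (shortest paths), a minimal sketch assuming the networkx library and a small made-up weighted graph:

    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 4.0)])

    # Dijkstra's algorithm: polynomial, O(|E| + |V| log |V|) with a heap
    path = nx.dijkstra_path(G, "A", "C", weight="weight")
    print(path)  # ['A', 'B', 'C']: going via B (cost 3.0) beats the direct edge (4.0)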
 
 
 
Efficient Algorithms!

There are efficient approximation algorithms and heuristics for the approximate solution of many graph problems, e.g.
•  Subgraph Problems (Dense Subgraphs, Minors, …)
•  Optimal Tour Problems (TSP, PCTSP, VRP, …)
•  Steiner Trees
•  many more

There are simple heuristics that often yield quite good results, such as for example k-OPT for the Euclidean TSP (a sketch follows below).
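A minimal sketch of 2-OPT (the k = 2 case of k-OPT) on made-up points: repeatedly reverse a segment of the tour whenever that shortens it.

    import itertools, math

    points = [(0, 0), (1, 5), (2, 1), (5, 4), (6, 0)]  # hypothetical cities

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def tour_length(tour):
        return sum(dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    tour = list(range(len(points)))      # start with an arbitrary tour
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, len(tour)), 2):
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            if tour_length(candidate) < tour_length(tour):
                tour, improved = candidate, True
    print(tour, tour_length(tour))       # a local optimum, not necessarily optimal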
 



Why efficiency is crucial

Graphs are usually large-scale
•  In 2008, the English Wikipedia had 2,301,486 articles* with 55,550,003 links between them

Graphs are usually dense and strongly connected
•  The largest strongly connected component of Wikipedia has 2,111,480 articles.

Remember from the last lecture
•  Graphs in NLP are usually scale-free and have the small world property (high clustering coefficient)

→ Problem solutions often consider only small subgraphs (local neighborhoods), but an a priori partitioning is usually not possible (this yields small time complexity but full space complexity)

* by today there are almost 4 million articles
 
PageRank
§  First-generation Google global ranking algorithm (1998)
§  Measures the (query-independent) importance of a Web page based solely on the link structure.
§  Assigns each node a numerical score between 0 and 1, its PageRank.
§  Ranks Web pages based on PageRank values.

General Idea:
§  every page has a number of in-links (back links) and out-links (forward links)
§  pages with more in-links are more important
§  in-links from important pages are more important





Definition of PageRank

u: a web page, R(u) its PageRank
Fu: set of pages u points to (forward links)
Bu: set of pages that point to u (backward links)
|Fu|: the number of links from u
N: total number of pages
d: damping factor, default d = 0.85

$$R(u) = d \cdot \sum_{v \in B_u} \frac{R(v)}{|F_v|} + \frac{1-d}{N}$$

[Figure: example link graph over pages A, B, C, D and X]

§  The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
§  Rank sink problem: a cycle of pages that accumulates rank within the cycle, but never distributes rank outside
§  Need damping: a uniform rank distribution over all pages
Random Surfer Model

§  When normalizing PageRank over all pages to 1, R(u) can be thought of as the probability that a random surfer looks at page u.
§  Damping corresponds to "teleportation": with probability (1 − d), the random surfer is teleported to some other page chosen uniformly at random

[Figure: example link graph over pages A, B, C, D and X]


Computation of PageRank

§  Numeric: simulate a lot of random surfers: the Power method of Eigenvector computation
§  initialize all pages with the same rank
§  repeat until convergence:
§  for all pages u: compute Rt+1(u) on the basis of Rt(v)
§  t := t + 1

input : transition matrix M of size N × N, error tolerance ε
output: eigenvector p

p0 = (1/N) · 1
t = 0
repeat until δ < ε:
    t = t + 1
    pt = MT pt−1
    δ = ||pt − pt−1||
return pt
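A minimal runnable sketch of this power iteration with NumPy, using a small hypothetical adjacency matrix and the damped update from the PageRank definition:

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)  # A[i, j] = 1 if page i links to page j
    M = A / A.sum(axis=1, keepdims=True)    # row-normalize: out-link probabilities

    N, d, eps = A.shape[0], 0.85, 1e-9
    p = np.full(N, 1.0 / N)                 # p0: uniform initial rank
    while True:
        p_next = d * M.T @ p + (1 - d) / N  # damped power iteration step
        if np.linalg.norm(p_next - p) < eps:
            break
        p = p_next
    print(p_next)                           # PageRank scores, summing to ~1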
LexRank:
Application to Multi-Document Summarization

Multi-document summarization task:
1.  identify important topics of the documents to be summarized
2.  identify sentences belonging to a certain topic
3.  from these sentences belonging to the same topic, select the ones that best describe the topic
4.  concatenate sentences from different topics and make sure they fit together

Consider sub-problem 3:

Input: sentences that talk about more or less the same thing

Output: scores for those sentences that reflect how well a single sentence represents that topic

Solution idea: use measures on a sentence similarity graph
From Sentences to TF*IDF vectors

[Figure: three example sentences ("This is a sentence that talks about some topic.", "And here is another sentence that talks about something slightly different.", "And here is yet another one of these notorious sentences") mapped to TF count vectors over words w1..wn and then to TF*IDF feature vectors; the DF row gives each word's document frequency]

This is the same as the vector space model for Information Retrieval.

$$\mathrm{IDF}(w) = \log\left(\frac{\text{total number of sentences}}{\mathrm{DF}(w)}\right)$$
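A minimal sketch, assuming scikit-learn (whose TfidfVectorizer uses a smoothed IDF variant of the formula above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [
        "This is a sentence that talks about some topic.",
        "And here is another sentence that talks about something slightly different.",
        "And here is yet another one of these notorious sentences.",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences)  # sparse matrix: one TF*IDF row per sentence
    print(X.shape)                           # (3, vocabulary size)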
From TF*IDF vectors to sentence similarity graph

§  Sentence similarity graph:
§  nodes: sentences
§  edges: cosine similarity between sentence feature vectors
§  Can apply a threshold on similarity or use similarity as edge weight (see the sketch below)
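A minimal sketch continuing the TF*IDF example above, assuming networkx; edges carry cosine similarity as weight and a hypothetical threshold prunes weak edges:

    import networkx as nx
    from sklearn.metrics.pairwise import cosine_similarity

    sim = cosine_similarity(X)     # pairwise cosine similarities between sentences
    G = nx.Graph()
    threshold = 0.1                # hypothetical cutoff
    for i in range(sim.shape[0]):
        for j in range(i + 1, sim.shape[0]):
            if sim[i, j] > threshold:
                G.add_edge(i, j, weight=sim[i, j])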



Measures: Centroid, Degree and Centrality

§  Centroid
§  Idea: select an average sentence. Compute the average point of the sentence vectors (the centroid)
§  select the sentence that is most similar to the centroid for summarization
§  Degree Centrality
§  Idea: sentences that cover most of the content have a high node degree (number of edges): since word overlap is responsible for edges, node degree measures word overlap with the overall set of sentences
§  for summarization, choose the sentence with the highest degree
§  LexRank Centrality
§  Idea: it does not suffice to be similar to many sentences: similarity to important sentences counts more.
§  normalize the sentence similarity (adjacency) matrix to make it a stochastic matrix
§  run PageRank to obtain scores that are used for ranking the sentences
§  for summarization, choose the sentence with the highest score (see the sketch below)
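A minimal LexRank-style sketch, continuing the variables from the sketches above: make the similarity matrix stochastic and run the PageRank iteration on it.

    import numpy as np

    W = sim.copy()
    np.fill_diagonal(W, 0.0)                # ignore self-similarity
    P = W / W.sum(axis=1, keepdims=True)    # row-stochastic similarity matrix

    N, d = P.shape[0], 0.85
    r = np.full(N, 1.0 / N)
    for _ in range(100):                    # a fixed number of power iterations
        r = d * P.T @ r + (1 - d) / N
    print(int(np.argmax(r)))                # index of the most representative sentence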
Evaluation of graph-based multi-document summarization

§  Scores: ROUGE metric: similar to BLEU, computed between manual summaries and system summaries
§  random baseline: select any sentence from the set by chance
§  lead-based: select based on the position of the sentence within the document
→ LexRank is a simple method for getting high scores. It uses the whole structure of the graph, as opposed to Centroid or Degree.
This technique also works well for single-document summarization.



TextRank for Keyword Extraction

§  Keyword extraction: find the most salient keywords for a document
§  Keyword extraction with PageRank:
§  preprocess the document: identify adjectives and nouns as targets
§  target co-occurrence graph: targets co-occurring within a window of 2-10 words
§  apply PageRank to get ranking scores on the nodes
§  select the highest scoring keywords, possibly concatenating ADJ-NOUN-NOUN sequences if present in the text (see the sketch below)
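A minimal TextRank-style sketch, assuming networkx and a hypothetical, already POS-filtered target sequence (adjectives and nouns) from some document:

    import networkx as nx

    targets = ["graph", "method", "natural", "language", "graph",
               "clustering", "language", "method"]   # made-up filtered tokens
    window = 3

    G = nx.Graph()
    for i, w in enumerate(targets):
        for j in range(i + 1, min(i + window, len(targets))):
            if w != targets[j]:
                G.add_edge(w, targets[j])            # co-occurrence within the window

    scores = nx.pagerank(G)                          # ranking scores on the nodes
    print(sorted(scores, key=scores.get, reverse=True)[:3])  # top keywords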



Keyword Extraction Evaluation

§  Comparison: a supervised system that is trained on manually assigned keywords, using frequency and contextual features
§  Note that TextRank is unsupervised: no training necessary



Graph Clustering

§  Task: find meaningful groups of nodes in a graph by cutting edges
§  Intuition: connectedness within a cluster is higher than between clusters
§  Many graph clustering algorithms find the number of clusters automatically

[Figure: example graph partitioned into clusters]

http://elisa.dyndns-web.com/~elisa/publications/



Clustering by Min-Cut / Max-Flow
§  MinCut algorithm: hierarchical top-down clustering
§  compute the minimum cut: the set of edges with the smallest edge weight sum whose removal disconnects one set of nodes from another
§  recursively apply this to the components that got disconnected (see the sketch below)
§  Finding the minimum cut is equivalent to finding the maximum flow in a network
§  Advantage: efficient. Fastest known algorithm with per-cut complexity O(|E| + log³(|V|))
§  Disadvantages:
§  unbalanced cuts
§  when to stop?
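A minimal sketch of one clustering step, assuming networkx: the Stoer-Wagner algorithm computes a global minimum cut of a small made-up weighted graph; recursing on the two sides would give the hierarchical clustering.

    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([("a", "b", 3), ("b", "c", 1),
                               ("c", "d", 3), ("a", "c", 1)])

    cut_value, (part1, part2) = nx.stoer_wagner(G)  # cut weight and the two components
    print(cut_value, part1, part2)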

http://scienceblogs.com/goodmath/2007/08/maximum_flow_and_minimum_cut_1.php



Markov Chain Clustering
http://micans.org/mcl/

§  Clustering based on random walks: MCL is the parallel simulation of all possible random walks up to a finite length on a graph G
§  Idea: a random walker on the graph is more likely to stay within the same cluster than to end up in a different cluster after a small number of steps
§  Algorithm: one can show convergence to a limit T

Add loops: transition matrix T = column-normalize(AG + I)
MCL process: alternate between
    T = T^t        // expansion: raise T to the power t
    T = inflate(T) // inflation: increase contrast within columns by raising
                   // values to the power s (s > 1) and normalizing column-wise
Interpret T as a clustering: use the strongest connection as the label

Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
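A minimal NumPy sketch of the MCL process above; t and s are the expansion and inflation parameters (hypothetical typical values shown):

    import numpy as np

    def mcl(A, t=2, s=2.0, iters=50):
        T = A + np.eye(A.shape[0])             # add loops
        T = T / T.sum(axis=0, keepdims=True)   # column-normalize
        for _ in range(iters):
            T = np.linalg.matrix_power(T, t)   # expansion: t random-walk steps
            T = T ** s                         # inflation: increase column contrast
            T = T / T.sum(axis=0, keepdims=True)
        return T.argmax(axis=0)                # strongest connection as cluster label

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)  # made-up adjacency matrix
    print(mcl(A))                              # cluster label per node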



Expansion step: simulate the random walk

§  (stochastic) adjacency matrix T: probabilities to walk from the node in the column to the node in the row in a single step.
§  T²: probabilities to walk from a node A to a node B in 2 steps.

[Figure: adjacency matrix AG with loops added, the resulting transition matrix T, and T²]



Inflation Step: only keep attractors

[Figure: each column entry is squared, then the column is normalized]

§  Inflate the differences within a column by taking the k-th power of each value, then normalize to ensure the stochastic property. k regulates the cluster sizes
§  Clustering: the highest entry in a column vector is the cluster label. Variants:
§  could add small random noise to break ties
§  Optimization: only keep the K largest values, or only keep values over a threshold
Chinese Whispers Graph Clustering

§  MCL: keep only a few strong neighbors
§  Chinese Whispers: only propagate the strongest label in the neighborhood

initialize:
    forall vi in V: class(vi) = i;
while changes:
    forall v in V, randomized order:
        class(v) = highest ranked class in neighborhood of v;

§  Nodes have a class and communicate it to their adjacent nodes
§  A node adopts the majority class in its neighbourhood
§  Nodes are processed in random order for some iterations (see the sketch below)
§  Node weighting schemes

[Figure: example graph with labeled nodes A-E, edge weights, node degrees, and class labels L1-L4]
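A minimal Chinese Whispers sketch, assuming networkx and a small made-up weighted graph; classes propagate in randomized order until nothing changes.

    import random
    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([("A", "B", 5), ("B", "C", 6), ("A", "C", 3),
                               ("D", "E", 8), ("C", "D", 1)])

    labels = {v: i for i, v in enumerate(G)}   # initialize: every node its own class
    for _ in range(20):                        # a few randomized iterations
        changed = False
        order = list(G)
        random.shuffle(order)
        for v in order:
            strength = {}                      # edge-weight sum per neighboring class
            for u in G[v]:
                w = G[v][u].get("weight", 1.0)
                strength[labels[u]] = strength.get(labels[u], 0.0) + w
            if strength:
                best = max(strength, key=strength.get)
                if labels[v] != best:
                    labels[v], changed = best, True
        if not changed:
            break
    print(labels)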
Disambiguation using Resource Graphs



Disambiguation of Named Entities
using Resource Graphs

Wikipedia Link Graph

(Shortest) paths are one possibility



Disambiguation of Named Entities
using Resource Graphs
(Shortest) paths are one possibility (see the sketch below). What else?
•  maximum capacity paths (capacities needed, e.g. coherence, probabilities, ...)
•  maximum flows (Attention: small world graph! Path lengths must be bounded!)
•  apply PageRank to weight nodes

Semantic enrichment:
•  Use the nodes on the paths / flows for enrichment, to overcome the knowledge acquisition bottleneck
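A minimal sketch of path-based disambiguation, assuming networkx and a tiny hypothetical resource (link) graph: candidate senses of an ambiguous mention are scored by shortest-path distance to an already resolved context entity.

    import networkx as nx

    link_graph = nx.Graph()                    # stand-in for a Wikipedia link graph
    link_graph.add_edges_from([("Java_(language)", "Programming"),
                               ("Programming", "Software"),
                               ("Java_(island)", "Indonesia"),
                               ("Indonesia", "Geography")])

    context = "Software"                       # entity already resolved from the text

    def score(candidate):
        try:
            return nx.shortest_path_length(link_graph, candidate, context)
        except nx.NetworkXNoPath:
            return float("inf")                # disconnected candidates score worst

    print(min(["Java_(language)", "Java_(island)"], key=score))  # Java_(language)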



Summary on Graph Methods in NLP

§  Graph representation is a natural representation of entities and their relations
§  We can use well-known (efficient) graph algorithms to solve specific NLP problems
§  By taking the overall graph structure into account, some NLP tasks can be improved (enriching semantics)
§  Graph clustering algorithms solve unsupervised NLP tasks without the need to specify the number of clusters
§  We can enrich information by walks on graphs
 

