You are on page 1of 2

SimRank: A Measure of Structural-Context Similarity

Introduction

Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other
kind of dataset).

Link to the paper

SimRank

Input: A directed graph G = (V, E) where V represents vertices and E represents edges.

SimRank defines similarity between 2 vertices (or nodes) i and j as the average of the similarity between
their in-neighbours decayed by a constant factor C.

Any node is maximally similar to itself (with similarity = 1).

PageRank analyses the individual vertices of the graph with respect to the global structure, while
SimRank analyses the relationship between a pair of vertices (edges).

SimRank scores are symmetric and can be defined between all pair of vertices.

G2 is defined as the node pair graph such that each node in G2 corresponds to an ordered pair of nodes
of G and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c)
and (b, d).

In G2, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of
similarity.

Variants

Minimax Variant

Defines similarity of nodes i and j as the minimum of maximum similarity between i and any in-
neighbour of j and between j and any in-neighbour of i.

Computing SimRank

A naive solution can be obtained by iteration to a fixed point.

Space complexity is O(n2) and time complexity is *O(kn2d) where k is the number of iterations, n is the
number of vertices and d is the average of product of indegrees of pair of vertices.

Optimisations can be made by setting the similarity between far off nodes as 0 and considering only
nearby nodes for an update.

Different Interpretations
Co-citation Score

The first iteration of SimRank produces results same as co-citation score between a pair of vertices.

Successive iterations improve these initial scores.

Random Surfer-Pairs Model

SimRank s(a, b) can be interpreted as the measure of how soon two random surfers are expected to
meet at the same node if they start at nodes a and b and walk the graph backwards.

Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required
before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.

Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.

Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative
monotonic function) to bound the expected distance to a finite interval.

Common choice for f is f(z) = Cz where C ε (0, 1)

The SimRank score for two nodes, with parameter C, is the expected-f meeting distance travelling back-
edges with f(z) = Cz

Evaluation

Experiments on 2 datasets:

Corpus of scientific research papers from ResearchIndex.

Transcripts of undergrad students at Stanford.

Domain specific properties used to measure similarity and compared with SimRank scores.

Results show improvement over co-citation scores.

You might also like