Can you explain the difference between the Jaccard similarity coefficient and the pointwise mutual information (PMI) measure? It would be great if you could add a few examples.

1 Answer

These two are quite different. Still, let us try to "bring them to a common denominator" to see the difference. Both Jaccard and PMI can be extended to the continuous-data case, but we'll look at the basic binary-data case.

                 Y
               1   0
             ---------
          1  | a | b |
      X      ---------
          0  | c | d |
             ---------

a = number of cases on which both X and Y are 1

b = number of cases where X is 1 and Y is 0

c = number of cases where X is 0 and Y is 1

d = number of cases where X and Y are 0

a+b+c+d = n, the number of cases.
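
As a minimal sketch of how the four cells can be counted from two binary vectors (the function name `contingency_counts` and the sample data are invented for illustration, not taken from the answer):

```python
def contingency_counts(x, y):
    """Return (a, b, c, d) for two equal-length sequences of 0/1 values."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    pairs = list(zip(x, y))
    a = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)  # both 1
    b = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)  # X is 1, Y is 0
    c = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)  # X is 0, Y is 1
    d = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)  # both 0
    return a, b, c, d


# Made-up example data (8 cases)
x = [1, 1, 1, 0, 0, 1, 0, 0]
y = [1, 0, 1, 1, 0, 1, 0, 0]
a, b, c, d = contingency_counts(x, y)
print(a, b, c, d)  # the four cells; they sum to n = len(x)
```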

We know that

$$\text{Jaccard}[X, Y] = \frac{a}{a+b+c}.$$
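
A small sketch of the Jaccard coefficient computed two ways: from the counts a, b, c, and from the sets of positions where each vector equals 1 (intersection size over union size). The function names and the data are hypothetical and serve only to show that the two forms agree:

```python
def jaccard_from_counts(a, b, c):
    """Jaccard = a / (a + b + c); joint absences (d) play no role."""
    denom = a + b + c
    if denom == 0:
        raise ValueError("Jaccard is undefined when a + b + c == 0")
    return a / denom


def jaccard_from_sets(s, t):
    """Jaccard = |s & t| / |s | t| for two sets."""
    union = s | t
    if not union:
        raise ValueError("Jaccard is undefined for two empty sets")
    return len(s & t) / len(union)


# Made-up binary vectors
x = [1, 1, 1, 0, 0, 1, 0, 0]
y = [1, 0, 1, 1, 0, 1, 0, 0]

# Counts taken directly from the vectors
a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)

# Set view: indices where each vector is 1
sx = {i for i, v in enumerate(x) if v == 1}
sy = {i for i, v in enumerate(y) if v == 1}

print(jaccard_from_counts(a, b, c), jaccard_from_sets(sx, sy))
```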

PMI, by the Wikipedia definition, is

$$\text{PMI}[X, Y] = \log \frac{P(X,Y)}{P(X)P(Y)}.$$

Let us first drop the "log", because Jaccard involves no logarithm. Then plug the a, b, c, d notation into the PMI formula to obtain:

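$$\frac{P(X,Y)}{P(X)P(Y)} = \frac{a/n}{\frac{a+b}{n}\cdot\frac{a+c}{n}} = \frac{an}{(a+b)(a+c)} = \frac{\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}}{\sqrt{\frac{a+b}{n}\cdot\frac{a+c}{n}}} = \frac{\text{Ochiai}[X,Y]}{\text{gm}[P(X),P(Y)]}$$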

where "gm" is the geometric mean of the two probabilities, and the Ochiai similarity between the X and Y vectors is just another name for cosine similarity in the case of binary data:

$$\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}.$$
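
The identity can be checked numerically. The following is a sketch under the assumption that the probabilities are estimated as plain relative frequencies from the 2x2 table; the function names (`prob_ratio`, `ochiai`, `gm_marginals`) and the counts are invented for illustration:

```python
import math


def prob_ratio(a, b, c, d):
    """P(X,Y) / (P(X) P(Y)), i.e. PMI before taking the logarithm."""
    n = a + b + c + d
    p_xy = a / n        # both X and Y are 1
    p_x = (a + b) / n   # X is 1
    p_y = (a + c) / n   # Y is 1
    return p_xy / (p_x * p_y)


def ochiai(a, b, c):
    """Ochiai (binary cosine) similarity: a / sqrt((a+b)(a+c))."""
    return a / math.sqrt((a + b) * (a + c))


def gm_marginals(a, b, c, d):
    """Geometric mean of the marginal probabilities P(X) and P(Y)."""
    n = a + b + c + d
    return math.sqrt(((a + b) / n) * ((a + c) / n))


# Hypothetical counts; a must be positive for the logarithm to be finite
a, b, c, d = 3, 1, 1, 3

ratio = prob_ratio(a, b, c, d)
via_ochiai = ochiai(a, b, c) / gm_marginals(a, b, c, d)
pmi = math.log(ratio)  # natural log; other bases only rescale PMI

print(ratio, via_ochiai, pmi)
```

The two routes agree up to floating-point rounding for any table with a + b > 0 and a + c > 0.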

So you can see that PMI (without the logarithm) is the Ochiai coefficient further "normalized" (or, I'd say, de-normalized) by the overall probability of positive (eventful) data in the two variables, that is, by the geometric mean of P(X) and P(Y).

But Jaccard and Ochiai are comparable. Both are association measures ranging from 0 to 1. They differ in the emphasis they put on a potential discrepancy between the frequencies b and c. I've described this in the answer that "Ochiai" above links to. To cite:

Because product (seen in Ochiai) increases weaker than sum (seen in Jaccard) when only one of the terms grows, Ochiai will be really high only if both of the two proportions (probabilities) are high, which implies that to be considered similar by Ochiai the two vectors must share the great shares of their attributes/elements. In short, Ochiai curbs similarity if b and c are unequal. Jaccard does not.
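
One way to probe how each coefficient reacts to a discrepancy between b and c is to hold a and b + c fixed and vary only the split. The sketch below uses invented counts and simply prints the two proportions a/(a+b) and a/(a+c) alongside both coefficients for inspection; the function names are hypothetical:

```python
import math


def jaccard(a, b, c):
    """Jaccard = a / (a + b + c)."""
    return a / (a + b + c)


def ochiai(a, b, c):
    """Ochiai = sqrt(a/(a+b) * a/(a+c)), the geometric mean of the two proportions."""
    return math.sqrt((a / (a + b)) * (a / (a + c)))


# Hypothetical tables: same a, same b + c, different split between b and c
tables = [(10, 10, 10), (10, 15, 5), (10, 20, 0)]

for a, b, c in tables:
    p1 = a / (a + b)  # share of X's positives that Y also has
    p2 = a / (a + c)  # share of Y's positives that X also has
    print(
        f"a={a:2d} b={b:2d} c={c:2d}  "
        f"a/(a+b)={p1:.3f}  a/(a+c)={p2:.3f}  "
        f"Jaccard={jaccard(a, b, c):.3f}  Ochiai={ochiai(a, b, c):.3f}"
    )
```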

