
Introduction to Information Retrieval

Cosine similarity illustrated


Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels
- SaS: Sense and Sensibility
- PaP: Pride and Prejudice, and
- WH: Wuthering Heights?

Term frequencies (counts):

  term        SaS   PaP    WH
  affection   115    58    20
  jealous      10     7    11
  gossip        2     0     6
  wuthering     0     0    38

Note: To simplify this example, we don't do idf weighting.

3 documents example contd. (Sec. 6.3)

Log frequency weighting:

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

After length normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
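These numbers can be reproduced in a few lines. A minimal sketch in Python (not from the lecture; base-10 logs and the dictionary layout are assumptions):

    import math

    # Raw term counts from the previous slide (order: SaS, PaP, WH).
    counts = {
        "affection": [115, 58, 20],
        "jealous":   [10, 7, 11],
        "gossip":    [2, 0, 6],
        "wuthering": [0, 0, 38],
    }

    def log_tf(tf):
        # Log frequency weighting: 1 + log10(tf) for tf > 0, else 0.
        return 1 + math.log10(tf) if tf > 0 else 0.0

    # Log-weight each document vector, then scale it to unit length.
    vecs = {}
    for i, d in enumerate(["SaS", "PaP", "WH"]):
        v = [log_tf(row[i]) for row in counts.values()]
        norm = math.sqrt(sum(w * w for w in v))
        vecs[d] = [w / norm for w in v]

    def cos(a, b):
        # The vectors are unit length, so cosine is just the dot product.
        return sum(x * y for x, y in zip(vecs[a], vecs[b]))

    print(cos("SaS", "PaP"), cos("SaS", "WH"), cos("PaP", "WH"))
    # ≈ 0.94, 0.79, 0.69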

Computing cosine scores (Sec. 6.3)
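The pseudocode on this slide did not survive extraction; below is a sketch in the spirit of IIR's term-at-a-time CosineScore algorithm. The names postings, weight_q, weight_d, and length are illustrative, not from the slide:

    import heapq

    def cosine_score(query_terms, postings, weight_q, weight_d, length, K=10):
        # postings: term -> list of (doc_id, tf) pairs
        # length:   doc_id -> precomputed document vector length
        scores = {}
        for t in query_terms:
            # Compute the query-side weight once, then walk t's postings
            # list, accumulating w(t,d) * w(t,q) per document.
            w_tq = weight_q(t)
            for d, tf in postings.get(t, []):
                scores[d] = scores.get(d, 0.0) + weight_d(t, tf) * w_tq
        # Length-normalize the accumulated dot products, then return top K.
        for d in scores:
            scores[d] /= length[d]
        return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])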

tf-idf weighting has many variants (Sec. 6.4)

[Table of tf, df, and normalization variants with their SMART letter codes; columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?
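As a concrete reading of three of the most common letter codes, here is a sketch using the standard formulas (the function names are illustrative):

    import math

    def tf_l(tf):
        # tf code 'l' (logarithm): 1 + log10(tf), or 0 if the term is absent.
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def idf_t(N, df):
        # df code 't' (idf): log10(N / df) for a term in df of the N docs.
        return math.log10(N / df)

    def normalize_c(weights):
        # normalization code 'c' (cosine): scale the vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights))
        return [w / norm for w in weights] if norm else weights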

Weighting may differ in queries vs documents (Sec. 6.4)

- Many search engines allow for different weightings for queries vs. documents.
- SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
- A very standard weighting scheme is: lnc.ltc
  - Document: logarithmic tf (l as first character), no idf, and cosine normalization. No idf: a bad idea?
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), and cosine normalization (c in third column).
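Operationally, lnc.ltc amounts to two weighting functions. A sketch under the usual base-10 log convention (the function names lnc and ltc are just for illustration):

    import math

    def lnc(tf_counts):
        # Document side: logarithmic tf, no idf, cosine normalization.
        w = {t: 1 + math.log10(tf) for t, tf in tf_counts.items() if tf > 0}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    def ltc(tf_counts, df, N):
        # Query side: logarithmic tf times idf, then cosine normalization.
        w = {t: (1 + math.log10(tf)) * math.log10(N / df[t])
             for t, tf in tf_counts.items() if tf > 0}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

The score of a document for a query is then the dot product of the two resulting unit vectors, as the next slide works through.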

tf-idf example: lnc.ltc (Sec. 6.4)

Document: car insurance auto insurance
Query: best car insurance

            |            Query                      |       Document            |
  Term      | tf-raw  tf-wt  df     idf  wt   n'lize | tf-raw  tf-wt  wt   n'lize | Prod
  auto      | 0       0      5000   2.3  0    0      | 1       1      1    0.52   | 0
  best      | 1       1      50000  1.3  1.3  0.34   | 0       0      0    0      | 0
  car       | 1       1      10000  2.0  2.0  0.52   | 1       1      1    0.52   | 0.27
  insurance | 1       1      1000   3.0  3.0  0.78   | 2       1.3    1.3  0.68   | 0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 0.27 + 0.53 = 0.8
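A sketch that reproduces the table's final score. N is not given on the slide; N = 1,000,000 is assumed here because it matches the idf column (and so is one answer to the exercise):

    import math

    N = 1_000_000   # assumed: log10(1_000_000 / 5000) = 2.3, as in the idf column
    df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

    query = {"best": 1, "car": 1, "insurance": 1}
    doc   = {"auto": 1, "car": 1, "insurance": 2}   # "car insurance auto insurance"

    def log_tf(tf):
        return 1 + math.log10(tf) if tf > 0 else 0.0

    # ltc query weights: log tf times idf, then cosine-normalize.
    q_w = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query.items()}
    q_len = math.sqrt(sum(w * w for w in q_w.values()))

    # lnc document weights: log tf only, then cosine-normalize.
    d_w = {t: log_tf(tf) for t, tf in doc.items()}
    d_len = math.sqrt(sum(w * w for w in d_w.values()))   # ≈ 1.92, the doc length

    score = sum((q_w[t] / q_len) * (d_w[t] / d_len) for t in q_w if t in d_w)
    print(round(score, 2))   # 0.8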

Summary – vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user
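The whole pipeline fits in a short sketch. Whitespace tokenization, ltc-style weights on both sides, and the toy documents are all assumptions for illustration:

    import math
    from collections import Counter

    def rank(query, docs, K=10):
        # docs: {doc_id: text}. Build tf-idf vectors, cosine-score, return top K.
        N = len(docs)
        doc_tfs = {d: Counter(text.split()) for d, text in docs.items()}
        df = Counter(t for tf in doc_tfs.values() for t in tf)

        def unit_tfidf(tf_counts):
            # (1 + log10 tf) * log10(N / df), then cosine-normalized.
            w = {t: (1 + math.log10(c)) * math.log10(N / df[t])
                 for t, c in tf_counts.items() if df.get(t)}
            norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
            return {t: x / norm for t, x in w.items()}

        q = unit_tfidf(Counter(query.split()))
        scores = {}
        for d, tf in doc_tfs.items():
            v = unit_tfidf(tf)
            scores[d] = sum(q[t] * v.get(t, 0.0) for t in q)
        return sorted(scores.items(), key=lambda kv: -kv[1])[:K]

    print(rank("best car insurance",
               {"d1": "car insurance auto insurance", "d2": "home insurance"}))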

Resources for today's lecture (Ch. 6)

- IIR 6.2 – 6.4.3
- http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
  - Term weighting and cosine similarity tutorial for SEO folk!