
Reconsidering Relevance

Daniel Tunkelang
Chief Scientist, Endeca

© 2009 Endeca Technologies, Inc. All rights reserved.


howdy!

• 1988 – 1992

• 1993 – 1998

• 1999 –



overview

what is relevance?

what’s wrong with relevance?

what are the alternatives?



but first let’s set the stage



iconic businesses of the 20th and 21st centuries

I’m Feeling Lucky



process and scale orchestration



but there’s a dark side



users are satisfied



an interesting contrast

“Search on the internet is solved.
I always find what I need.
But why not in the enterprise?
Seems like a solution waiting to
happen.”

- a Fortune 500 CTO



the real questions

• What is “search on the internet” and why is it
perceived as a solved problem?

• What is “search in the enterprise” and why is it
perceived as an unsolved problem?

• And what does this have to do with relevance?



easy vs. hard search problems

• easy
where to buy Ender in Exile?

• hard
good novel to read on the beach?

• easy
proof that sorting has n log n lower bound?

• hard
algorithm to sort partially ordered set, given a
constant-time comparator?



what is relevance?

what’s wrong with relevance?

what are the alternatives?



defining relevance

Relevance is defined as a measure of information
conveyed by a document relative to a query.

It is shown that the relationship between the
document and the query, though necessary, is
not sufficient to determine relevance.

William Goffman, On relevance as a measure, 1964.



we need more definitions



let’s work top-down

• information retrieval (IR) =

study of retrieval of information (not data) from
collection of written documents

retrieved documents aim at satisfying user
information need



IR assumes information needs

• user information need =

natural language declaration of informational
need of user

• query =

expression of user information need in input
language provided by information system



relevance drives IR modeling

• modeling =

study of algorithms used for ranking documents
according to system-assigned likelihood of
relevance

• model =

a set of premises and an algorithm for ranking
documents with regard to a user query



a relevance-centric approach

SYSTEM:

rank using IR model (tf-idf, PageRank)

USER:

information need → query → select from results
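
As a rough illustration of the "rank using IR model" step, here is a minimal tf-idf ranking sketch. The whitespace tokenizer, the weighting variant (term frequency normalized by document length, times log(N / df)), the function names, and the three sample documents are all assumptions made for this example, not the formulation used by any particular engine.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def rank_tf_idf(query, docs):
    """Return (score, doc_index) pairs, highest tf-idf score first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, i))

    # Sort by descending score, breaking ties by document order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical mini-collection, made up for illustration.
docs = [
    "information retrieval ranks documents",
    "enterprise search is hard",
    "faceted search supports exploration",
]
ranking = rank_tf_idf("faceted search", docs)
```

Note that the model only ever sees the query string: whatever the user failed to express in those terms never reaches the ranking.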



what is relevance?

what’s wrong with relevance?

what are the alternatives?



our first communication problem

information need → query

• 2 words?
• natural language?
• telepathy?



and the game of telephone continues

query → rank using IR model

• cumulative error
• relevance is subjective
• what Goffman said



and hopefully users feel lucky

rank using IR model → select from results

• selection bias
• inefficient channel
• backup plan?



queries are misinterpreted

Results 1-10 out of about 344,000,000 for ir



ranked lists are inefficient



assumptions of relevance-centric approach

• self-awareness

• self-expression

• model knows best

• answer is a document

• one-shot query



can we do better?



what is relevance?

what’s wrong with relevance?

what are the alternatives?



human-computer information retrieval

“Toward Human-Computer
Information Retrieval”

Gary Marchionini

• don’t just guess the user’s intent
– optimize communication

• increase user responsibility and control
– require and reward human intellectual effort



human-computer information retrieval



a concrete use case

• Colleague:

Hey Daniel! You should check out what this guy
Steve Pollitt’s been researching. Sounds right up
your alley.

• Daniel:

Sure thing, I’ll look into it.



google him!



google scholar him?



rexa him?



getting better



hcir-inspired interface



tags provide summarization and guidance



my information need evolves as i learn



hcir – implementing the vision



scatter/gather: a search for “star”



faceted search



practical considerations

• which facets to show

• which facet values to show

• when to suggest faceted refinement

• how to automate faceted classification
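
One possible heuristic for "which facets to show" is to prefer facets whose values split the current result set most evenly, measured by entropy. This is a sketch under that assumption only; it is not presented as the method any product uses, and the function names and the microwave records below are hypothetical.

```python
import math
from collections import Counter


def facet_entropy(results, facet):
    """Shannon entropy (bits) of a facet's value distribution over results."""
    counts = Counter(r[facet] for r in results if facet in r)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_facets(results):
    """Facets ordered by how well they split the result set.

    Facets with a single value (entropy 0) cannot narrow the set
    and are dropped.
    """
    facets = {key for r in results for key in r}
    scored = [(facet_entropy(results, f), f) for f in facets]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [f for h, f in ordered if h > 0]


# Hypothetical result set for a query such as "microwaves".
results = [
    {"type": "microwave", "wattage": "700W", "color": "black"},
    {"type": "microwave", "wattage": "1000W", "color": "black"},
    {"type": "microwave", "wattage": "1200W", "color": "white"},
    {"type": "microwave", "wattage": "1000W", "color": "black"},
]
facets_to_show = rank_facets(results)
```

Here "type" is dropped because every result shares the same value, so offering it as a refinement would tell the user nothing. A real system would likely also weigh business rules and how familiar each facet is to users.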



showing the right facets: microwaves



showing the right facets: ceiling fans



query-driven clarification before refinement

Matching Categories include:

Appliances > Small Appliances > Irons & Steamers
Appliances > Small Appliances > Microwaves & Steamers
Bath > Sauna & Spas > Steamers
Kitchen > Bakeware & Cookware > Cookware > Open Stock Pots > Double Boilers & Steamers
Kitchen > Small Appliances > Steamers
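
A minimal sketch of query-driven clarification: before showing any results, match the query against category names and let the user pick the sense they meant. The query "steamers" is an assumption (the slide does not state it), as are the function name, the leaf-only matching rule, and the extra non-matching category added for contrast.

```python
CATEGORIES = [
    "Appliances > Small Appliances > Irons & Steamers",
    "Appliances > Small Appliances > Microwaves & Steamers",
    "Bath > Sauna & Spas > Steamers",
    "Kitchen > Bakeware & Cookware > Cookware > Open Stock Pots > Double Boilers & Steamers",
    "Kitchen > Small Appliances > Steamers",
    "Kitchen > Small Appliances > Toasters",  # hypothetical, added as a non-match
]


def matching_categories(query, category_paths):
    """Return category paths whose leaf name contains the query as a word."""
    term = query.lower()
    matches = []
    for path in category_paths:
        leaf = path.split(" > ")[-1]
        # Treat "&" as a separator so "Irons & Steamers" yields two words.
        words = leaf.lower().replace("&", " ").split()
        if term in words:
            matches.append(path)
    return matches


options = matching_categories("steamers", CATEGORIES)
```

The system does not guess which kind of steamer was meant. It returns every matching category and leaves the choice to the user.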



results-driven clarification before refinement

Search: storage



crowd-sourcing to tag documents



hcir cheats the precision / recall trade-off

precision

recall


set retrieval 2.0

• set retrieval that responds to queries with
– overview of the user's current context
– organized set of options for exploration

• contextual summaries of document sets
– optimize system’s communication with user

• query refinement options
– optimize user’s communication with system
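
The idea above can be sketched as a response that carries a summary of the matching set plus the options for narrowing it, instead of a ranked list. This is a toy illustration under stated assumptions: the function names (`summarize`, `refine`, `respond`), the record fields, and the sample papers are all hypothetical.

```python
from collections import Counter


def summarize(docs, facets):
    """Contextual summary: value counts for each facet over the current set."""
    return {f: Counter(d[f] for d in docs if f in d) for f in facets}


def refine(docs, facet, value):
    """Query refinement: keep only documents with the chosen facet value."""
    return [d for d in docs if d.get(facet) == value]


def respond(docs, facets):
    """Answer with an overview of the set rather than a ranked list."""
    return {"count": len(docs), "summary": summarize(docs, facets)}


# Hypothetical document set, made up for illustration.
papers = [
    {"title": "paper A", "topic": "faceted search", "year": 2006},
    {"title": "paper B", "topic": "faceted search", "year": 2008},
    {"title": "paper C", "topic": "clustering", "year": 2008},
]

overview = respond(papers, ["topic", "year"])
narrowed = refine(papers, "topic", "faceted search")
next_overview = respond(narrowed, ["topic", "year"])
```

Each refinement produces a new set and a new summary, so the exchange becomes a dialog: the summary is the system talking to the user, and the chosen refinement is the user talking back.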



hcir using set retrieval 2.0

emphasize set summaries over ranked lists

establish a dialog between the user and the data

enable exploration and discovery



think outside the (search) box

• relevance-centric search solves many use cases

• but not some of the most valuable ones

• support interaction, exploration

• human-computer information retrieval



one more thing


“Google's mission is to organize the
world's information and make it
universally accessible and useful.”



organizer or referee?



thank you

communication 1.0
email: dt@endeca.com

communication 2.0
blog: http://thenoisychannel.com
twitter: http://twitter.com/dtunkelang
