You are on page 1of 39

How Search Engines Work: A Technology Overview

Avi Rappoport Search Tools Consulting www.searchtools.com consult1@searchtools.com UC Berkeley SIMS class 202 September 16, 2004

Purpose of Search Engines


Helping people find what theyre looking for
Starts with an information need Convert to a query Gets results

In the materials available


Web pages Other formats Deep Web
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search is Not a Panacea


Search cant find whats not there
The content is hugely important

Information Architecture is vital Usable sites have good navigation and structure

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search Looks Simple

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

But It's Not


Index ahead of time
Find files or records Open each one and read it Store each word in a searchable index

Provide search forms


Match the query terms with words in the index Sort documents by relevance

Display results
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search Processing

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search is Mostly Invisible


Like an iceberg, 2/3 below water user interface

content

search functionality

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Text Search vs. Database Query


Text search works for structured content Keyword search vs. SQL queries Approximate vs. exact match Multiple sources of content Response time and database resources Relevance ranking, very important Works in the real world (e.g. EBay)
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search is Only as Good as the Content


Users blame the search engine
Even when the content is unavailable

Understand the scope of site or intranet


Kinds of information Divided sites: products / corporate info Dates Languages Sources and data silos: CMSs, databases... Update processes
9

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Making a Searchable Index


Store text to search it later Many ways to gather text
Crawl (spider) via HTTP Read files on file servers Access databases (HTTP or API) Data silos via local APIs Applications, CMSs, via Web Services

Security and Access Control


UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

10

Robot Indexing Diagram

Sour

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

11

What the Index Needs


Basic information for document or record
File name / URL / record ID Title or equivalent Size, date, MIME type

Full text of item More metadata


Product name, picture ID Category, topic, or subject Other attributes, for relevance ranking and display

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

12

Simple Index Diagram

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

13

More Complex Index Processing

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

14

Index Issues
Stopwords Stemming Metadata
Explicit (tags) Implicit (context)

Semantics
CMS and Database fields XML tags and attributes
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

15

Search Query Processing


What happens after you click the search button, and before retrieval starts. Usually in this order
Handle character set, maybe language Look for operators and organize the query Look for field names or metadata Extract words (just like the indexer) Deal with letter casing

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

16

Search and Retrieval


Retrieval: find files with query terms Not the same as relevance ranking
Recall: find all relevant items Precision: find only relevant items Increasing one decreases the other
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
17

Retrieval = Matching
Single-word queries
Find items containing that word

Multi-word queries: combine lists


Any: every item with any query word All: only items with every word Phrases: find only items with all words in order

Boolean and complex queries


Use algorithm to combine lists
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

18

Why Searches Fail


Empty search Nothing on the site on that topic (scope) Misspelling or typing mistakes Vocabulary differences Restrictive search defaults Restrictive search choices Software failure
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

19

LII.org No-Matches Page

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

20

Relevance Ranking
Theory: sort the matching items, so the most relevant ones appear first Can't really know what the user wants Relevance is hard to define and situational Short queries tend to be deeply ambiguous
What do people mean when they type bank?

First 10 results are the most important The more transparent, the better
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
21

Relevance Processing
Sorting documents on various criteria Start with words matching query terms Citation and link analysis
Like old library Citation Indexes Ted Nelson - not only hypertext, but the links Google PageRank
Incoming links Authority of linkers

Taxonomies and external metadata


UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
22

TF-IDF Ranking Algorithm


Term frequency in the item Inverse document frequency of term
Rare words are likely to be more important
wij = weight of Term Tj in Document Di tfij = frequency of Term Tj in Document Dj N = number of Documents in collection n = number of Documents where term Tj occurs at least once From Salton 1989

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

23

Other Algorithms
Vector space Probabilistic (binary interdependence) Fuzzy set theory Bayesian statistical analysis Latent semantic indexing Neural networks Machine learning All require sophisticated queries See MIR, chapter 2
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting
24

Relevance Heuristics
Heuristics are rules of thumb
Not algorithms, not math

Search Relevance Ranking Heuristics


Documents containing all search words Search words as a phrase Matches in title tag Matches in other metadata

Based on real-word user behavior

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

25

Search Results Interface


What users see after they click the Search button The most visible part of search Elements of the results page
Page layout and navigation Results header List of results items Results footer

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

26

Many Experiments in Interface

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

27

Back to Simplicity

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

28

Search Suggestions (aka Best Bets)


Human judgment beats algorithms Great for frequent, ambiguous searches
Use search log to identify best candidates

Recommend good starting pages


Product information, FAQs, etc.

Requires human resources


That means money and time

More static than algorithmic search


UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

29

MSU Keywords

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

30

Siemens Results

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

31

Cooks.com Results

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

32

Salon.com Results

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

33

Faceted Metadata Search & Browse


Leverage content structure
database fields (i.e. cruise amenities) document metadata (news article bylines)

Provide both search and browse


Support information foraging Integrate navigation with results Not just subject taxonomies Display only fruitful paths, no dead ends

Supported by academic research


Marti Hearst, UCB SIMS, flamenco.berkeley.edu
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

34

Faceted Search: Information

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

35

Faceted Search: Online Catalog

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

36

Search Metrics and Analytics


Metrics
Number of searches Number of no-matches searches Traffic from search to high-value pages Relate search changes to other metrics

Search Log Analysis


Top 5% searches: phrases and words Top no-matches searches
Use as market research
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

37

Search Will Never Be Perfect


Search engines cant read minds
User queries are short and ambiguous

Some things will help


Design a usable interface Show match words in context Keep index current and complete Adjust heuristic weighting Maintain suggestions and synonyms Consider faceted metadata search
38

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search Engines, sorta Rocket Science


Questions and discussion Contact me
consult1@searchtools.com www.searchtools.com

This presentation:
www.searchtools.com/slides/sims/202-04/

UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

39