
Search Engines

Sudarsun Santhiappan., M.Tech.,


Director – R & D,
Burning Glass Technologies
Today's Coverage

Introduction
Types of Search Engines
Components of a Search Engine
Semantics and Relevancy
Search Engine Optimization

Copyleft (ɔ) 2009 Sudarsun Santhiappan 2


What is a Search Engine ?
What is a Search ?
Why do we need a Search Engine ?
What are we searching against ?
How good is a Search Engine ?
What is Search on Search (Meta SE) ?
Have you compared Search Engines Side-by-Side ?
How are Images and Videos searched ?
Apart from Web Search, what else ?
Introduction

A Web Search Engine is a software program that
searches the Internet (bunch of websites) based
on the words that you designate as search terms
(query words).
Search engines look through their own databases of
information in order to find what it is that you are
looking for.
Web Search Engines are a good example for
massively sized Information Retrieval Systems.
Have you tried the “Similar pages” link in a Google result set ?
Dictionary Definitions
Search
COMPUTING (transitive verb) to examine a computer file,
disk, database, or network for particular information

Engine
something that supplies the driving force or energy to a
movement, system, or trend

Search Engine
a computer program that searches for particular keywords
and returns a list of documents in which they were found,
especially a commercial service that scans documents on
the Internet



About definition of search
engines
oh well … search engines do not
search only for keywords, some search
for other stuff as well
and they are really not “engines” in the
classical sense
but then mouse is not a “mouse”



use of search engines
… among others



Types of Search Engines

Text Search Engines


General: AltaVista, AskJeeves, Bing, Google
Specialized: Google Scholar, Scirus, Citeseer
Intranet vs Internet Search Engines
Image Search Engines
How can we search on the Image content ?
Video Search Engines
Image Search with Time dimension !!
Types of Search Engine

Crawler Powered Indexes


Guruji.com, Google.com
Human Powered Indexes
www.dmoz.org
Hybrid Models
Submitted URLs to a search engine ?
Semantic Indexes
Hakia.com
Have you tried Hakia ?

What is Semantic Search ?


How's it different from Keyword Search?
What is categorized search ?
Side-by-Side comparison with Google!!
Have you compared Bing with Google ?



Directories

www.dmoz.org
Website classified into a Taxonomy
Website are categorically arranged
Searching vs Navigation
Instead of Query, you Click and navigate
Accurate search always! (if data is
available)
Problem: Mostly manually created
How does a Search Engine work ?



How Search Engines Work
(Sherman 2003)

[Figure: the Crawler fetches pages (URL1 to URL4) from The Web; the Indexer builds the Database; the Search Engine answers the query “Eggs?” sent from Your Browser with a ranked result list (matches scored 90%, 81%, 40%, 10%).]


how do search engines work?
elaboration
crawlers, spiders: go out to find
content
in various ways go through the web looking
for new & changed sites
periodic, not for each query
no search engine works in real time
some search engines do it for themselves,
others not
buy content from companies such as Inktomi
for a number of reasons crawlers do not
cover all of the web – just a fraction
what is not covered is “invisible web” ?
Elaboration …
organizing content: labeling, arranging
indexing for searching – automatic
keywords and other fields
arranging by URL popularity - PageRank as Google
classifying as directory
mostly human handpicked & classified
as a result of different organization we
have basically two kinds of search engines:
search – input is a query that is searched & displayed
directory – classified content – a class is displayed
and fused: directories have search capabilities & vice versa



Elaboration (cont.)
databases, caches: storing content
humongous files usually distributed over many computers

query processor: searching, retrieval, display
takes your query as input
engines have differing rules on how queries are handled
displays ranked output
some engines also cluster output and provide visualization
some engines provide categorically structured results

at the other end is your browser



Similarities & Differences
All search engines have these basic parts in
common
BUT the actual processes – methods how
they do it – are based on various algorithms
and they significantly differ
most are proprietary (patented) with details kept
mostly secret (or protected) but based on well
known principles from information retrieval or
classification
to some extent Google is an exception – they
published their method
Google Search
In the beginning it ran on Stanford
computers
Basic approach has been described in their
famous paper
“The Anatomy of a Large-Scale Hypertextual
Web Search Engine”
well written, simple language, has their pictures
in acknowledgement they cite the support by NSF’s Digital Library
Initiative i.e. initially, Google came out of government sponsored
research
describe their method PageRank - based on ranking hyperlinks as in
citation indexing
“We chose our system name, Google, because it is a common spelling
of googol, or 10^100 …”



coverage differences
no engine covers more than a fraction of
WWW
estimates: none more than 16%
hard (even impossible) to discern & compare coverage, but they differ
substantially in what they cover

in addition:
many national search engines
own coverage, orientation, governance
many specialized or domain search engines
own coverage geared to subject of interest
many comprehensive sources independent of search engines
some have compilations of evaluated web sources
searching differences
substantial differences among search
engines on searching, retrieval, display
need to know how they work & differ in respect to
defaults in searching a query
searching of phrases, case sensitivity, categories
searching of different fields, formats, types of resources
advanced search capabilities and features
possibilities for refinement, using relevance feedback
display options
personalization options



Limitations
every search engine has limitations as to
coverage: meta engines just inherit coverage
limitations & add more of their own
search capabilities
finding quality information
some have compromised search with
economics
becoming little more than advertisers
but search engines are also many times
victims of spamdexing
affecting what is included and how ranked
Spamming a search
engine
use of techniques that push rankings
higher than they belong is also called
spamdexing
methods typically include textual as well as link-
based techniques
like e-mail spam, search engine spam is a form
of adversarial information retrieval
the conflicting goals of accurate results for search
providers & high positioning for content providers
Meta Search Engines
Search on Search



Meta search engines
meta engines search multiple engines
getting combined results from a variety of
engines
do not have their own databases
but have their own business models affecting
results
a number of techniques used
interesting ones: clustering, statistical analysis



Some Meta engines - with organized
results
Dogpile : results from a number of
leading search engines; gives source, so
overlap can be compared; (has also a
(bad) joke of the day)
Surfwax : gives statistics and text
sources & linking to sources; for some
terms gives related terms to focus
Teoma : results with suggestions for
narrowing; links resources derived;
originated at Rutgers
Turbo10 : provides results in clusters;
engines searched can be edited


Some Meta Engines (cont.)
Large directory
Complete Planet
directory of over 70,000 databases & specialty engines

Results with graphical displays


Vivisimo clusters results; innovative

Webbrain results in tree structure – fun to use

Kartoo results in display by topics of query



Domain Specific Search Engines



Domain Search Engines &
Catalogs
cover specific subjects & topics
important tool for subject searches
particularly for subject specialist
valued by professional searchers
selection mostly hand-picked rather
than by crawlers, following inclusion
criteria
often not readily discernible
but content more trustworthy
Domain Search Engines …
Open Directory Project
large edited catalog of the web – global, run by volunteers
BUBL LINK
selected Internet resources covering all academic subject
areas; organized by Dewey Decimal System – from UK
Profusion
search in categories for resources &
search engines
Resource Discovery Network – UK
“UK's free national gateway to Internet
resources for the learning, teaching
and research community”
Domain Engines … sample

Think Quest – Oracle Education Foundation


education resources, programs; web sites created by students
All Music Guide
resource about musicians, albums, and songs
Internet Movie Database
treasure trove of American and British movies
Genealogy links and surname search engines
well.. that is getting really specialized (and popular)
Daypop
searches the “living web” “The living web is composed of sites that
update on a daily basis: newspapers, online magazines, and weblogs”



Science, scholarship engines …
sample
Psychcrawler - Amer Psychological Association
web index for psychology
Entrez PubMed – Nat Library of Medicine
biomedical literature from MEDLINE & health journals
CiteSeer - NEC Research Center
scientific literature, citations index; strong in computer science
Scholar Google
searches for scholarly articles & resources
Infomine
scholarly internet research collections
Scirus
scientific information in journals & on the web



Science, scholarship engines …
sample commercial access
in addition to freely accessible engines
many provide search free but access to full
text paid
by subscription or per item
RUL provides access to these & many more:
ScienceDirect Elsevier: “world's largest electronic collection of science,
technology and medicine full text and bibliographic information”
ACM Portal Association for Computing Machinery: access to ACM Digital
Library & Guide to Computing



Search Engine Internals



Search Engine Internals

Crawlers
Indexers
Searching
Semantics
Ranking



Standard Web Search Engine Architecture
[Figure: the crawler crawls the web; fetched documents are checked for duplicates and stored under DocIds; an inverted index is created; the search engine servers take the user query, look it up in the inverted index, and show results to the user.]
Typical Search Engine



Crawlers

What is Crawling ?
How does Crawling happen ?
Have you tried “wget -r <url>” in Linux ?
Have you tried “DAP” to download entire
site?
Page Walk
Spidering & Crawlbots
Spidering the Web

Replicating the Spider's behavior of
building the Internet (web) by adding
spirals (sites)
But, can the web be fully crawled ?
By the time one round of indexing is over,
the page might have changed already!
That's why we have cached page link in
the search result!
Crawler Bots
How to make your website Crawlable ?
White-listing and Black-listing!
Meta Tags to control the Bots
Can HTTPS pages be crawled ?
Are Sessions maintained while crawling ?
Can dynamic pages be crawled ?
URL normalization
cool.com?page=2 [crawler unfriendly]
cool.com/page/2 [normalized and crawler friendly]
How to control Robots ?
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>...</TITLE>
</HEAD>
<BODY>

Index: This tells the spider/bot that it’s OK to index this page.
Noindex: The spider/bot sees this and doesn’t index any of the content on this page.
Follow: This lets the spider/bot know that it’s OK to travel down links found on this page.
Nofollow: This tells the spider/bot not to follow any of the links on this page.
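
As a rough illustration of how a bot might honour these directives, here is a minimal sketch using Python's standard `html.parser`. The class and function names (`RobotsMetaParser`, `robots_policy`) are made up for this example, and real crawlers handle more directives and per-bot meta names than shown here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def robots_policy(html):
    """Return (may_index, may_follow); both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


page = ('<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">'
        '<TITLE>...</TITLE></HEAD><BODY></BODY></HTML>')
print(robots_policy(page))
```

For the page above, both flags come back false, so a well-behaved bot would neither index the page nor follow its links.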



Crawling – Process Flow



Data Structures
Tree primarily while Crawling
Both Depth-First-Search and Breadth-First-
Search are used
Every page that the crawler visits shall be
added as a node to the Tree
Fan-out information is represented as
Children for a node (page).
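
The tree-building idea can be sketched as follows. This is only an illustration under stated assumptions: the `LINKS` dictionary is a hypothetical in-memory stand-in for pages fetched over HTTP, and only the breadth-first variant is shown.

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: page -> outgoing links.
LINKS = {
    "home": ["about", "products"],
    "about": ["home", "team"],
    "products": ["home", "item1", "item2"],
    "team": [],
    "item1": ["products"],
    "item2": [],
}


def crawl_bfs(seed, links):
    """Breadth-first crawl; returns the crawl tree as page -> list of children."""
    tree = {seed: []}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in tree:        # first visit: becomes a child of `page`
                tree[target] = []
                tree[page].append(target)
                queue.append(target)
    return tree


for page, children in crawl_bfs("home", LINKS).items():
    print(page, "->", children)
```

Links back to already visited pages (for example `about -> home`) are skipped, which is what turns the link graph into a tree. Replacing `popleft()` with `pop()` would give a depth-first order instead.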



Inverted Indexes the IR Way



How Inverted Files Are Created

Periodically rebuilt, static otherwise.
Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2
How Inverted Files are Created

After all documents have been parsed the inverted file is sorted alphabetically.

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2
How Inverted Files are Created

Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are
Created
Finally, the file can be split into
A Dictionary or Lexicon file
and
A Postings file
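
The whole pipeline (parse, sort, merge, split) can be sketched in a few lines. This is a minimal in-memory illustration using the two example documents, not how a production engine stores its files; the function name `build_index` and the simple `[a-z]+` tokenizer are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_index(docs):
    """Return (dictionary, postings).

    dictionary: term -> (number of docs, total frequency)
    postings:   term -> list of (doc id, within-document frequency)
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Parse the document into lowercase tokens and count each term.
        counts = Counter(re.findall(r"[a-z]+", docs[doc_id].lower()))
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in postings.items()}
    return dictionary, dict(postings)


dictionary, postings = build_index(DOCS)
for term in sorted(dictionary):          # alphabetical, as in the lexicon
    print(term, dictionary[term], postings[term])
```

Counting per document does the merge step directly, so the intermediate list with one row per occurrence never has to be materialised.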



How Inverted Files are
Created
Dictionary/Lexicon (Term, N docs, Tot Freq), each entry pointing into the Postings (Doc #, Freq)

Term      N docs  Tot Freq  Postings (Doc #, Freq)
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1) (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2) (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1) (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)
Inverted indexes
Permit fast search for individual terms
For each term, you get a list consisting of:
document ID
frequency of term in doc (optional)
position of term in doc (optional)
These lists can be used to solve Boolean queries:
country -> d1, d2
manor -> d2
country AND manor -> d2
Also used for statistical ranking algorithms
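
A Boolean AND over two posting lists is a merge-style intersection of sorted document IDs. The sketch below reproduces the `country AND manor` example; the `index` dictionary is a hand-written stand-in for real posting lists.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Posting lists from the example above (document IDs only).
index = {"country": [1, 2], "manor": [2]}
print(intersect(index["country"], index["manor"]))   # [2]
```

Because both lists are sorted, the intersection needs only one pass over each list.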



Inverted Indexes for Web
Search Engines
Inverted indexes are still used, even
though the web is so huge.
Some systems partition the indexes
across different machines. Each
machine handles different parts of the
data.
Other systems duplicate the data across
many machines; queries are distributed
among the machines.
Most do a combination of these.
[Figure: the index is partitioned across machines.] Additionally, each partition is allocated multiple machines to handle the queries.

From description of the FAST search engine, by Knut Risvik


Cascading Allocation of CPUs

A variation on this that produces a cost saving:
Put high-quality/common pages on many
machines
Put lower quality/less common pages on fewer
machines
Query goes to high quality machines first
If no hits found there, go to other machines



The Search Process



Searching – Process Flow



Google Query Evaluation
1. Parse the query.
2. Convert words to WordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
Queries
Search engines are one tool used to answer info needs
Users express their information needs as queries
Usually informally expressed as two or three words (we call
this a ranked query)
A recent study showed the mean query length was 2.4
words per query with a median of 2
Around 48.4% of users submit just one query in a session,
20.8% submit two, and about 31% submit three or more
Less than 5% of queries use Boolean operators (AND, OR,
and NOT), and around 5% contain quoted phrases



Queries...
About 1.28 million different words were used in queries in
the Excite log studied (which contained 1.03 million
queries)
Around 75 words account for 9% of all words used in
queries. The top-ten non-trivial words occurring in 531,000
queries are “sex” (10,757), “free” (9,710), “nude” (7,047),
“pictures” (5,939), “university” (4,383), “pics” (3,815),
“chat” (3,515), “adult” (3,385), “women” (3,211), and “new”
(3,109)
16.9% of the queries were about entertainment, 16.8%
about sex, pornography, or preferences, and 13.3%
concerned commerce, travel, employment, and the
economy
Answers
What is a good answer to a query?
One that is relevant to the user’s information need!
Search engines typically return ten answers-per-page, where
each answer is a short summary of a web document
Likely relevance to an information need is approximated by
statistical similarity between web documents and the
query
Users favour search engines that have high precision, that
is, those that return relevant answers in the first page of
results



Approximating Relevance
Statistical similarity is used to estimate the relevance of a
query to an answer
Consider the query “Richardson Richmond Football”
A good answer contains all three words, and the more
frequently the better; we call this term frequency (TF)
Some query terms are more important—have better
discriminating power—than others. For example, an
answer containing only “Richardson” is likely to be better
than an answer containing only “Football”; we call this
inverse document frequency (IDF)
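
To make TF and IDF concrete, here is a minimal scoring sketch. The three documents are invented for illustration, and the particular weighting (`tf * log(1 + N/df)`) is just one common variant, not necessarily the formula any given engine uses.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are made up for illustration.
DOCS = {
    "d1": "richardson kicks for richmond football club",
    "d2": "football football football news",
    "d3": "richmond football results",
}


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of TF * IDF."""
    n_docs = len(docs)
    counts = {doc: Counter(text.split()) for doc, text in docs.items()}
    scores = {}
    for doc, terms in counts.items():
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue                       # term occurs nowhere
            idf = math.log(1 + n_docs / df)    # rarer terms weigh more
            score += terms[term] * idf         # more occurrences weigh more
        scores[doc] = score
    return scores


scores = tf_idf_scores("Richardson Richmond Football", DOCS)
for doc in sorted(scores, key=scores.get, reverse=True):
    print(doc, round(scores[doc], 3))
```

In this toy collection `d1`, which contains all three words, ranks first. `d2` repeats the common word "football" three times and still scores below it, because the rare word "Richardson" carries the largest IDF.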



Ranking
To improve the accuracy of search engines:
Google Inc. uses its patented PageRank(tm) technology.
A link from an authoritative source to a page ranks that
page higher
Relevance feedback is a technique that adds words to a
query based on a user selecting a “more like this” option
Query expansion adds words to a query using thesaural or
other techniques
Searching within categories or groups to narrow a search



Resolving Queries
Queries are resolved using the inverted index
Consider the example query “Cat Mat Hat”. This is evaluated
as follows:
Select a word from the query (say, “Cat”)
Retrieve the inverted list from disk for the word
Process the list. For each document the word occurs in, add weight to
an accumulator for that document based on the TF, IDF, and
document length
Repeat for each word in the query
Find the best-ranked documents with the highest weights
Lookup the document in the mapping table
Retrieve and summarize the docs, and present to the user
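
The accumulator steps above can be sketched as follows. Everything in the data section (`INVERTED`, `DOC_LENGTH`, `DOC_URL`) is a hypothetical in-memory stand-in for the on-disk index and mapping table, and the weighting formula is a simple illustrative choice.

```python
import heapq
import math
from collections import defaultdict

# Hypothetical toy index: term -> list of (doc id, term frequency).
INVERTED = {
    "cat": [(1, 2), (2, 1)],
    "mat": [(1, 1), (3, 1)],
    "hat": [(1, 1), (2, 3)],
}
DOC_LENGTH = {1: 10, 2: 40, 3: 5}                        # tokens per document
DOC_URL = {1: "doc-one", 2: "doc-two", 3: "doc-three"}   # mapping table


def resolve(query, k=2):
    """Rank documents for `query` with one accumulator per document."""
    n_docs = len(DOC_LENGTH)
    accumulators = defaultdict(float)
    for term in query.lower().split():
        plist = INVERTED.get(term, [])         # "retrieve the inverted list"
        if not plist:
            continue
        idf = math.log(1 + n_docs / len(plist))
        for doc_id, tf in plist:
            # Weight grows with TF and IDF, shrinks with document length.
            accumulators[doc_id] += tf * idf / DOC_LENGTH[doc_id]
    # A heap-based partial sort is enough: only the top k are needed.
    top = heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
    return [DOC_URL[doc_id] for doc_id, _ in top]


print(resolve("Cat Mat Hat"))
```

With these numbers document 1 wins, and the short document 3 beats the long document 2 even though document 2 contains more query-term occurrences.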
Fast Search Engines

Inverted lists are stored in a compressed format. This allows
more information per second to be retrieved from disk, and
it lowers disk head seek times
As long as decompression is fast, there is a beneficial trade-
off in time
Documents are stored in a compressed format for the same
reason
Different compression schemes are used for lists (which are
integers) and documents (which are multimedia, but
mostly text)
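
One widely used family of integer schemes stores the gaps between successive document IDs with a variable number of bytes. The sketch below is an illustrative variable-byte coder (the slides do not say which scheme any particular engine uses); all function names are made up for the example.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)     # high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)         # high bit set: last byte of this number
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:              # terminating byte
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store a sorted doc-ID list as gaps, which are small and compress well."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


ids = [3, 7, 130, 100000]
packed = compress_postings(ids)
print(len(packed), decompress_postings(packed) == ids)
```

Here four document IDs fit in 6 bytes instead of 16 bytes as fixed 32-bit integers, and decoding is a single cheap pass.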



Fast Search Engines
Sort disk accesses to minimise disk head movement when
retrieving lists or documents
Use hash tables in memory to store the vocabulary; avoid
slow hash functions that use modulo
Pre-calculate and store constants in ranking formulae
Carefully choose integer compression schemes
Organise inverted lists so that the information frequently
needed is at the start of the list
Use heap structures when partial sorting is required
Develop a query plan for each query



Search Engine Architecture



Search Engine architecture
The inverted lists are divided amongst a number of
servers, where each is known as a shard
If an inverted list is required for a particular range of
words, then that shard server is contacted
Each shard server can be replicated as many times as
required; each server in a shard is identical
Documents are also divided amongst a number of
servers
Again, if a document is required within a particular
range, then the appropriate document server is
contacted
Each document server can also be replicated as many
times as required
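
A term-range router for such a layout might look like the sketch below. The shard boundaries, server names and the round-robin replica choice are all hypothetical; the slides do not specify how any real engine assigns ranges or balances replicas.

```python
import bisect
import itertools

# Hypothetical layout: shard i owns every term that sorts at or before
# SHARD_UPPER_BOUNDS[i] and after the previous bound.
SHARD_UPPER_BOUNDS = ["f", "m", "s", "\uffff"]
REPLICAS = [
    ["idx0-a", "idx0-b"],
    ["idx1-a", "idx1-b"],
    ["idx2-a"],
    ["idx3-a", "idx3-b", "idx3-c"],
]
# Identical replicas of a shard are used in turn to spread the query load.
_round_robin = [itertools.cycle(servers) for servers in REPLICAS]


def shard_for(term):
    """Index of the shard whose word range contains `term`."""
    return bisect.bisect_left(SHARD_UPPER_BOUNDS, term.lower())


def server_for(term):
    """Pick one replica of the responsible shard."""
    return next(_round_robin[shard_for(term)])


for word in ["cat", "hat", "mat", "zebra"]:
    print(word, "-> shard", shard_for(word))
```

Document servers can be routed the same way, with ranges of DocIDs in place of ranges of words.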
Google, Case Study



Google Architecture



Components
URL Server: Bunch of URLs (white-list)
Crawler: Fetch the page
Store Server: To store the fetched pages
Repository: Compressed pages are put
here
Every unique page has a DocID
Anchor: Page transition [to, from]
information
URLResolver: Relative URL to Absolute URL
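
What a URL resolver does can be illustrated with Python's standard `urllib.parse`. This is only a sketch of the idea, not Google's component; the function name `resolve_links` and the example URLs are assumptions.

```python
from urllib.parse import urldefrag, urljoin


def resolve_links(base_url, hrefs):
    """Turn the hrefs found on a page into absolute URLs without fragments."""
    resolved = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        resolved.append(absolute)
    return resolved


links = resolve_links(
    "http://example.com/a/b.html",
    ["c.html", "../d.html", "/e.html#top", "http://other.example/x"],
)
for link in links:
    print(link)
```

Dropping the `#fragment` part avoids treating two anchors on the same page as two different documents.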
Indexer

Parses the document


Build Word-Frequency table {word, position,
font, capitalization} [hits]
Pushes the hits to barrels as partially sorted
forward index
Identifies anchors (page transition out info)



Searcher

Forward Index to Inverted Index


Maps keywords to DocIds
DocIds mapped to URLs
Reranker
Uses Anchor information to rank the pages for
the given query keyword.
Rule of thumb: Fan-In (more incoming links) increases page rank



Reranking



What about Ranking?
Lots of variation here
Often messy; details proprietary and fluctuating
Combining subsets of:
IR-style relevance: Based on term frequencies,
proximities, position (e.g., in title), font, etc.
Popularity information
Link analysis information
Most use a variant of vector space ranking
to combine these. Here’s how it might
work:
Make a vector of weights for each feature
Multiply this by the counts for each feature
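
The weight-vector idea amounts to a dot product between feature weights and per-page feature counts. In the sketch below the feature names, weights and page data are all invented for illustration; real engines keep these proprietary.

```python
# Hypothetical feature weights; real engines tune and guard these closely.
WEIGHTS = {
    "term_frequency": 1.0,
    "title_match": 3.0,
    "proximity": 1.5,
    "popularity": 0.5,
    "inlinks": 2.0,
}


def score(features, weights=WEIGHTS):
    """Dot product of the weight vector and a page's feature counts."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


pages = {
    "page_a": {"term_frequency": 4, "title_match": 1, "inlinks": 2},
    "page_b": {"term_frequency": 6, "proximity": 1},
}
ranked = sorted(pages, key=lambda p: score(pages[p]), reverse=True)
print(ranked)
```

Here `page_a` outranks `page_b` despite a lower term frequency, because the title match and incoming links carry heavier weights.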


Relevance: Going Beyond IR

Page “popularity” (e.g., DirectHit)


Frequently visited pages (in general)
Frequently visited pages as a result of a query
Link “co-citation” (e.g., Google)
Which sites are linked to by other sites?
Draws upon sociology research on
bibliographic citations to identify
“authoritative sources”



Link Analysis for Ranking
Pages
Assumption: If the pages pointing to this
page are good, then this is also a good
page.
References: Kleinberg 98, Page et al. 98
Draws upon earlier research in sociology
and bibliometrics.
Kleinberg’s model includes “authorities”
(highly referenced pages) and “hubs” (pages
containing good reference lists).
Google model is a version with no hubs, and is
closely related to work on influence weights by
Pinski-Narin (1976).
Link Analysis for Ranking
Pages
Why does this work?
The official Toyota site will be linked to by lots
of other official (or high-quality) sites
The best Toyota fan-club site probably also
has many links pointing to it
Less high-quality sites do not have as many
high-quality sites linking to them



PageRank
Let A1, A2, …, An be the pages that
point to page A. Let C(P) be the # links
out of page P. The PageRank (PR) of
page A is defined as:
PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )

PageRank is the principal eigenvector of the
link matrix of the web.
Can be computed as the fixpoint of the
above equation.
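
The fixpoint can be approximated by simply iterating the equation. The sketch below uses a made-up three-page graph and assumes every page has at least one outgoing link; `d = 0.85` is a commonly quoted value, taken here as an assumption.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) over all pages.

    links: page -> list of pages it links to (each list must be non-empty).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Each page P pointing to A passes on PR(P) / C(P).
            incoming = (pr[p] / len(links[p]) for p in pages if a in links[p])
            new_pr[a] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

With the equation exactly as written, the ranks of this graph add up to the number of pages (3) rather than to 1. Page C, which receives links from both other pages, ends up ranked highest.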
PageRank: User Model
PageRanks form a probability distribution over web
pages: sum of all pages’ ranks is one (this holds when the
(1-d) term is divided by the total number of pages).
User model: “Random surfer” selects a page, keeps
clicking links (never “back”), until “bored”: then
randomly selects another page and continues.
PageRank(A) is the probability that such a user visits A
d is the damping factor, the probability of following a link; 1 - d is the probability of getting bored at a page
Google computes relevance of a page for a given
search by first computing an IR relevance and then
modifying that by taking into account PageRank for
the top pages.



Search Engine Optimization



How Search Engines Rank Pages?

Location, Location, Location...and Frequency
Tags (<title>, <meta>, <b>, top of the page)
How close words (from the query) are to each other on the
website
Quality of links going to and from a page
Penalization for "spamming", when a word is repeated
hundreds of times on a page, to increase the frequency and
propel the page higher in the listings.
Off the Page ranking criteria:
By analyzing how pages link to each other.



Why do results differ ?
Some search engines index more web pages
than others.
Some search engines also index web
pages more often than others.
The result is that no search engine has the
exact same collection of web pages to
search through.
Different algorithms to compute relevance
of the page to a particular query
Search Engine Placement Tips

Why is it important to be on the first page of the
results?
Most users do not go beyond the first page.
How to optimize your website?
Pick your target keywords: How do you think people will
search for your web page? The words you imagine them
typing into the search box are your target keywords.
Pick target words differently for each page on your
website.
Your target keywords should always be two or
more words long.
Position your Keywords
Make sure your target keywords appear in the crucial
locations on your web pages. The page's HTML
<title> tag is most important.
The titles should be relatively short and attractive.
Several phrases are enough for the description.
Search engines also like pages where keywords
appear "high" on the page: headline, first paragraphs
of your web page.
Keep in mind that tables and large JavaScript sections
can make your keywords less relevant because they
appear lower on the page.
Have Relevant Content

Keywords need to be reflected in the page's content.


Put more text than graphics on a page
Don't use frames
Use the ALT attribute (on <IMG> tags)
Make good use of <TITLE> and <H1>
Consider using the <META> tag
Get people to link to your page



Hiding Web pages

You may wish to have web pages that are not indexed
(for example, test pages).
It is also possible to hide web content from robots,
using the robots.txt file and the robots meta tag.
Not all crawlers will obey this, so this is not foolproof.
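
A polite crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the rules, the bot name `ExampleBot` and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that hides test and draft pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/test/page.html"))
```

The check is purely voluntary on the crawler's side, which is why hiding pages this way is not foolproof.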



Submitting To Search Engines

Search engines should find you naturally,
but submitting helps speed the process
and can increase your representation
Look for Add URL link at bottom of home
page
Submit your home page and a few key
“section” pages
Turnaround from a few days to 2 months



Deep Crawlers

AltaVista, Inktomi, Northern Light will add
the most, usually within a month
Excite, Go (Infoseek) will gather a fair
amount; Lycos gathers little
Index sizes are going up, but the web is
outpacing them…nor is size everything
Here are more actions to help even the
odds…



“Deep” Submit

A “deep” submit is directly submitting pages
from “inside” the web site – can help improve
the odds these will get listed.
At Go, you can email hundreds of URLs. Consider
doing this.
At HotBot/Inktomi, you can submit up to 50
pages per day. Possibly worth doing.
At AltaVista, you can submit up to 5 pages per
day. Probably not worth the effort.
Elsewhere, not worth doing a “deep” submit.
Big Site? Split It Up

Expect search engines to max out at
around 500 pages from any particular site
Increase representation by subdividing
large sites logically into subdomains
Search engines will crawl each subsite to more
depth
Here’s an example...



Subdomains vs. Subdirectories

Instead of             Do this
gold.ac.uk/science/    science.gold.ac.uk
gold.ac.uk/english/    english.gold.ac.uk
gold.ac.uk/admin/      admin.gold.ac.uk



I Was Framed

Don't use them. Period.


If you do use them, search engines will
have difficulty crawling your site.



Dynamic Roadblocks

Dynamic delivery systems that use ? symbols in
the URL string prevent search engines from
getting to your pages
http://www.nike.com/ObjectBuilder/ObjectBuilder.iwx?ProcessName=IndexPage&Section_Id=17200&NewApplication=t
Eliminate the ? symbol, and your life will be rosy
Look for workarounds, such as Apache URL rewriting
(mod_rewrite) or ColdFusion alternatives
Before you move to a dynamic delivery system,
check out any potential problems.
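
One possible workaround, sketched under assumptions: publish links in a path-style form without the ? symbol, and let the server translate them back. The Python function below only shows the link-generation half. The name to_static_style and the example.com URL are hypothetical, and the server would still need a matching rewrite rule (for example with Apache mod_rewrite) to map such paths back to the original parameters.

```python
from urllib.parse import urlsplit, parse_qsl, quote

def to_static_style(url):
    """Rewrite '...?key=value&k2=v2' as '.../key/value/k2/v2'."""
    parts = urlsplit(url)
    segments = [parts.path.rstrip("/")]
    # Each query parameter becomes two path segments: its name and its value.
    for key, value in parse_qsl(parts.query):
        segments.append(quote(key, safe=""))
        segments.append(quote(value, safe=""))
    return f"{parts.scheme}://{parts.netloc}" + "/".join(segments)

print(to_static_style("http://www.example.com/catalog?section=17200&page=2"))
```

Parameters with empty values are skipped by parse_qsl by default, so this sketch suits simple key=value URLs only.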

How Directories Work

Editors find sites, describe them and put them
in a category
Site owners can also submit to be listed
A short description represents the entire
web site
A directory usually also shows secondary
results from a crawler-based search engine



The Major Directories

Yahoo
The Open Directory
(Netscape, Lycos, AOL Search, others)
LookSmart
UK Plus
Snap



Submitting To Directories

Directories probably won't find you, or may list
you badly, unless you submit
Find the right category (more in a moment), then
use the Add URL link at the top or bottom of the page
Write down who submitted (and the email address used),
when, to which category, and any other details
You’ll need this info for the inevitable resubmission
attempt; it will save you time.



Submitting To Directories

Take your time and submit to these right
Write 3 descriptions: 15, 20 and 25 words long,
which incorporate your key terms
Search for the most important term you want to
be found for, and submit to the first listed category
that seems appropriate for your site
Be sure to note the contact name and email
address you provided on the submit form
If you don't get in, keep trying

Subdomain Advantage

Directories tend not to list subsections of a
web site.
In contrast, they do tend to see
subdomains as independent web sites
deserving their own listings
So, another reason to go with subdomains
over subdirectories



How to Search ?



What do we search for ?
Information
Reviews, news
Advice, methods
Bugs
Educational material
Examples:
Access Violation 0xC0000005
Search Engine ppt

Main Steps

Make a decision about the search
Formulate a topic. Define the type of resource that you
are looking for
Find relevant words to describe it
Find websites with information
Choose the best of them
Feedback: How did you search?



Main Problems

Why is it difficult to search?
Knowing the problem, but not knowing what to look
for
Losing focus (drifting to interesting but irrelevant
sites)
Performing a superficial (shallow) search
Search spam



Typical Problems

Links are often out of date
Usually too many links are returned
Returned links are not very relevant
The engines don't know about enough pages
Different engines return different results
Political bias



Typical Mistakes

Unnecessary words in a query
Unsuitable choice of keywords
Not enough flexibility in changing keywords (or search engines)
Poor division of time between searching and evaluating the
search results
“Your search did not match any documents.” – Bad
Query!



Search Tricks
What can we search for?
Thematic resource (http://www.topicmaps.org)
Community
Collection of articles
Forum
Catalog of resources, links
File (file types)
Encyclopedia article
Digital library
Contact information (e.g. email)

Improving Query Results

To look for a particular page, use an unusual phrase
you know is on that page
Use phrase queries where possible
Check your spelling!
Progressively use more terms
If you don't find what you want, use another Search
Engine!



Useful words
download
pdf, ppt, doc, zip, mp3
forum, directory, links
faq, for newbies, for beginners, guide, rules, checklist
lecture notes, survey, tutorials
how, where, correct, howto
Copy-pasting the exact error message
Have you tried http://del.icio.us/ ?



Search Engine Features



Features

Indexing features
Search features
Results display
Costs, licensing and registration requirements
Unique features (if any)



Indexing Features
File/document formats supported: HTML, ASCII, PDF, SQL, spreadsheets,
WYSIWYG (MS-Word, WP, etc.)
Indexing level support: File/directory level, multi-record files
Standard formats recognized: MARC, Medline, etc.
Customization of document formats
Stemming: If yes, is this an optional or mandatory feature?
Stop words support: If yes, is this an optional or
mandatory feature?
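
To make the last two features concrete, here is a small sketch of an indexer in which stop-word removal and stemming are both optional switches. Everything in it is an assumption made for illustration: the stop-word list is tiny and arbitrary, and naive_stem is a toy suffix stripper, not the Porter stemmer or any other real algorithm.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def naive_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text, use_stop_words=True, use_stemming=True):
    """Tokenise text and apply the optional indexing features."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    if use_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    if use_stemming:
        terms = [naive_stem(t) for t in terms]
    return terms

def build_index(docs, **options):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text, **options):
            index[term].add(doc_id)
    return index

docs = {1: "search engines", 2: "searching the web"}
print(dict(build_index(docs)))
print(dict(build_index(docs, use_stemming=False, use_stop_words=False)))
```

With both features on, "search" and "searching" land on the same index entry; with both off, they stay separate and "the" is indexed too, which is the trade-off behind asking whether each feature is optional.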



Searching Features
Boolean Searching: Use of Boolean operators AND, OR and NOT as search
term connectors
Natural Language: Allows users to enter the query in natural language
Phrase: Users can search for an exact phrase
Truncation/wild card: Variations of search terms and plural forms can be
searched
Exact match: Allows users to search for terms exactly as they are entered
Duplicate detection: Removes duplicate records from the retrieved records
Proximity: With connectors such as WITH, NEAR and ADJ (adjacent), one can specify
the position of a search term relative to others
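
A minimal sketch of how Boolean, phrase and proximity searching can work on top of a positional index. The three documents and the helper names (docs_with, near, phrase) are invented for this example, and real engines use far more compact data structures; the point is only that AND, OR and NOT reduce to set operations, while phrase and proximity also need word positions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, term):
    """Set of documents containing term."""
    return set(index.get(term.lower(), {}))

def near(index, a, b, k=1):
    """Documents where a and b occur within k words of each other."""
    hits = set()
    for doc_id in docs_with(index, a) & docs_with(index, b):
        pa, pb = index[a.lower()][doc_id], index[b.lower()][doc_id]
        if any(abs(x - y) <= k for x in pa for y in pb):
            hits.add(doc_id)
    return hits

def phrase(index, a, b):
    """Documents where b immediately follows a (a two-word phrase)."""
    hits = set()
    for doc_id in docs_with(index, a) & docs_with(index, b):
        positions_b = set(index[b.lower()][doc_id])
        if any(x + 1 in positions_b for x in index[a.lower()][doc_id]):
            hits.add(doc_id)
    return hits

# Made-up documents for illustration.
docs = {
    1: "web search engine basics",
    2: "search the web with an engine",
    3: "engine repair manual",
}
index = build_positional_index(docs)

and_result = docs_with(index, "search") & docs_with(index, "engine")  # search AND engine
or_result = docs_with(index, "search") | docs_with(index, "repair")   # search OR repair
not_result = docs_with(index, "engine") - docs_with(index, "search")  # engine NOT search

print(and_result, or_result, not_result)
print(phrase(index, "search", "engine"))
print(near(index, "web", "engine", k=2))
```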



Searching Features
Field Searching: Query for a specific field value in the database
Thesaurus searching: Search for Broader or Narrower or Related terms or Related
concepts
Query by example: Enables users to search for similar documents
Soundex searching: Search for records that sound like the search term, even if spelled differently
Relevance ranking: Ranking the retrieved records by estimated relevance to the query
Search set manipulation: Saving the search results as sets and allowing users to
view search history
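
For Soundex searching, a compact sketch of the classic American Soundex code is given below. A search feature could store this code for each indexed name and match on it; that usage, and the function name soundex, are assumptions of this example and not a description of any particular engine.

```python
# Letter groups of the classic American Soundex scheme.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if code and code != prev:
            result += code
        # h and w do not separate letters with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]

for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

Names that receive the same code, such as Robert and Rupert, would be treated as matches for each other.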



Results Display
Formats supported: Can it display in native format or just HTML? Can it display in
different formats? Does it display the number of records retrieved?
Relevancy ranking: If the retrieved records are ranked, how is the relevance score
indicated?
Keyword-in-context: KWIC or highlighting of matching search terms
Customization of results display: Allows users to select different display formats
Saving options: Saving in different formats; number of records that can be saved at a
time
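
Keyword-in-context display can be sketched as follows: for each match, show a window of surrounding text with the term marked. The function name kwic, the ** markers and the default window width are arbitrary choices for this illustration.

```python
import re

def kwic(text, term, width=30):
    """Return one snippet per match of term, with the match wrapped in **...**."""
    snippets = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        snippets.append(
            text[start:m.start()] + "**" + m.group(0) + "**" + text[m.end():end]
        )
    return snippets

for line in kwic("A search engine indexes pages.", "engine", width=7):
    print(line)
```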



Evaluation of Search Engines



CRITICAL EVALUATION
Why Evaluate What You Find on the Web?

Anyone can put up a Web page
about anything
Many pages are not kept up to date
No quality control
most sites are not “peer-reviewed”
less trustworthy than scholarly publications
no selection guidelines for search engines



Web Evaluation Techniques
Before you click to view the page...
Look at the URL - personal page or site ?
~ or % or “users” or “members” in the URL suggests a personal page
Domain name appropriate for the content ?
edu, com, org, net, gov, ca.us, uk, etc.
Published by an entity that makes sense ?
News from its source?
www.nytimes.com
Advice from valid agency?
www.nih.gov/
www.nlm.nih.gov/
www.nimh.nih.gov/



Web Evaluation Techniques
Scan the perimeter of the page
Can you tell who wrote it ?
name of page author
organization, institution, agency you recognize
e-mail contact by itself not enough
Credentials for the subject matter ?
Look for links to:
“About us” “Philosophy” “Background” “Biography”
Is it recent or current enough ?
Look for “last updated” date - usually at bottom

If no links or other clues...
truncate back the URL
http://hs.houstonisd.org/hspva/academic/Science/Thinkquest/gail/text/ethics.html
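
Truncating back a URL means removing one path segment at a time and visiting each shorter address until something identifies the publisher. A small sketch, with a made-up example.edu address and a hypothetical helper name truncate_back:

```python
from urllib.parse import urlsplit

def truncate_back(url):
    """Yield successively shorter versions of url, ending at the site root."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    segments = [s for s in parts.path.split("/") if s]
    # Drop the last path segment each time, moving up one directory per step.
    for depth in range(len(segments) - 1, 0, -1):
        yield root + "/".join(segments[:depth]) + "/"
    yield root

for candidate in truncate_back("http://www.example.edu/dept/notes/ethics.html"):
    print(candidate)
```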



Web Evaluation Techniques
Indicators of quality
Sources documented
links, footnotes, etc.
As detailed as you expect in print publications ?
do the links work ?
Information retyped or forged
why not a link to published version instead ?
Links to other resources
biased, slanted ?



Web Evaluation Techniques
What Do Others Say ?

Search the URL in alexa.com
Who links to the site? Who owns the domain?
Type or paste the URL into the basic search box
Traffic for top 100,000 sites
See what links are in Google’s Similar pages
Look up the page author in Google

Web Evaluation Techniques
STEP BACK & ASK: Does it all add up ?
Why was the page put on the Web ?
inform with facts and data?
explain, persuade?
sell, entice?
share, disclose?
as a parody or satire?

Is it appropriate for your purpose?



Try evaluating some sites...
Search a controversial topic in Google:
"nuclear armageddon"
prions danger
“stem cells” abortion
Scan the first two pages of results
Visit one or two sites
try to evaluate their quality and reliability



Ufff, The End

Have you learned something today ?
Try whatever we've discussed today!
If you need help, let me know at
sudarsun@gmail.com
