08 Web Search and Web Crawling

Selected aspects of Web Search and Web Crawling
Motivation
• Billions of interconnected pages
• Everybody can publish
• Various formats and languages
• A lot of low-quality content
• A lot of dynamic content
[Diagram: the user sends queries to the search engine and receives results; inside the search engine logic, results and algorithms raise two questions: How are relevant pages determined? How are results sorted?]

Search engines have to bring some sort of semantics to the semantic-free structure of the web in order to deliver the desired results in an efficient manner.
General search engine architecture

[Diagram: the indexing module builds the indexes; the query module and the ranking module use them to answer user queries]
• Web crawling
– Web crawler architecture
– General crawler functionalities
– Web data extraction
– Robots Exclusion Standard
– Sitemaps
• Web search
– Preprocessing for index generation
– Content-based indexes
– Structure-based indexes
• PageRank
• HITS
Crawler
Universal crawler architecture

[Diagram: the universal crawler loop]
• Start with a set of seed URIs (the initial set of URIs, e.g. given by the user); they initialize the frontier.
• Fetch the next page from the frontier and evaluate its content.
• Extract URIs from the fetched page and add new ones to the frontier.
• Store the page in a page repository, where it is kept for further processing.
• Done? The stop status is reached if, e.g., a specified number of pages has been crawled or if the frontier is empty; otherwise continue with the next URI (see the sketch below).
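A minimal sketch of this loop in Python; fetch_page and extract_uris are hypothetical placeholders for the download and link-extraction steps, not part of the original slides:

from collections import deque

def crawl(seed_uris, fetch_page, extract_uris, max_pages=100):
    """Universal crawler loop: fetch, extract, store, check termination."""
    frontier = deque(seed_uris)   # URIs waiting to be visited
    visited = set()
    repository = {}               # page repository: URI -> page content
    while frontier and len(repository) < max_pages:   # "Done?" check
        uri = frontier.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        page = fetch_page(uri)             # download and evaluate content
        repository[uri] = page             # store page for further processing
        for new_uri in extract_uris(page):
            if new_uri not in visited:     # add only new URIs to the frontier
                frontier.append(new_uri)
    return repository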
Fetching of pages

[Diagram: a URI scheduler distributes URIs to several multithreaded downloaders, which fetch pages in parallel]
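A sketch of parallel page fetching with Python's standard library; the example URIs and the thread count are illustrative assumptions:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def download(uri):
    """Fetch a single page; return the error instead of raising it."""
    try:
        with urlopen(uri, timeout=10) as response:
            return uri, response.read()
    except OSError as error:
        return uri, error

# The scheduler hands URIs to a pool of downloader threads.
uris = ["http://www.example.com/", "http://www.example.com/fotos.html"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for uri, result in pool.map(download, uris):
        print(uri, "->", type(result).__name__)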
General crawling strategies

Breadth-first
• Always follow all links to adjacent pages of the current page first
• Add all found URIs of adjacent pages to a queue
• If all links of the current page have been visited, take the first page of the queue and start over

Depth-first
• Always follow the first unvisited link of the current page
• Declare this visited page as the current page and start over
• Backtrack at dead ends or after reaching a predefined depth

[Diagram: the same tree with nodes numbered in visit order — breadth-first visits level by level (0; 1 2 3; 4 5 6; 7), depth-first follows branches (0; 1 5 6; 2 4 7; 3)]
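Both strategies can be expressed with the same loop; only the frontier discipline differs: a FIFO queue yields breadth-first order, a LIFO stack yields depth-first order. A sketch on a hypothetical link graph:

from collections import deque

links = {0: [1, 2, 3], 1: [4, 5], 2: [6], 3: [7]}  # hypothetical tree

def crawl_order(start, breadth_first=True):
    frontier = deque([start])
    visited = []
    while frontier:
        # FIFO (popleft) = breadth-first, LIFO (pop) = depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in visited:
            continue
        visited.append(page)
        children = links.get(page, [])
        # Depth-first: push children in reverse so the first link is tried first.
        frontier.extend(children if breadth_first else reversed(children))
    return visited

print(crawl_order(0, breadth_first=True))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(crawl_order(0, breadth_first=False))  # [0, 1, 4, 5, 2, 6, 3, 7]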
Link extraction

[Diagram: the raw (X)HTML representation allows only uncomfortable access; parsing it into a DOM representation provides easy access methods]
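A sketch of link extraction via parsing instead of raw string handling, using Python's standard library HTML parser:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<html><body><a href="/fotos.html">Fotos</a></body></html>')
print(extractor.links)  # ['/fotos.html']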
Focused crawlers
• Restrict the crawl to pages that are relevant to a predefined topic, e.g. by evaluating the content of fetched pages and preferring links found on relevant pages
Robots Exclusion Standard

A robots.txt file in the root directory of a website (../robots.txt) disallows all cooperative web crawlers (e.g. Google's image search crawler or Yahoo's blog search crawler) from visiting the three mentioned subdirectories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
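A cooperative crawler can check these rules with Python's standard library; the concrete URL is an assumption for illustration:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()  # download and parse the robots.txt file

# Ask before fetching a URI; given the rules above:
print(robots.can_fetch("*", "http://www.example.com/tmp/file.html"))  # False
print(robots.can_fetch("*", "http://www.example.com/index.html"))     # True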
Sitemaps protocol

• A standardized way to inform web crawlers about the available URIs of a website and some of their properties
• Useful if some pages are not well linked or, e.g., are only reachable via an AJAX mechanism

[Diagram: a compressed sitemap (../Sitemap.xml.gz) lists the site's pages, e.g. ../index.html, ../fotos.html, …]

Example sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/index.html</loc>
<lastmod>1970-06-07</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/fotos.html</loc>
<changefreq>weekly</changefreq>
</url>
  ...
</urlset>
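A sketch of how a crawler might read such a sitemap, assuming a locally stored copy named sitemap.xml:

import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # sitemap namespace

tree = ET.parse("sitemap.xml")  # assumed local copy of the sitemap
for url in tree.getroot().iter(NS + "url"):
    loc = url.findtext(NS + "loc")
    changefreq = url.findtext(NS + "changefreq")  # optional, may be None
    print(loc, changefreq)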
Indexing

The user expects efficient delivery of results.

[Diagram: documents flow into the indexing module, which builds the indexes used to answer queries]

A simple inverted index over three documents (D1 = Document 1, D2 = Document 2, D3 = Document 3):

term          document
quote         D1, D2, D3
friedrich     D1, D3
schiller      D1, D3
appearance    D1
rule          D1
world         D1, D2
nietzsche     D3
writing       D3
human         D3
puppet        D3
arthur        D2
schopenhauer  D2
nothing       D2
will          D2
idea          D2
create        D3

Due to the intersection of the document sets for the terms "friedrich" and "schiller", document 1 and document 3 are part of the result set for the query "friedrich schiller".

What if the user only expects results that contain the phrase "friedrich schiller"? This information cannot be expressed by the simple inverted index structure.
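A sketch of this conjunctive lookup; the dictionary mirrors part of the table above:

# Part of the inverted index from the table above.
index = {
    "quote": {"D1", "D2", "D3"},
    "friedrich": {"D1", "D3"},
    "schiller": {"D1", "D3"},
    "arthur": {"D2"},
}

def search(query):
    """AND semantics: intersect the document sets of all query terms."""
    terms = query.lower().split()
    result = index.get(terms[0], set())
    for term in terms[1:]:
        result = result & index.get(term, set())
    return sorted(result)

print(search("friedrich schiller"))  # ['D1', 'D3']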
Inverted index structure with term positions

term          (document, position)
quote         (D1,1), (D2,1), (D3,1)
friedrich     (D1,3), (D3,3)
schiller      (D1,4), (D3,8)
appearance    (D1,5)
rule          (D1,6)
world         (D1,8), (D2,6)
nietzsche     (D3,4)
writing       (D3,7)
human         (D3,12)
puppet        (D3,14)
arthur        (D2,3)
schopenhauer  (D2,4)
nothing       (D2,8)
will          (D2,10)
idea          (D2,12)
create        (D3,11)

(D1 = Document 1, D2 = Document 2, D3 = Document 3)

By extending the stored information with term position data, the query for the phrase "Friedrich Schiller" can be processed properly: only in document 1 do the two terms appear at adjacent positions (3 and 4).

Besides regular content information, the inverted index can be extended by document structure information, such as the HTML tags directly surrounding a term (e.g. storing <h1> makes it possible to search within headings).

Open question: How are documents sorted if more than one document fits a query?
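A sketch of the phrase check on this positional index; only adjacent positions within the same document count:

# Positional index: term -> list of (document, position) pairs.
index = {
    "friedrich": [("D1", 3), ("D3", 3)],
    "schiller": [("D1", 4), ("D3", 8)],
}

def phrase_search(first, second):
    """Documents in which `second` occurs directly after `first`."""
    first_positions = set(index.get(first, []))
    return sorted({doc for doc, pos in index.get(second, [])
                   if (doc, pos - 1) in first_positions})

print(phrase_search("friedrich", "schiller"))  # ['D1']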
Term Frequency Inverse Document Frequency

$TFIDF_t = TF_t \cdot IDF_t$

$TF_{t,j} = \frac{n_{t,j}}{N_j}$, where $n_{t,j}$ is the number of occurrences of term t in document j and $N_j$ is the number of all terms in document j.

$IDF_t = \frac{|D|}{|D_t|}$, where $|D|$ is the number of documents in the document set and $|D_t|$ is the number of documents that contain term t.

If a term is very rare, a document that contains it is more important.
Term Frequency Inverse Document Frequency

Sum of both TFIDF values for the query: $1 \cdot \frac{1}{8} + 1.5 \cdot \frac{1}{8} = 0.313$. This rank is higher, and the document is thus presented at the first position of the result set to the user.

Content score + Structure score + Popularity score = Overall score
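A sketch of the computation with exactly these definitions (note that this slide's IDF uses no logarithm, unlike many textbook variants); the toy documents are assumptions:

def tfidf(term, document, documents):
    """TFIDF_t = TF_t * IDF_t with the definitions above."""
    tf = document.count(term) / len(document)            # n_{t,j} / N_j
    containing = sum(1 for d in documents if term in d)  # |D_t|
    idf = len(documents) / containing                    # |D| / |D_t|
    return tf * idf

docs = [
    ["quote", "friedrich", "schiller", "appearance"],  # toy documents,
    ["quote", "arthur", "schopenhauer"],               # already tokenized
    ["quote", "friedrich", "nietzsche"],
]
print(tfidf("friedrich", docs[0], docs))  # (1/4) * (3/2) = 0.375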
PageRank overview

[Diagram: the PageRank of a page is calculated based on the PageRanks of the pages that link to it]

PageRank details
$PageRank(p) = \frac{1-d}{|T|} + d \cdot \sum_{(q,p) \in E} \frac{PageRank(q)}{outdegree(q)}$

Each summand $\frac{PageRank(q)}{outdegree(q)}$ is the prestige assigned to p by q; d is the damping factor, T the set of all pages (|T| its size), and E the set of links (edges).
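A minimal power-iteration sketch of this formula on a small hypothetical link graph:

def pagerank(pages, edges, d=0.85, iterations=50):
    """Iteratively evaluate PageRank(p) = (1-d)/|T| + d * sum PR(q)/outdegree(q)."""
    rank = {p: 1 / len(pages) for p in pages}
    outdegree = {p: sum(1 for q, _ in edges if q == p) for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - d) / len(pages)
               + d * sum(rank[q] / outdegree[q] for q, r in edges if r == p)
            for p in pages
        }
    return rank

# Hypothetical graph: A and B link to C, C links back to A.
pages = ["A", "B", "C"]
edges = [("A", "C"), ("B", "C"), ("C", "A")]
print(pagerank(pages, edges))  # C accumulates the highest PageRank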
Authority:
• Page with many in-links
• The idea has been mentioned before: a page has good content on some topic, and thus many people trust it and link to it
• BUT: How can it be assured that the linking pages are of high quality and thus their "vote" is really valuable?

Hub:
• Page with many out-links
• Central idea: a hub serves as an organizer of information on a particular topic and points to many pages connected to this topic
• BUT: How can it be assured that it really points to "good" pages?
HITS - Hubs and Authorities
• Circular theses:
"A page is a good hub (and, therefore, deserves a high hub
score) if it points to good authorities, and a page is a good
authority (and, therefore, deserves a high authority score) if
it is pointed to by good hubs"
Mutual reinforcement
HITS
• HITS is query-dependent
– Initially, a set of documents is determined that fits a given user query
– Based on this set, HITS determines the hub and the authority value for each included document; two ranking lists (hubs and authorities) are finally presented to the user
– Due to the definition of hubs and authorities, the calculation can be carried out in an iterative manner (comparable to PageRank), as sketched below
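A sketch of this mutually reinforcing iteration on a hypothetical query-dependent subgraph:

def hits(pages, edges, iterations=50):
    """authority(p) = sum of hub(q) for q -> p; hub(q) = sum of authority(p) for q -> p."""
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        authority = {p: sum(hub[q] for q, r in edges if r == p) for p in pages}
        hub = {p: sum(authority[r] for q, r in edges if q == p) for p in pages}
        # Normalize so the mutually reinforcing scores do not grow unboundedly.
        a_norm = sum(a * a for a in authority.values()) ** 0.5
        h_norm = sum(h * h for h in hub.values()) ** 0.5
        authority = {p: a / a_norm for p, a in authority.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, authority

pages = ["A", "B", "C"]
edges = [("A", "C"), ("B", "C"), ("C", "A")]  # hypothetical subgraph
hub, authority = hits(pages, edges)
print(sorted(pages, key=authority.get, reverse=True))  # ranked as authorities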
Web crawler
• Crawling strategies
• Link extraction
• Parsing
• Robots Exclusion Standard / Sitemaps

[Diagram: users send queries through the user interface and receive good results sorted by relevance]