
9/20/2019

MULTIMEDIA INFORMATION RETRIEVAL
INFORMATION RETRIEVAL

CRAWLERS AND SOME MODELS IN INFORMATION RETRIEVAL

Recap: What is Information Retrieval (IR)?

Recap: Architecture of an IR and search system

[Figure: IR system architecture with the blocks PARSING & INDEXING (Repository, Doc Rep, Query Rep), SEARCH (query, Ranking, results, User), LEARNING (Ranking model), FEEDBACK (judgments, Evaluation), and APPLICATIONS]

Recap: Architecture of an IR and search system

[Figure: the same architecture labeled by component: Crawler and indexer, Document Analyzer, Query parser, Ranking model]

Course content:
1) Search engine architecture; 2) Retrieval models;
3) Retrieval evaluation; 4) Relevance feedback;
5) Link analysis; 6) Search applications.
CS@UVa


Recap: The field of IR

[Figure: Information Retrieval surrounded by related fields: Applications, Web Applications, Bioinformatics…, Mathematics, Statistics, Optimization, Machine Learning, Pattern Recognition, Data Mining, Natural Language Processing, Library & Info Science, Databases, Software engineering, Computer systems, Algorithms, Systems]

Contents

1. Search and the components of IR
2. Some models in IR
   2.1 Boolean model
   2.2 Vector space model
   2.3 Probabilistic model

Architecture of a search engine

• “The Anatomy of a Large-Scale Hypertextual Web Search Engine” - Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems 30.1 (1998): 107-117.
  • Citation count: 12197 (as of Aug 27, 2014)
  • Citation count: 13727 (as of Aug 30, 2015)
[Figure: architecture diagram annotated with the components Crawler and indexer, Query parser, Ranking model, Document Analyzer]

Architecture of a search engine

[Figure: components shown: User input, Query parser, Ranking model, Result post-processing, Result display, Crawler & Indexer, Document analyzer & auxiliary database, Domain specific database]


Search engine processing flow

[Figure: processing flow with the components Crawler, Indexed corpus, Doc Analyzer, Doc Representation, Indexer, Index, User, (Query), Query Rep, Ranker, results, Evaluation, and Feedback; the ranking procedure is labeled as the focus of research attention]

Basic components of IR

• Information need
  • “an individual or group's desire to locate and obtain information to satisfy a conscious or unconscious need” - wiki
  • An IR system tries to satisfy a user's information need
• Query
  • One way of representing the user's information need
  • It can take many forms: natural language, ….

Basic components of IR

• Document
  • Represents information that may be an answer to the user's information need
  • The data can take many forms: text, image, video, audio, …
• Relevance
  • How related the documents are to the user's information need
  • Judged along many aspects: topical, semantic, temporal, spatial, …
• One sentence about IR: “rank documents by their relevance to the information need”

Components of a search engine

• Web crawler
  • An automatic program that systematically browses the web for the purpose of Web content indexing and updating
• Document analyzer & indexer
  • Manage the crawled web content and provide efficient access to web documents


Components of a search engine

• Query parser
  • Compile user-input keyword queries into a managed system representation
• Ranking model
  • Sort candidate documents according to their relevance to the given query
• Result display
  • Present the retrieved results to users to satisfy their information need

Components of a search engine

• Retrieval evaluation
  • Assess the quality of the returned results
• Relevance feedback
  • Propagate the quality judgment back to the system for search result refinement


Components of a search engine

• Search query logs
  • Record users’ interaction history with the search engine
• User modeling
  • Understand users’ longitudinal information need
  • Assess users’ satisfaction with the search engine output

Browsing vs. Querying

• Browsing - what Yahoo did before
  • The system organizes information with structures, and a user navigates into relevant information by following a path enabled by the structures
  • Works well when the user wants to explore information, doesn’t know what keywords to use, or can’t conveniently enter a query (e.g., with a smartphone)
• Querying - what Google does
  • A user enters a (keyword) query, and the system returns a set of relevant documents
  • Works well when the user knows exactly what query to use for expressing her information need


2. IR systems
[Slides 21-23: no text recovered]

2. Text
[Slide 24: no text recovered]


2. Audio
[Slide 25: no text recovered]

2. Face retrieval
[Slide 26: no text recovered]

2. Video
[Slide 27: no text recovered]


2. Crawling data

Web Crawler - An automatic program that systematically browses the web for the purpose of Web content indexing and updating
• Synonyms: spider, robot, bot

2. Crawler - How it works

Pseudocode:
    def Crawler(entry_point):
        URL_list = [entry_point]
        while len(URL_list) > 0:
            # Which page to visit next?
            URL = URL_list.pop()
            # Is it visited already? Or shall we visit it again?
            # Is the access granted?
            if isVisited(URL) or not isLegal(URL) or not checkRobotsTxt(URL):
                continue
            HTML = URL.open()
            for anchor in HTML.listOfAnchors():
                URL_list.append(anchor)
            setVisited(URL)
            insertToIndex(HTML)
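The crawler pseudocode above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not a production crawler: `TOY_WEB` is a hypothetical in-memory graph that stands in for real HTTP fetches, and `is_allowed` is a hypothetical hook standing in for the legality and robots.txt checks. It also shows that the answer to "which page to visit next?" decides the strategy: taking URLs from the front of the frontier gives breadth-first order, taking them from the back gives depth-first order.

```python
from collections import deque

# Hypothetical toy web: URL -> outgoing links (stands in for real HTTP fetches).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d", "a"],  # the link back to "a" forms a loop
    "c": ["d"],
    "d": [],
}


def crawl(entry_point, web, breadth_first=True, is_allowed=lambda url: True):
    """Return the URLs in the order in which they were visited."""
    frontier = deque([entry_point])
    visited = set()
    order = []
    while frontier:
        # Which page to visit next? Front of the queue = BFS, back = DFS.
        url = frontier.popleft() if breadth_first else frontier.pop()
        # Skip pages already visited, unknown pages, and pages we may not access.
        if url in visited or url not in web or not is_allowed(url):
            continue
        visited.add(url)
        order.append(url)          # stands in for insertToIndex(HTML)
        frontier.extend(web[url])  # enqueue every anchor found on the page
    return order


print(crawl("a", TOY_WEB))                       # breadth-first order
print(crawl("a", TOY_WEB, breadth_first=False))  # depth-first order
```

The `visited` set is what keeps the loop between "a" and "b" from being crawled forever.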

2. Crawler - Crawling strategies

• Breadth first
  • Uniformly explore from the entry page
  • Memorize all nodes on the previous level
  • As shown in pseudo code
• Depth first
  • Explore the web by branch
  • Biased crawling given the web is not a tree structure
• Focused crawling (by topic or priority)
  • Prioritize the new links by predefined strategies

2. Crawler - Prioritized crawling

• Prioritize the visiting sequence of the web
  • The size of the Web is too large for a crawler (even Google) to completely cover
    • In 1999, no search engine indexed more than 16% of the Web
    • In 2005, large-scale search engines indexed no more than 40-70% of the indexable Web
  • Not all documents are equally important
    • Put more emphasis on the high-quality documents
• Maximize weighted coverage
  • WC(t) = \sum_{p \in C(t)} w(p), where WC(t) is the weighted coverage till time t, C(t) is the set of pages crawled till time t, and w(p) is the importance of page p

CS@UVa CS4501: Information Retrieval


2. Crawler - Prioritized crawling

• Prioritize by in-degree [Cho et al. WWW’98]
  • The page with the highest number of incoming hyperlinks from previously downloaded pages is downloaded next
• Prioritize by PageRank [Abiteboul et al. WWW’07, Cho and Uri VLDB’07]
  • Breadth-first in early stage, then compute/approximate PageRank periodically
  • More consistent with search relevance [Fetterly et al. SIGIR’09]

2. Crawler - Prioritized crawling

• Prioritize by topical relevance
  • In vertical search, only crawl relevant pages [De et al. WWW’94]
    • E.g., restaurant search engine should only crawl restaurant pages
  • Estimate the similarity to current page by anchor text or text near anchor [Hersovici et al. WWW’98]
  • User given taxonomy or topical classifier [Chakrabarti et al. WWW’98]
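The in-degree rule above can be illustrated with a small sketch. It assumes a hypothetical in-memory graph (`TOY_WEB`) instead of real downloads, counts only links found on pages already downloaded, and breaks ties alphabetically so the result is deterministic. It is an illustration of the idea, not the method of the cited papers.

```python
from collections import defaultdict

# Hypothetical toy web graph: URL -> outgoing links.
TOY_WEB = {
    "a": ["b", "c", "d"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": [],
}


def crawl_by_in_degree(entry_point, web):
    """Always download the frontier page with the most incoming links seen so far."""
    visited, order = set(), []
    in_degree = defaultdict(int)  # incoming links from already downloaded pages
    frontier = {entry_point}
    while frontier:
        # Highest in-degree first; ties are broken alphabetically.
        url = min(frontier, key=lambda u: (-in_degree[u], u))
        frontier.remove(url)
        visited.add(url)
        order.append(url)
        for link in web.get(url, []):
            in_degree[link] += 1
            if link not in visited:
                frontier.add(link)
    return order


print(crawl_by_in_degree("a", TOY_WEB))
```

In this toy graph, page "d" is downloaded before "c" because two downloaded pages already link to it, whereas a plain breadth-first crawl would visit "c" first.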

2. Crawler - Avoiding duplicates

• Since the web is a graph rather than a tree, avoiding loops in crawling is important
• What to check
  • URL: must be normalized, although normalization cannot avoid all duplication
    • http://dl.acm.org/event.cfm?id=RE160&CFID=516168213&CFTOKEN=99036335
    • http://dl.acm.org/event.cfm?id=RE160
  • Page: a minor change might cause a misfire
    • Timestamp, data center ID change in HTML
• How to check
  • trie or hash table

2. Crawler - Rules to respect when fetching information

• Crawlers can retrieve data much quicker and in greater depth than human searchers
• Costs of using Web crawlers
  • Network resources
  • Server overload
• Robots exclusion protocol
  • Examples: CNN, UVa
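For the duplicate check above, URL normalization plus a hash table can be sketched as follows. The sketch assumes that the `CFID` and `CFTOKEN` query parameters of the example URL carry only session state (an assumption made for illustration); `SESSION_PARAMS`, `normalize_url`, and `is_new` are hypothetical names. Python's built-in `set` plays the role of the hash table.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only carry session state.
SESSION_PARAMS = {"cfid", "cftoken"}


def normalize_url(url):
    """Lower-case scheme and host, drop session parameters and the fragment."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query.sort()  # parameter order should not create a new URL
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/",
         urlencode(query), "")
    )


seen = set()  # hash table of normalized URLs


def is_new(url):
    """Return True the first time a (normalized) URL is seen."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True


long_url = "http://dl.acm.org/event.cfm?id=RE160&CFID=516168213&CFTOKEN=99036335"
short_url = "http://dl.acm.org/event.cfm?id=RE160"
print(normalize_url(long_url))
print(is_new(long_url), is_new(short_url))
```

As the slide notes, this cannot catch every duplicate: two different URLs can still serve the same page.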


2. Crawler - Example robots.txt configurations

• Exclude specific directories:
    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    Disallow: /users/paranoid/
• Exclude a specific robot:
    User-agent: GoogleBot
    Disallow: /
• Allow a specific robot:
    User-agent: GoogleBot
    Disallow:

    User-agent: *
    Disallow: /

2. Crawler - Re-visiting the web

• The Web is very dynamic; by the time a Web crawler has finished its crawling, many events could have happened, including creations, updates and deletions
  • Keep re-visiting the crawled pages
  • Maximize freshness and minimize age of documents in the collection
• Strategy
  • Uniform re-visiting
  • Proportional re-visiting
    • Visiting frequency is proportional to the page’s update frequency
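The robots exclusion rules shown above can be checked from Python with the standard library's `urllib.robotparser`. The sketch below parses two of the configurations from in-memory strings instead of downloading a real robots.txt; the crawler name `MyCrawler` and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt rules given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


EXCLUDE_DIRS = """\
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
"""

ALLOW_ONLY_GOOGLEBOT = """\
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
"""

dirs = build_parser(EXCLUDE_DIRS)
only_google = build_parser(ALLOW_ONLY_GOOGLEBOT)

# "MyCrawler" is a hypothetical user-agent name.
print(dirs.can_fetch("MyCrawler", "http://example.com/tmp/x.html"))
print(dirs.can_fetch("MyCrawler", "http://example.com/index.html"))
print(only_google.can_fetch("GoogleBot", "http://example.com/index.html"))
print(only_google.can_fetch("MyCrawler", "http://example.com/index.html"))
```

A crawler would call `can_fetch` before each download, which is the role of `checkRobotsTxt(URL)` in the crawler pseudocode.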

2. Crawler - Analyzing a web page

• What you care about in the crawled web pages

2. Crawler - Analyzing a web page

• What the machine knows from the crawled web pages


2. Crawler - Analyzing a web page

• Needs to analyze and index the crawled web pages
  • Extract informative content from HTML
  • Build machine accessible data representation

2. Crawler - HTML parsing

• Generally difficult due to the free style of HTML
• Solutions
  • Shallow parsing
    • Remove all HTML tags
    • Only keep text between <title></title> and <p></p>
  • Automatic wrapper generation [Crescenzi et al. VLDB’01]
    • Wrapper: regular expression for HTML tags’ combination
    • Inductive reasoning from examples
  • Visual parsing [Yang and Zhang DAR’01]
    • Frequent pattern mining of visually similar HTML blocks
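Shallow parsing as described above can be sketched with Python's standard `html.parser` module. The class `ShallowParser` and the sample page are hypothetical. The sketch assumes reasonably well-formed HTML: an unclosed <p> tag, which free-style HTML often contains, would make it keep more text than intended.

```python
from html.parser import HTMLParser


class ShallowParser(HTMLParser):
    """Keep only the text found inside <title> and <p> elements."""

    KEEP = {"title", "p"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a kept element
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def shallow_parse(html):
    parser = ShallowParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = (
    "<html><head><title>IR course</title><script>var x = 1;</script></head>"
    "<body><div>menu</div><p>Crawlers <b>browse</b> the web.</p></body></html>"
)
print(shallow_parse(page))
```

The script body and the navigation text in the <div> are dropped, while inline markup inside the paragraph is stripped and its text kept.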

2. Crawler - HTML parsing

• jsoup
  • Java-based HTML parser
    • Scrape and parse HTML from a URL, file, or string to DOM tree
    • Find and extract data, using DOM traversal or CSS selectors
      • children(), parent(), siblingElements()
      • getElementsByClass(), getElementsByAttributeValue()
• Python version: Beautiful Soup

2. Crawler - Document representation

• Represent by a string?
  • No semantic meaning
• Represent by a list of sentences?
  • Sentence is just like a short document (recursive definition)
• Represent by a list of words?
  • Tokenize it first
  • Bag-of-Words representation!


2. Crawler - Document representation
Tokenization

• Break a stream of text into meaningful units
  • Tokens: words, phrases, symbols
  • Input: It’s not straight-forward to perform so-called “tokenization.”
  • Output(1): 'It’s', 'not', 'straight-forward', 'to', 'perform', 'so-called', '“tokenization.”'
  • Output(2): 'It', '’', 's', 'not', 'straight', '-', 'forward', 'to', 'perform', 'so', '-', 'called', '“', 'tokenization', '.', '”'
• Definition depends on language, corpus, or even context

2. Crawler - Document representation
Tokenization solutions

• Regular expression
  • [\w]+: so-called -> ‘so’, ‘called’
  • [\S]+: It’s -> ‘It’s’ instead of ‘It’, ‘’s’
• Statistical methods
  • Explore rich features to decide where the boundary of a word is
    • Apache OpenNLP (http://opennlp.apache.org/)
    • Stanford NLP Parser (http://nlp.stanford.edu/software/lex-parser.shtml)
  • Online Demo
    • Stanford (http://nlp.stanford.edu:8080/parser/index.jsp)
    • UIUC (http://cogcomp.cs.illinois.edu/curator/demo/index.html)
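The two regular expressions above can be tried directly with Python's `re` module. This is a minimal sketch; the function names are hypothetical. `\w+` splits at every non-word character, while `\S+` splits only at whitespace and so keeps punctuation attached to the tokens.

```python
import re

text = "It’s not straight-forward to perform so-called “tokenization.”"


def tokenize_word_chars(s):
    """Tokens are maximal runs of word characters ([\\w]+)."""
    return re.findall(r"\w+", s)


def tokenize_non_space(s):
    """Tokens are maximal runs of non-whitespace characters ([\\S]+)."""
    return re.findall(r"\S+", s)


print(tokenize_word_chars(text))
print(tokenize_non_space(text))
```

On the slide's input sentence, the `\S+` tokenizer yields the seven tokens of Output(1). The `\w+` tokenizer splits like Output(2) but drops the punctuation tokens instead of keeping them.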

2. Crawler - Document representation

• Bag-of-Words representation
  • Doc1: Information retrieval is helpful for everyone.
  • Doc2: Helpful information is retrieved for you.

Word-document adjacency matrix

        information  retrieval  retrieved  is  helpful  for  you  everyone
Doc1    1            1          0          1   1        1    0    1
Doc2    1            0          1          1   1        1    1    0

2. Crawler - Document representation

• Bag-of-Words representation
  • Assumption: words are independent of each other
  • Pros: simple
  • Cons: grammar and order are missing
  • The most frequently used document representation
    • Image, speech, gene sequence
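The word-document matrix for Doc1 and Doc2 can be reproduced with a few lines of Python. This is a sketch that assumes lower-casing and a simple `\w+` tokenizer; the function name `bag_of_words` is hypothetical, and the vocabulary order follows the slide.

```python
import re


def bag_of_words(docs, vocabulary):
    """Return one 0/1 row per document: 1 if the term occurs in the document."""
    rows = []
    for doc in docs:
        tokens = set(re.findall(r"\w+", doc.lower()))
        rows.append([1 if term in tokens else 0 for term in vocabulary])
    return rows


vocabulary = ["information", "retrieval", "retrieved", "is",
              "helpful", "for", "you", "everyone"]
docs = [
    "Information retrieval is helpful for everyone.",
    "Helpful information is retrieved for you.",
]

matrix = bag_of_words(docs, vocabulary)
for name, row in zip(["Doc1", "Doc2"], matrix):
    print(name, row)
```

Word order is discarded, which is the "grammar and order are missing" drawback listed above. Note also that "retrieval" and "retrieved" occupy different columns until stemming is applied.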


2. Crawler - Document representation
[Slides 49-50: no text recovered]

2. Crawler - Document representation
Normalization

• Convert different forms of a word to normalized form in the vocabulary
  • U.S.A -> USA, St. Louis -> Saint Louis
• Solution
  • Rule-based
    • Delete periods and hyphens
    • All in lower case
  • Dictionary-based
    • Construct equivalent class
      • Car -> “automobile, vehicle”
      • Mobile phone -> “cellphone”

2. Crawler - Document representation
Stemming

• Reduce inflected or derived words to their root form
  • Plurals, adverbs, inflected word forms
    • E.g., ladies -> lady, referring -> refer, forgotten -> forget
  • Bridge the vocabulary gap
  • Risk: lose precise meaning of the word
    • E.g., lay -> lie (a false statement? or be in a horizontal position?)
• Solutions (for English)
  • Porter stemmer: pattern of vowel-consonant sequence
  • Krovetz Stemmer: morphological rules
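The rule-based and dictionary-based normalization steps above can be combined in a short sketch. The `EQUIVALENTS` table is a hypothetical, hand-made dictionary, and mapping every member of an equivalence class to one representative term is an assumption made for illustration. Stemming is not covered here.

```python
import re

# Hypothetical equivalence classes: each variant maps to one representative.
EQUIVALENTS = {
    "automobile": "car",
    "vehicle": "car",
    "cellphone": "mobile phone",
    "st": "saint",
}


def normalize(token):
    """Rule-based cleanup followed by a dictionary lookup."""
    token = re.sub(r"[.\-]", "", token)   # delete periods and hyphens
    token = token.lower()                 # all in lower case
    return EQUIVALENTS.get(token, token)  # dictionary-based equivalence class


for raw in ["U.S.A", "USA", "St.", "Vehicle", "so-called"]:
    print(raw, "->", normalize(raw))
```

After normalization, "U.S.A" and "USA" map to the same vocabulary entry, so a query using one form can match documents using the other.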


2. Crawler - Document representation
Stopwords

• Useless words for query/document analysis
  • Not all words are informative
  • Remove such words to reduce vocabulary size
  • No universal definition
  • Risk: break the original meaning and structure of text
    • E.g., this is not a good option -> option
    • E.g., to be or not to be -> null
[Figure source: The OEC: Facts about the language]

Abstraction of search engine architecture

• Crawler (produces the indexed corpus)
  1. Visiting strategy
  2. Avoid duplicated visit
  3. Re-visit policy
• Doc Analyzer (produces the Doc Representation: Bag-of-Words representation!)
  1. HTML parsing
  2. Tokenization
  3. Stemming/normalization
  4. Stopword/controlled vocabulary filter
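The stopword filter and its risk can be shown in a few lines. The `STOPWORDS` set below is hypothetical and was chosen only so that the two examples from the stopwords slide come out as shown there. Real stopword lists differ, since there is no universal definition.

```python
# Hypothetical stopword list, chosen only to reproduce the slide's examples.
STOPWORDS = {"this", "is", "not", "a", "good", "to", "be", "or"}


def remove_stopwords(text):
    """Split on whitespace, lower-case, and drop every stopword."""
    return [token for token in text.lower().split() if token not in STOPWORDS]


print(remove_stopwords("this is not a good option"))
print(remove_stopwords("to be or not to be"))
```

The first sentence loses its negation and the second disappears entirely, which is the "break the original meaning" risk named on the slide.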

Automatic text indexing

• Query: “to be or not to be”
• In modern search engines
  • No stemming or stopword removal, since computation and storage are no longer the major concern
  • More advanced NLP techniques are applied
    • Named entity recognition
      • E.g., people, location and organization
    • Dependency parsing
