Week 2
Recap: IR system and search architecture
[Figure: IR system architecture, with labels: Crawler and indexer; Parsing & Indexing (Doc Rep, Query Rep, Repository); Search (Query parser, Ranking, results); Applications; Learning (Ranking model); Feedback (Evaluation, judgments); User.]
Course content:
1) Search engine architecture; 2) Retrieval models;
3) Retrieval evaluation; 4) Relevance feedback;
5) Link analysis; 6) Search applications.
CS@UVa
9/20/2019
[Figure: Information Retrieval and related areas: Web applications, bioinformatics, machine learning, pattern recognition, library & information science, natural language processing, statistics, databases, optimization, data mining, software engineering, computer systems, algorithms, systems.]
1. Search and the components of IR
2. Some IR models
2.1 Boolean model
2.2 Vector space model
2.3 Probabilistic model
Architecture of a search engine
[Figure: search engine architecture, with labels: User input, Result display, Document Analyzer.]
Basic components of IR
Some components of a search engine
[Figure: search engine architecture, with labels: Crawler, Indexed corpus, Doc Analyzer, Doc Representation, Indexer, Index, Ranker, results, User, (Query), Query Rep, Feedback, Evaluation, Ranking procedure, Research attention.]
• Query parser
  • Compile user-input keyword queries into a managed system representation
• Ranking model
  • Sort candidate documents according to their relevance to the given query
• Result display
  • Present the retrieved results to users to satisfy their information need
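The three components above can be sketched end to end. This is a minimal illustration, not the ranking model used in the course: the term-overlap score, the function names (`parse_query`, `rank`, `display`), and the sample documents are all assumptions made for the example.

```python
import re

def parse_query(raw_query):
    """Query parser: turn a raw keyword query into lowercase terms."""
    return re.findall(r"\w+", raw_query.lower())

def rank(query_terms, documents):
    """Ranking model (assumed): score = number of distinct query terms in the document."""
    scores = {}
    for doc_id, text in documents.items():
        doc_terms = set(re.findall(r"\w+", text.lower()))
        scores[doc_id] = sum(1 for term in set(query_terms) if term in doc_terms)
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def display(ranked_ids, documents, top_k=2):
    """Result display: format the top-k results for the user."""
    return [f"{i + 1}. {doc_id}: {documents[doc_id]}"
            for i, doc_id in enumerate(ranked_ids[:top_k])]

# Hypothetical three-document corpus.
documents = {
    "d1": "Crawlers fetch web pages",
    "d2": "Ranking models sort candidate documents",
    "d3": "Search engines rank documents for a query",
}
query_terms = parse_query("rank documents")
ranked = rank(query_terms, documents)
results = display(ranked, documents)
```

Here `d3` matches both query terms, `d2` matches one, and `d1` matches none, so the ranked order is `d3`, `d2`, `d1`.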
Some components of a search engine
• Retrieval evaluation
  • Assess the quality of the returned results
• Relevance feedback
  • Propagate the quality judgment back to the system for search result refinement
Some components of a search engine
• Search query logs
  • Record users’ interaction history with the search engine
• User modeling
  • Understand users’ longitudinal information need
  • Assess users’ satisfaction towards search engine output
Browsing vs. Querying
• Browsing – what Yahoo did before
  • The system organizes information with structures, and a user navigates into relevant information by following a path enabled by the structures
  • Works well when the user wants to explore information or doesn’t know what keywords to use, or can’t conveniently enter a query (e.g., with a smartphone)
• Querying – what Google does
  • A user enters a (keyword) query, and the system returns a set of relevant documents
  • Works well when the user knows exactly what query to use for expressing her information need
2. IR system
2. Text
2. Crawler - Avoiding duplicates
• Because the web is a graph rather than a tree, avoiding loops while crawling is important
• What to check
  • URL: must be normalized, although normalization cannot necessarily avoid all duplication
    • http://dl.acm.org/event.cfm?id=RE160&CFID=516168213&CFTOKEN=99036335
    • http://dl.acm.org/event.cfm?id=RE160
  • Page: a minor change might cause a misfire
    • Timestamp, data center ID change in HTML
• How to check
  • trie or hash table

2. Crawler - Some rules when fetching information
• Crawlers can retrieve data much quicker and in greater depth than human searchers
• Costs of using Web crawlers
  • Network resources
  • Server overload
• Robots exclusion protocol
  • Examples: CNN, UVa
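The duplicate check on URLs can be sketched with a normalization step followed by a hash-table lookup. This is a minimal sketch under stated assumptions: the set of parameters to drop (`IGNORED_PARAMS`) and the helper names are hypothetical, and a real crawler would need site-specific rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these session-style parameters do not change the page content.
IGNORED_PARAMS = {"cfid", "cftoken"}

def normalize_url(url):
    """Lower-case scheme and host, drop ignored parameters and the fragment."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in IGNORED_PARAMS]
    query.sort()  # parameter order should not create a new URL
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))

seen = set()  # hash table of normalized URLs already visited

def should_visit(url):
    """Return True the first time a normalized URL is seen, False afterwards."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True

first = should_visit(
    "http://dl.acm.org/event.cfm?id=RE160&CFID=516168213&CFTOKEN=99036335")
second = should_visit("http://dl.acm.org/event.cfm?id=RE160")
```

Both example URLs from the slide normalize to the same key, so the second call reports a duplicate. Page-level duplicates (timestamps, data center IDs in the HTML) are not covered by this sketch.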
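A crawler that follows the robots exclusion protocol checks a site's robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the crawler name `ExampleCrawler`, and the `example.com` URLs are made up for illustration, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the site it is about to crawl.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# Ask whether our (hypothetical) crawler may fetch each URL.
allowed_public = parser.can_fetch(
    "ExampleCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch(
    "ExampleCrawler", "http://example.com/private/data.html")
```

With these rules the public page may be fetched and the page under `/private/` may not.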
2. Crawler - How to analyze a web page
• What you care about from the crawled web pages
• What the machine knows from the crawled web pages
2. Crawler - HTML parsing
• jsoup
  • Java-based HTML parser
  • Scrape and parse HTML from a URL, file, or string to a DOM tree
  • Find and extract data, using DOM traversal or CSS selectors
    • children(), parent(), siblingElements()
    • getElementsByClass(), getElementsByAttributeValue()
• Python version: Beautiful Soup

2. Crawler - Document representation
• Represent by a string?
  • No semantic meaning
• Represent by a list of sentences?
  • A sentence is just like a short document (recursive definition)
• Represent by a list of words?
  • Tokenize it first
  • Bag-of-Words representation!
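The HTML parsing step can be illustrated without any third-party package. The slide names jsoup and Beautiful Soup; the sketch below swaps in Python's standard `html.parser` module instead so that it runs as is. The class name `TextAndLinkExtractor` and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collect visible text and link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

# Hypothetical crawled page.
html = """<html><head><title>IR course</title><style>p {color: red;}</style></head>
<body><p>Information retrieval is <a href="/helpful.html">helpful</a>.</p></body></html>"""

extractor = TextAndLinkExtractor()
extractor.feed(html)
text = " ".join(extractor.text_parts)
```

The extracted links feed the crawler's frontier, while the extracted text goes on to tokenization.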
2. Crawler - Document representation
Tokenization
• Break a stream of text into meaningful units
  • Tokens: words, phrases, symbols
  • Input: It’s not straight-forward to perform so-called “tokenization.”
  • Output(1): 'It’s', 'not', 'straight-forward', 'to', 'perform', 'so-called', '“tokenization.”'
  • Output(2): 'It', '’', 's', 'not', 'straight', '-', 'forward', 'to', 'perform', 'so', '-', 'called', '“', 'tokenization', '.', '”'
• Definition depends on language, corpus, or even context

2. Crawler - Document representation
Tokenization solutions
• Regular expression
  • [\w]+: so-called -> ‘so’, ‘called’
  • [\S]+: It’s -> ‘It’s’ instead of ‘It’, ‘’s’
• Statistical methods
  • Explore rich features to decide where the boundary of a word is
  • Apache OpenNLP (http://opennlp.apache.org/)
  • Stanford NLP Parser (http://nlp.stanford.edu/software/lex-parser.shtml)
• Online Demo
  • Stanford (http://nlp.stanford.edu:8080/parser/index.jsp)
  • UIUC (http://cogcomp.cs.illinois.edu/curator/demo/index.html)
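The two regular-expression tokenizers can be compared on the example sentence from the slide. This is a small sketch using Python's `re` module; the variable names are chosen for the example.

```python
import re

text = "It’s not straight-forward to perform so-called “tokenization.”"

# [\w]+ keeps runs of word characters, so punctuation splits tokens.
word_tokens = re.findall(r"[\w]+", text)

# [\S]+ keeps runs of non-whitespace, so punctuation stays attached.
chunk_tokens = re.findall(r"[\S]+", text)
```

`chunk_tokens` reproduces Output(1) from the slide. `word_tokens` splits at the same places as Output(2) but drops the punctuation symbols instead of keeping them as separate tokens.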
2. Crawler - Document representation
• Bag-of-Words representation
  • Doc1: Information retrieval is helpful for everyone.
  • Doc2: Helpful information is retrieved for you.

      information  retrieval  retrieved  is  helpful  for  you  everyone
Doc1  1            1          0          1   1        1    0    1
Doc2  1            0          1          1   1        1    1    0

2. Crawler - Document representation
• Bag-of-Words representation
  • Assumption: words are independent of each other
  • Pros: simple
  • Cons: grammar and order are missing
• The most frequently used document representation
  • Image, speech, gene sequence
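The Bag-of-Words vectors for Doc1 and Doc2 can be computed in a few lines. This is a minimal sketch: the `\w+` tokenizer with lower-casing and the helper names are assumptions, and the vocabulary order is copied from the slide's table.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word; word order and grammar are discarded."""
    return Counter(re.findall(r"\w+", text.lower()))

def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [bag[term] for term in vocabulary]

docs = {
    "Doc1": "Information retrieval is helpful for everyone.",
    "Doc2": "Helpful information is retrieved for you.",
}
vocabulary = ["information", "retrieval", "retrieved", "is",
              "helpful", "for", "you", "everyone"]

vectors = {name: to_vector(bag_of_words(text), vocabulary)
           for name, text in docs.items()}
```

Without stemming, "retrieval" and "retrieved" stay separate dimensions, which is why the two documents differ in those columns.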
2. Crawler - Document representation
Data normalization
• Convert different forms of a word to a normalized form in the vocabulary
  • U.S.A -> USA, St. Louis -> Saint Louis
• Solution
  • Rule-based
    • Delete periods and hyphens
    • All in lower case
  • Dictionary-based
    • Construct equivalent class
      • Car -> “automobile, vehicle”
      • Mobile phone -> “cellphone”

2. Crawler - Document representation
Stemming
• Reduce inflected or derived words to their root form
  • Plurals, adverbs, inflected word forms
  • E.g., ladies -> lady, referring -> refer, forgotten -> forget
• Bridge the vocabulary gap
• Risk: lose precise meaning of the word
  • E.g., lay -> lie (a false statement? or be in a horizontal position?)
• Solutions (for English)
  • Porter stemmer: pattern of vowel-consonant sequence
  • Krovetz Stemmer: morphological rules
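The rule-based and dictionary-based normalization steps can be combined in one small function. This is a sketch under stated assumptions: the `EQUIVALENTS` table is a hypothetical equivalence-class dictionary, and the choice of "car" and "mobile phone" as canonical terms follows the slide's examples.

```python
# Hypothetical equivalence classes: each variant maps to one canonical term.
EQUIVALENTS = {
    "automobile": "car",
    "vehicle": "car",
    "cellphone": "mobile phone",
}

def normalize(token, equivalents=EQUIVALENTS):
    """Apply the rule-based step, then the dictionary-based step."""
    # Rule-based: all in lower case, delete periods and hyphens.
    cleaned = token.lower().replace(".", "").replace("-", "")
    # Dictionary-based: map known variants to their canonical term.
    return equivalents.get(cleaned, cleaned)

usa = normalize("U.S.A")
car = normalize("Automobile")
```

Multi-word rewrites such as St. Louis -> Saint Louis need phrase-level handling and are outside this token-level sketch.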
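The idea of stemming can be shown with a toy suffix stripper. This is explicitly not the Porter or Krovetz stemmer: the three rules and the function name `toy_stem` are assumptions chosen only to reproduce two of the slide's examples.

```python
def toy_stem(word):
    """Strip a few common English suffixes (toy rules, not Porter)."""
    word = word.lower()
    # Plural in -ies: ladies -> lady
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # Progressive -ing, undoubling a final consonant: referring -> refer
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    # Plain plural -s, but leave words ending in -ss alone
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

stems = [toy_stem(w) for w in ["ladies", "referring", "crawlers", "class"]]
```

Irregular forms such as forgotten -> forget are beyond such suffix rules, which is one reason rule sets like Porter's and morphological approaches like Krovetz's are more elaborate.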
2. Crawler - Document representation
Stopwords
• Useless words for query/document analysis
  • Not all words are informative
  • Remove such words to reduce vocabulary size
  • No universal definition
• Risk: break the original meaning and structure of text
  • E.g., this is not a good option -> option
    to be or not to be -> null

Abstraction of search engine architecture
[Figure: Crawler, Indexed corpus, Doc Analyzer, Doc Representation.]
• Crawler
  1. Visiting strategy
  2. Avoid duplicated visit
  3. Re-visit policy
• Doc Analyzer
  1. HTML parsing
  2. Tokenization
  3. Stemming/normalization
  4. Stopword/controlled vocabulary filter
• Doc Representation
  • BagOfWord representation!
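The stopword filter and its risk can be reproduced with the two phrases from the Stopwords slide. This is a minimal sketch: since there is no universal stopword definition, the `STOPWORDS` set below is a hypothetical list chosen so that the slide's examples come out as shown.

```python
import re

# Hypothetical stopword list; real lists vary by system and corpus.
STOPWORDS = {"this", "is", "not", "a", "good", "to", "be", "or"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize, lower-case, and drop every token in the stopword list."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]

kept_1 = remove_stopwords("this is not a good option")
kept_2 = remove_stopwords("to be or not to be")
```

The first phrase loses its negation and the second disappears entirely, which is the loss of meaning and structure the slide warns about.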