Professional Documents
Culture Documents
Email:yangjw@pku.edu.cn
1
User
The Web
Spider
Inverted
Index
Search
Engine
Library, etc.
3
(B
SE/IR
SE/IR
doc(index
terms)
5
10
S1
S2
Doc1
Doc2
Doc3
(posting list)
<doc1><doc2>
<doc1><doc3>
Index terms df
computer 3
database 2
science
system
4
1
Dj, tfj
D7, 4
D1, 3
a posting
a postings list
D2, 4
D5, 2
Optional and may be stored in a separate file
10
Index
term word
:
: :
N :
11
->
12
13
<term,doc>
<term,doc>
<term,doc>
<term,doc>
term,
docid
term,pt r
Doc1,doc2
term,ptr
Doc1,doc2
14
B-Tree-style Access
....
apple
Exact match
O(1) access
....
....
....
....
Database of Inverted Lists
....
zebra
Exact match
Range match
O(log (n)) access
15
Term
term
liber,liberal,liberalist
(MPH,
<termid,docid>
16
--
1.Parsing
,index termdftf
hashterm id(lexicon
file)
2.index termtf,df,
posting list
parsing
3.
17
18
<docid,..>
Huffman
1,3,7,11,13,14 1,2,4,4,2,1
golombbytecode
19
<d0,d1>
<d0,d1,d2>
20
VSM
AND
rank
Similarity(Q,D)=COS(Q,D)
Cos(Q, Dd )
1
Wq W d
tQ
W
tQ
d ,t
* Wq ,t
Wq W d
f d ,t * log(
f
tQ
d ,t
* idf * 1 * idf
Wq Wd
N 2
)
ft
21
VSM rank
Document-level<docid,F(d,t)>
|D|
Word-level
<docid,F(d,t),<loc1,loc2locF(d,t)>>
IN TITLE
loc .VS. wordlevel indextext interval
22
23
Smart
Smart retrieval system
ftp://ftp.cs.cornell.edu/pub/smart/
2080
Gerard
Salton
Chris Buckley
Smart11(1999)
24
Smart
( wqk wdk )
k 1
( wqk ) ( wdk ) 2
k 1
k 1
2
25
Smart
Smart
(query)
stopwordsstopwords
(stemming)
weighting
26
Smart
500MB
10GBTREC Web
Track
27
Okapi
http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html
2080
Unix
Okapi
20Okapi
TREC
Okapi
BM2500
( K tf )( k 3 qft )
T Q
1
(r 0.5) /( R r 0.5)
w log
(n r 0.5) /( N n R r 0.5)
1
29
Lemur Toolkit
http://www.lemurproject.org/
-CMU
2001
CC++Unix
Windows(C++, Java and C# APIs) 30
Lemur Toolkit
31
Best-in-class
Lucene
LuceneJava
www.lucene.com
SourceForge
LuceneDoug Cutting
/
AppleV-Twin
Excite
2001APACHEjakarta
Lucene
Lucene Java 4.2 (2013/3/11)
http://lucene.apache.org/
33
Lucene
Lucene
(Apache Solr)
/
Web
html,pdf,word,mp3
(Analyzer)
34
Lucene
nutch, Lucene
JavaLucene
Project
Web
webweb
(NDFS)
(Lucene)
Solr
35
Lucene
Lucene
Lucene.Net to the C# and .NET platform
utilizing Microsoft .NET Framework.
C++CLucene
OutlookLucene
iTunes
Chinabbs
36
Lucene
LuceneJavaJava
Lucene
Lucene
Lucene
Lucene
37
Lucene
38
Lucene
39
org.apache.Lucene.index/
org.apache.Lucene.search/
org.apache.Lucene.analysis/
org.apache.Lucene.queryParser/
org.apache.Lucene.document/
org.apache.Lucene.store/ IO/
org.apache.Lucene.util/
40
41
Exact match
O(1) access
B-Tree-style Access
....
apple
....
....
....
....
....
zebra
Exact match
Range match
O(log (n)) access
Dj, tfj
Index terms
df
computer
D7 , 4
database
D1, 3
D2, 4
D5, 2
a posting
a postings list
science
system
43
.tii(in memory)
.tis
<==>
apple foo bar
=>
apple applet
aqua
foo
.frq
.prx
44
Lucene(1)
(Term
dictionary)
Frequency data)
, ,
;
Proximity data)
, ; 45
.tis
F1:apple
F1:applet
F2:ace F2:add
Field1 end
Field1 start
Field2 start
46
(stored field)
.fdx
.fdt
doc0
doc1 doc2
47
Lucene(2)
.fnm: filed name; .fdx: field index;
.fdt: field data; .f(field num.): field norm;
.nrm: norm by fieldlength (=.f(num) v2.1) ;
.del: a segment contain deletions
segments;
commit.lock;
index.lock;
deleteable
49
Segment
segment
Segment
50
Segment classes
SegmentReader
SegmentTermEnum
(.tii, .tis)
SegmentTermDocs
(.frq)
FieldInfos
(.fnm)
FieldsReader
(.fdx .fdt)
SegmentTermPositions
(.prx)
51
Segments
segment()
.tis
NN
52
docId
segment
segment
Segment1: docId [0, 1, 99]
Segment2: docId [0, 1, 49]
Segments: Segment1 + Segment2
segment2
53
Different Field
Yes Yes Yes
Yes Yes No
No Yes Yes
No No Yes
54
55
56
IndexWriter classes
IndexWriter
DocumentWriter
SegmentMerger
57
document
DocumentWriter
document
document
hashing
token stream
Inverted
list
segment
segment
segment
segment
(SegmentMerger)
58
segment
disk
100 docs
10 docs
1 doc
1 doc
10 docs
1 doc
10 docs
1 doc
memory
59
segment
Segment1 (2 docs)
Apple
Doc0: prox1, prox2
Doc1: prox1, prox2
Banana
Doc1: prox1, prox2
Segment2 (1 doc)
Apple
Doc0: prox1, prox2
Candy
Doc0: prox1, prox2
Segment1+Segment2
60
61
Query classes
Query
TermQuery
BooleanQuery
RangeQuery
62
Query object
Select * from idx1 where f1:a and f2:b
queryParser.parse()
BooleanQuery
TermQuery
Term=f1:a
TermQuery
Term=f2:b
63
queryParser.
Parse()
query object
IndexReader
Scorer.score()
64
TermQuery
TermQueryLucene
QueryTerm
TermQuery
score
idfboost
(TermQuery)
sqrt(freq)*fieldboost/sqrt(fieldlength)
fieldboost 1.0
norm1
65
BooleanQuery
BooleanQueryQuery
Query
BooleanQuery
queryboost
queryBooleanQuery
BooleanQuery
scorej
=
coordj*i(boost i*idf i*tfi,j*idf i*fieldnorm)
/ sqrt(i (idf i *idf i *boost i *boost i))
fieldnorm = fieldboost / sqrt(fieldlength)
sqrt(i
67
Lucene
Sim ( d j , q )
dj q
|dj ||q |
i 1
w
i 1
i, j
2
i, j
wi ,q
t
2
w
i ,q
i 1
w i,j = tfi,j*idf i
69
Delta
a,b,c
Case
1: docId
docId.frq
0, 2, 3, 5, 6, 10
docId
Delta
0, 2, 1, 2, 1, 4
70
Delta
Case
2: term
.tis
Delta
VInt
Int4bytes
int
10
Delta
1byte()
byte()(
)
72
VInt
10
73
Byte: an eight-bit byte;
1: 00000001
128: 10000000 00000001
127: 01111111
16,383: 11111111 01111111
.tii(in memory)
.tis
<==>
apple foo bar
<==>
apple applet aqua
foo
.frq
.prx
75
76
Web
77
78
yhy
chpch pc h p
diannao
windows
Microsoft
79
Semantic Web
XMLOntologyWeb
Web
Web
Ontology
...
80
Web Sources
Search Engine System
Query pages
Glass
Web Sources
Reasoner
Ontology
Query pages
Glass
81
XML
82
83
84
Hash
Table
B,
B+
85
K-D-B-
J.T.Robinson SIGMOD1981
K-D-BB+
86
K-D-B-
87
R-tree
(A. Guttman SIGMOD1984)
88
R-tree
R-treeB-Tree
R-tree
R-tree
deadspace,
89
R-tree
90
R+-tree
(T. Sellis VLDB1987)
R+-treeK-D-B-tree
R+-treeR-tree
R-
91
R+-tree
(T. Sellis VLDB1987)
92
SS-Tree
(D. A. White ICDE1996 )
SS-TreeR*-Tree
SS-TreeR*-Tree
93
SS-Tree
(D. A. White ICDE1996 )
94
SS-Tree
SS-TreeR*Tree
R*Tree
95
SR-Tree
(N. Katayama SIGMOD1997)
96
SR-Tree
97
SR-Tree
MBS
MBR
SS-Tree
R*-Tree
98
R-Tree
5R-Tree
90%
high-dimension>=5)
99
X-Tree
(S. Berchtold VLDB1996)
X-TreeR-Tree
(supernode)
X-
(Super node)
(
)
100
X-Tree
101
K-D-B Tree
grid file)
R-Tree SR-Tree
X-Tree A-Tree
103
MBRs
MBSs
R-tree
(SS-Tree)
R-Tree SS-Tree
VAFile)
SR-Tree
A-Tree)
105
R-Tree X-Tree
VA-File A-Tree)
106
(LSALDA)
Lucene
107
108