You are on page 1of 108

(2013)

Email:yangjw@pku.edu.cn
1

User

The Web

Spider

Inverted
Index

Search
Engine

Library, etc.
3

(B

SE/IR

SE/IR

doc(index

terms)
5

10
S1

S2



Doc1
Doc2
Doc3

(posting list)

<doc1><doc2>

<doc1><doc3>


Index terms df
computer 3
database 2

science
system

4
1

Dj, tfj
D7, 4
D1, 3

a posting
a postings list

D2, 4
D5, 2
Optional and may be stored in a separate file

10

Index

term word

:
: :
N :

11

->

12

13

<term,doc>
<term,doc>
<term,doc>
<term,doc>

term,
docid

term,pt r

Doc1,doc2

term,ptr

Doc1,doc2
14

Hash Table Access


zebra
: :
apple

B-Tree-style Access
....
apple

Exact match
O(1) access

....

....
....
....
Database of Inverted Lists

....

zebra

Exact match
Range match
O(log (n)) access

15

Term
term

liber,liberal,liberalist
(MPH,

<termid,docid>

16

--
1.Parsing

,index termdftf
hashterm id(lexicon
file)
2.index termtf,df,
posting list
parsing

3.

17

18

<docid,..>

Huffman

1,3,7,11,13,14 1,2,4,4,2,1

golombbytecode
19

<d0,d1>


<d0,d1,d2>

20

VSM

AND

rank

Similarity(Q,D)=COS(Q,D)
Cos(Q, Dd )

1
Wq W d

tQ

W
tQ

d ,t

* Wq ,t

Wq W d
f d ,t * log(

f
tQ

d ,t

* idf * 1 * idf

Wq Wd

N 2
)
ft
21

VSM rank
Document-level<docid,F(d,t)>
|D|


Word-level
<docid,F(d,t),<loc1,loc2locF(d,t)>>

IN TITLE
loc .VS. wordlevel indextext interval

22

23

Smart
Smart retrieval system
ftp://ftp.cs.cornell.edu/pub/smart/

2080

Gerard

Salton
Chris Buckley
Smart11(1999)
24

Smart

G. Salton ed. The SMART Retrieval System


Experiments in Automatic Document
RetrievalEnglewood Cliff, NJ: Prentice Hall Inc.; 1971
G. Salton, M.J. McGill Introduction to Modern
Information Retrieval McGraw-Hill Book Co. New
York, NY, USA 1986 t
Sim(Q, D)

( wqk wdk )

k 1

( wqk ) ( wdk ) 2

k 1
k 1
2

25

Smart

Smart


(query)

stopwordsstopwords
(stemming)
weighting


26

Smart

500MB

10GBTREC Web
Track

27

Okapi
http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html

2080
Unix

nominal fee of 100 pounds sterling


the source code for Okapi for the sum of 1000 pounds
sterling
Okapi is no longer available for purchase(2008.12.10)
28

Okapi

20Okapi

TREC
Okapi

BM2500

(k1 1)tf (k 3 1)qtf


w

( K tf )( k 3 qft )
T Q
1

(r 0.5) /( R r 0.5)
w log
(n r 0.5) /( N n R r 0.5)
1

29

Lemur Toolkit
http://www.lemurproject.org/
-CMU
2001

CC++Unix
Windows(C++, Java and C# APIs) 30

Lemur Toolkit

Free and open-source software


Lemur Toolkit v 4.12 (2010/06/21)

31

Lemur Toolkit -- Indri

Indri is a text search engine developed at UMass.

It is a part of the Lemur project.


http://www.lemurproject.org/indri/
terabyte

Best-in-class

ad hoc retrieval performance


Supports popular structured query operators from
INQUERY
Open source, with a flexible BSD-inspired license
Parses PDF, HTML, XML, and TREC documents
API can be used from Java, PHP, or C++
Can be used on a cluster of machines
32

Lucene

LuceneJava
www.lucene.com
SourceForge
LuceneDoug Cutting
/
AppleV-Twin
Excite

2001APACHEjakarta
Lucene
Lucene Java 4.2 (2013/3/11)
http://lucene.apache.org/
33

Lucene
Lucene
(Apache Solr)

/
Web

html,pdf,word,mp3

(Analyzer)
34

Lucene

nutch, Lucene
JavaLucene

Project
Web

webweb

(NDFS)
(Lucene)
Solr

is the popular, blazing fast open source


enterprise search platform from the
Apache Lucene project

35

Lucene

Lucene
Lucene.Net to the C# and .NET platform
utilizing Microsoft .NET Framework.
C++CLucene

OutlookLucene
iTunes
Chinabbs
36

Lucene
LuceneJavaJava
Lucene
Lucene

Lucene
Lucene

37

Lucene

38

Lucene

39

Structure of source code

org.apache.Lucene.index/
org.apache.Lucene.search/
org.apache.Lucene.analysis/
org.apache.Lucene.queryParser/
org.apache.Lucene.document/
org.apache.Lucene.store/ IO/
org.apache.Lucene.util/

40

41

Hash Table Access


zebra
: :
: :
apple

Exact match
O(1) access

B-Tree-style Access
....
apple

....

....
....
....

....

zebra

Exact match
Range match
O(log (n)) access

Database of Inverted Lists


42


Dj, tfj

Index terms

df

computer

D7 , 4

database

D1, 3

D2, 4

D5, 2

a posting
a postings list

science
system

Optional and may be stored in a separate file

43


.tii(in memory)
.tis

<==>
apple foo bar

=>
apple applet

aqua

foo

.frq

docIds for apple

docIds for applet

.prx

proxs for apple

proxs for applet

44

Lucene(1)

.tis: term infos;

.tii: term info index;

(Term

dictionary)

.frq: each term with frequency;


(Term

Frequency data)
, ,
;

.prx: each term position with document;


(Term

Proximity data)
, ; 45


.tis
F1:apple

F1:applet

F2:ace F2:add

Field1 end
Field1 start

Field2 start

46

(stored field)

.fdx

.fdt

doc0

doc1 doc2

d0:f0 d0:f1 d0:f2 d1:f0 d1:f1

47

Lucene(2)
.fnm: filed name; .fdx: field index;
.fdt: field data; .f(field num.): field norm;
.nrm: norm by fieldlength (=.f(num) v2.1) ;
.del: a segment contain deletions
segments;
commit.lock;
index.lock;
deleteable

tis(tii) -> frq (prx) -> document ID


(nrm, del) -> fdx -> fdt
48

49

Segment
segment

.tii, .tis, .frq, .prx


: .fdx, .fdt
:
.fnm: (fieldInfos)
.f0, .f1 : normalization vector (norms)

Segment


50

Segment classes
SegmentReader

SegmentTermEnum
(.tii, .tis)
SegmentTermDocs
(.frq)

FieldInfos
(.fnm)
FieldsReader
(.fdx .fdt)

SegmentTermPositions
(.prx)
51

Segments
segment()



.tis
NN

52

docId
segment
segment
Segment1: docId [0, 1, 99]
Segment2: docId [0, 1, 49]
Segments: Segment1 + Segment2

docId [0, 1, , 99, 100, 101, , 149]


segment1

segment2
53

Different Field

Yes Yes Yes

Yes Yes No

No Yes Yes

No No Yes

54

55

56

IndexWriter classes

IndexWriter

DocumentWriter

SegmentMerger

57

document

DocumentWriter
document
document

hashing
token stream
Inverted
list
segment
segment

segment

segment

(SegmentMerger)

58

segment
disk
100 docs

10 docs

1 doc

1 doc

10 docs

1 doc

10 docs

1 doc
memory
59

segment
Segment1 (2 docs)

Apple
Doc0: prox1, prox2
Doc1: prox1, prox2
Banana
Doc1: prox1, prox2

Segment2 (1 doc)

Apple
Doc0: prox1, prox2
Candy
Doc0: prox1, prox2

Segment1+Segment2

Apple: D0: p1, p2 D1: p1, p2 D2:


Banana:
Candy:

60

61

Query classes

Query

TermQuery

BooleanQuery

RangeQuery

62

Query object
Select * from idx1 where f1:a and f2:b
queryParser.parse()
BooleanQuery
TermQuery
Term=f1:a

TermQuery
Term=f2:b

63

queryParser.
Parse()
query object

IndexReader

Scorer.score()

HitCollector <Doc1, 0.5>


Hits:
<Doc5, 0.9>
<Doc2,
0.7>
(sort)
<Doc2, 0.7>

64

TermQuery

TermQueryLucene
QueryTerm
TermQuery
score

= sqrt(freq) * idf * norm * boost


idf = ln(maxDoc/(docFreq + 1) )+ 1.0
norm = fieldboost / sqrt(fieldlength)

idfboost
(TermQuery)

sqrt(freq)*fieldboost/sqrt(fieldlength)
fieldboost 1.0

norm1

65

BooleanQuery
BooleanQueryQuery
Query
BooleanQuery

queryboost
queryBooleanQuery

3.0 2.0 1.0


66

BooleanQuery

scorej

=
coordj*i(boost i*idf i*tfi,j*idf i*fieldnorm)
/ sqrt(i (idf i *idf i *boost i *boost i))
fieldnorm = fieldboost / sqrt(fieldlength)
sqrt(i

(idf i *idf i *boost i *boost i))

67

Lucene

Sim ( d j , q )

dj q
|dj ||q |

i 1

w
i 1

i, j

2
i, j

wi ,q
t

2
w
i ,q
i 1

w i,j = tfi,j*idf i

wi,q = boost q*idf q


djsqrt(fieldlength)
68

69

Delta

a,b,c
Case

=> a, b-a, c-b

1: docId

docId.frq

0, 2, 3, 5, 6, 10
docId
Delta
0, 2, 1, 2, 1, 4
70

Delta
Case

2: term

.tis

apple, applet, application, banana

Delta

<0, apple>, <5, t>, <4, ication>,


<0, banana>
71

VInt
Int4bytes
int

10
Delta
1byte()

byte()(
)
72

VInt
10

0~127: 1byte, 0xxxxxxx


128~2^14-1: 2bytes
xxxxxxxyyyyyyy =>
1xxxxxxx, 0yyyyyyy
2^14~2^21-1: 3bytes
2^21~2^28-1: 4bytes
2^28~2^32-1: 5bytes

73


Byte: an eight-bit byte;

VInt: A variable-length format

UInt32: UInt32 --> <Byte>4

Chars : UTF-8 encoding

Uint64: UInt32 --> <Byte>8

String: String --> VInt, Chars

1: 00000001
128: 10000000 00000001

127: 01111111
16,383: 11111111 01111111

16,384: 10000000 10000000 00000001


74


.tii(in memory)
.tis

<==>
apple foo bar

<==>
apple applet aqua

foo

.frq

docIds for apple

docIds for applet

.prx

proxs for apple

proxs for applet

75

76

Web

77

78

yhy
chpch pc h p

diannao

windows
Microsoft

79

Semantic Web
XMLOntologyWeb
Web
Web

Ontology

...

80

Traditional Web Search application

Web Sources
Search Engine System

Results(maybe 20,000 in 10 secs):


Glasses for window,eye Glass,
Special kind of Material-Glass, etc

Query pages
Glass

The Semantic Web Search Application

Web Sources

Reasoner
Ontology
Query pages
Glass

You get exact Results(2,3secs):


(10 according to user requirements)
EyeGlasses,
EyeGlass shop near you,
The cost you can afford,
The current popular style

81


XML

82

83

84


Hash

Table

B,

B+

85

K-D-B-
J.T.Robinson SIGMOD1981

K-D-BB+

86

K-D-B-

87

R-tree
(A. Guttman SIGMOD1984)

88

R-tree
R-treeB-Tree

R-tree

R-tree
deadspace,

89

R-tree


90

R+-tree
(T. Sellis VLDB1987)

R+-treeK-D-B-tree

R+-treeR-tree

R-

91

R+-tree
(T. Sellis VLDB1987)

92

SS-Tree
(D. A. White ICDE1996 )
SS-TreeR*-Tree



SS-TreeR*-Tree

93

SS-Tree
(D. A. White ICDE1996 )

94

SS-Tree

SS-TreeR*Tree
R*Tree

95

SR-Tree
(N. Katayama SIGMOD1997)

96

SR-Tree

97

SR-Tree
MBS
MBR
SS-Tree

R*-Tree

98

R-Tree

5R-Tree
90%
high-dimension>=5)

99

X-Tree
(S. Berchtold VLDB1996)

X-TreeR-Tree
(supernode)

X-

(Super node)

(
)

100

X-Tree

101

K-D-B TreesJ.T.Robinson SIGMOD1981


R-treeA. Guttman SIGMOD1984
R+-tree (T. Sellis VLDB1987)
R*-TreeN. Beckmann SIGMOD1990
SS-Tree (D. A. White ICDE1996 )
SR-Tree (N. Katayama SIGMOD1997)
M-Tree (P.Ciaccia VLDB1997)
VA-File (R. Weber VLDB1998)
Pyramid-Tree(S.Berchtold SIGMOD1998)
A-Tree (Y. Sakurai VLDB2000)
102

K-D-B Tree
grid file)

R-Tree SR-Tree
X-Tree A-Tree

103

MBRs

MBSs

R-tree

(SS-Tree)

MBRs and MBSs


(SR-Tree)
104

R-Tree SS-Tree

VAFile)

SR-Tree

A-Tree)
105

R-Tree X-Tree

VA-File A-Tree)

106


(LSALDA)

Lucene

107

108

You might also like