Professional Documents
Culture Documents
net/publication/2639316
CITATIONS READS
105 134
4 authors, including:
Kyoungro Yoon
Konkuk University
72 PUBLICATIONS 552 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Kyoungro Yoon on 15 April 2014.
P. Bruce Berra
Dept. of Electrical and Computer Engineering
Syracuse University
91
contains a keyword “multimedia,” the inverted lists for m
the keyword “multimedia” will contain all the element
identifiers in the path from the root to the paragraph.
Finally, they have proposed a new scheme, called el-
ement locator scheme, for document structure indexing.
In this approach, each index term in the inverted list is
associated with a path encoding from the root to the
leaf and any necessary sibling numbers. This approach
,...-.,
also requires considerable storage overhead in the in-
verted list.
.......
In this paper, we propose new inverted index and
Figure 1: Example Document Tree
signature schemes for structured
file documents which
reduce the storage overhead considerably. Our schemes
use specially designed unique element identifiers (UID’S)
in order to reduce the number of index entries. By let- Table 1: Unique Element Identifiers
ting the UID carry information about the document
element UID element ] UID
structure, we can obtain the UID’S of the ancestors
T f 8
and deacendents directly from the UID of an element.
2 9
Thus, we do not need to store all the element identifiers
3 : 14
containing a keyword in the index unlike the previous
5 i 15
schemes.
6 j 16
I
In order to facilitate structure query processing, an the UID, the UID’S of the ancestors and descendants
index structure supporting fast element access must be can be obtained using the shift operations.
provided,
92
4 Indexing Structured Documents A
person
Since users can access any kind of document element
female
in the document tree, it is necessary to build an index
girl
structure that facilitates access to an element in the
database. We propose various kinds of new inverted in- woman
dex structures and signature file structures which sup- r
port fast access to document elements.
B c
4.1 Inverted Index
person person
female
In order to support direct element access, it is re-
quired to include all the index terms of elements in the * girl
inverted list. In the document tree, leaf nodes are as-
I woman I
5’11
sociated with data while internal nodes represent only
structure relationships between document elements. D E
Suppose that a document has three leaf nodes and
person person
each leaf element has index terms as shown in Figure 2.
female female
Here, we can consider the node A is a chapter, B and
girl woman
C are sections, and D and E are subsections.
Even though the internal nodes have no associated 4.1.2 Inverted Index for All Levels with Repli-
document data, the data at the subtrees must be con- cation (ALWR)
sidered as their data. Thus, the index terms for each
node are as follows: This scheme is the same as the ANWR except that each
element type has a separate inverted list per index term.
index (A) = {person ,f emale ,man,girl, woman} A similar scheme has also been described in [23] without
index (B) = {person, female ,girl, woman:} analysis. The inverted list for the example is as follows:
index(C) = {person, mm}
index (D) = {person, f emale, girl} invert-list (person) = {{A} ,{B ,C}, {D, E}}
index(E) = {person, f emale, woman} invert-list (female) = {{A}, {B}, {D, E}}
invert-list (man) = {{A} ,{ C}, { }}
Now the problem is how to maintain an inverted
invert -list (girl ) = {{A}, {B}, {D}}
index structure for fast access to document elements.
invert-list (woman) = {{A}, {B}, {El}
4.1.1 Inverted Index for All Nodes with Repli- The inverted list for “person” has three sub-lists for
cation (ANWR) three element types. This scheme has also considerable
storage overhead in the inverted list.
The first naive approach is to replicate all index terms of
the children to their ancestor nodes as shown in Figure
3.
93
4.1.3 Inverted Index for Leaf Nodes Only
(LNON)
B c
person
[ man Figure 5: Indices without Replication
D E
Using the ANOR inverted list, we can access any el-
person person ement in the database using the parent and child func-
female female tion.
girl wornan
4.1.5 Inverted Index for Root Node Only
(RNON)
Figure 4: Indices for Leaf Nodes Only
We can consider an extreme case which has been used
The inverted list for the example is as follows: in traditional information retrieval systems. In this
scheme, we include only the element identifier of the
invert-list (person) = {C, D, E}
root in the inverted list, that is, the document identi-
invert-list (female) = {D ,E}
fier only, as shown in Figure 6.
invert-list (man) = {C}
invert-list (girl) = {D}
A
invert-list (woman) = {E}
person
The storage requirement of this scheme is much less r fernsle
than that of the ANWR. Even though we do not main- girl
tain the UID’S of the internal nodes in the document woman
tree, the UID’S of the ancestor nodes of a node can be man
calculated by the parent function. Thus, we can access
any document element using the inverted list.
94
4.2 Signature File 5 Cost Comparison
The signature file is a useful indexing technique in We analyze the storage requirements and disk ac-
information retrieval systems because its storage uti- cess times of the ANWR, ALWR, LNON, and ANOR
lization is better than that of the inverted index [11] inverted indexing schemes. The inverted index model
[19]. In order to support structure queries efficiently, which will be used for the analysis is shown in Figure
we propose modified signature file structures. 7.
postings list
4.2.1 Signature File for All Nodes (SFAN) keyword Sddlwses doc id element id’s
.00 I
● 00
In this scheme, we maintain signatures of all elements
.00
in the document tree. However, it requires considerable . ●
.0.
1
: : I
space to maintain element signatures. Moreover, it is
:. .:
&f-
inefficient as the entire signature file must be accessed nodes ...
to process a query. (1 ~ ‘“7;
●
..
.. I
1
To reduce the space for the signature file, we maintain Table 2: Symbols and Definitions
signatures of the leaf nodes only and do not maintain -–
svmbol I definition
signatures of the internal nodes. This scheme is similar
d degree of IV-tree
to the LNON inverted indexing scheme. The signature
h height of document tree
of the internal node can be calculated by ORing the
k degree of document tree
signatures of its leaf nodes.
1 height of B+-tree
m average number of keywords in a node
4.4 Signature File for Root Node Only
ndOC total number of documents in database
(SFRN)
ninv~f.t_fi~t number of inverted lists per index term
nkey total number of keywords in database
This scheme is the one used in traditional informa-
P rate of promoted keywords from
tion retrieval systems. In this scheme, only the root
children to parent node
node of the document tree has a signature and the other
sblock block size in bytes
nodes do not have signatures. However, it is diflicult to
Sdoc-id document identifier size in bytes
support structure queries in this scheme.
Selem_id element identifier size in bytea
sentry table entry size in bytes
4.5 Signature File for Selected Levels
sk=~ average keyword size in bytes
(SFSL)
.$ptr pointer size in bytea
s#et set size of document variable
In order to reduce the time required to perform the
t~~~d~~ random disk block access time
bitwise OR operation in the SFLN, we can maintain
tseq sequential disk block access time
signatures at some selected levels of the document tree.
u rate of unique index terms
The SFLN and SFRN are the extreme cases of this
in children nodes
scheme. As an example of this scheme, we can combine
the SFLN and SFRN schemes, that is, we can maintain
signatures for the root node and leaf nodes. We can also
maintain three levels of signatures, for example, for the
root node, leaf nodes, and middle level nodes.
95
5.1 Storage Requirement is large, the document is sparse, that is, the elements of
the document are not closely related.
The storage requirement of an index structure is the Assume that the average number of index terms of
sum of those required for the J3+-tree and postings list. a leaf node is m. Then the number of index terms of a
node at level (h - 1) is
Sindez = Sbtree + Sposting. (3)
k*m*u.
The B+-tree has the leaf and internal nodes.
The number of index terms of a node at level i is
Sbtree = Sleaf + Sinternal. (4)
kh-i * m * Uh-i.
Since the total number of keywords in the database is
nkcy, the number of disk blocks required for the leaf Since the number of nodes at level i is ki-l, the total
nodes is number of index terms at level i is
If we assume that the height of the B+-tree is 1 and the Since the number of index terms of a root node is
degree (fan-out) is d, then the number of disk blocks for
kh-l*m*uh-l,
the internal nodes is
where
96
LNON:
The total number of index terms per document is
Tposting = Gandom + (Nb/.ck_pe?_ke# - 1) * tseq, (17)
N“ost_pev_doe = ~h-l * ~ (13) where the number of disk blocks of the postings list per
keyword is
since the number of leaf nodes in the tree is k“- 1.
Sponting
Nblock.per_key = (18)
ANOR: [ nk.y 1
For the analysis of the ANOR, we introduce a parameter
p which represents the probability that an index term 5.3 Analytical Results
is promoted from the child nodes to their parent node.
The range of p is We compare the storage requirements of the inverted
indexing schemes by using the parameters as shown in
O<p<l. Table 3. The disk access times are based on the Conner
CP30200 model [8]. We assume that the database has
Then, the number of index terms of a leaf is 100,000documents with 50,000 keywords.
m-m *p.
Table 3: Parameter Settings
The number of index terms of a node at level i is
I symbol I value
m *ph-i — m * ph-i+l. d 64
h 5
The total number of index terms at level i is k 5
_ ki-1 * m *ph-i+l
1 3
ki-l * m *ph-i
m 10
ndoc 100,000
because the number of nodes at level i is ki-l. And the
?Jinv..~Ji8t 5 (ALWR), 1 (others)
number of index terms of the root node is
~key 50,000
m*ph-l. P varied
sb~~~k 1024
Thus, the total number of index terms of a document is Sdoe_id 4
Sel.m.id 2
Sentry 4
IV,O,t .-,.. ~oC= ~(ki-l.m.p’-i-ki-l .m*pk-i+l) Skey 12
i=2 4
$Jptr
s*.t 50
+rn * ph-1
tra~&~ 19 msec
t seq 0.7 msec
u 0.6
= kh-1 * m
- mx~pi * (kh-i -kh-i-l) . (14) Figure 8 illustrates the storage requirements of the
i=l ANWR, ALWR, LNON, and ANOR, when u is 0.6. It
shows that the ANOR has the best performance while
5.2 Disk Access Time the ALWR has the worst. It shows that the index size of
the ANOR decreases linearly as the value of p increases.
The disk access time required to retrieve the postings Figure 9 illustrates the average disk access times of
list of a keyword is the sum of the I?+-tree access time the four schemes, when u is 0.6 and any arbitrary key-
and postings list access time, word index is accessed. It shows that the ANOR has
the best performance when p is larger than 0.1. We also
T search = Tbt.ee + Tpo.ting. (15) have obtained similar results with other values of u.
The disk access time of the I?*-tree is
5.4 Experimental Results
97
45246 ! I .—+
t
110
10s
1
, -..--Ii--
..........--.......---..* ................. ........... .............. .............~ k."- ........-"-.--.- ......u-----
....---.---
......-----9..---.--..-.--.---..-...*..-.-----..-.------.k
*W ~----...
——.-——
19+06
k
Y !
—
73 -
Sooml I lol , J
0.1 0.16 0.4 0.3s 0.3 0.33 OA 043 0s 0.1 0.16 0.2 0.26 0s 0,4 OAK 0.s
~,- w p&3rm (p)
Figure8: Index Space Requirement Figure 10: Experimental Disk Access Time
‘m~
space considerably compared to the previous schemes.
116 -
Our scheme exploits the hierarchical document struc-
ture and and uses the fact that index terms are inherited
110
between hierarchically related elements. We have eval-
I uated the storage requirements and disk access times of
1! la
the inverted index schemes. Among the inverted index-
i ing schemes, the ANOR (inverted index without repli-
!7/ ‘w . cation) has shown the best performance in the storage
i ~,0""""""""-"""""---"-0""""""'"""""""""--"""""""*--""""""---"""""-""""""-"":
.- ——— ——- +-———————-..--+ --.--- —-..
_----+— __________ requirement. It also has the least disk access time when
n----- the elements of a document have some common key-
30 - words. It means that the ANOR can be the best choice
since the difference of the storage requirements between
Ml 1
0.1 0.1s 0.2 0.26 0s 04 0.4s 0.2
the ANOR and other schemes is great. By using this
A3r* (P)
index structure, we can access document elements very
fast with much less index space.
Figure 9: Average Disk Access Time
Recently, much attention has been focused on video
databases [22] [26]. Video documents also have hierar-
perimented in a single user environment after building chical structures as with text documents. By exploit-
a partial inverted index in a 100 Mbyte disk space. We ing this structure using structure queries, users can ob-
have assumed the databsse contains 100,000 documents tain greater benefits than by using only content queries.
with 50,000 keywords and u is 0.6. Figure 10 shows the Our indexing schemes can also be applied to structured
results of the experiment obtained by accessing the in- video document management.
dex for a given keyword 100 times each and averaging In this paper, we have presented signature file schemes
the access times. The results are similar to the analyt- which can be used for structured documents. We are
ical results. The reason that the experimental results currently evaluating the storage requirements and disk
are somewhat faster is that we have experimented in access times of them.
a single user environment with a relatively small disk
space and we have not considered other aspects such as
buffer management in the analysis. References
98
[2] P. B. Berra, Y. K. Lee, and K. Yoon, “Multime- [14] D. Harman, E. Fox, R. Baeza-Yates, and W. Lee,
dia Databaee Design for Heterogeneous Distributed “Inverted Files,” In W. B. Frankes and R. Baeza-
Database Systems: 2nd Interim Report ,“ The New Yates, Ed., Information Retrieval: Data Structures
York State Center for Advanced Technology in and Algorithms, pp. 28-43, Prentice Hall, 1992.
Computer Applications and Software Engineering,
[15] 1S0 8879. Information Processing - Text and Ofice
Syracuse University, June 1995.
Systems - Standard Generalized Markup Language
[3] G. E. Blake, M. P. Consens, P. Kilpelainen, (SGML), International Organization for Standard-
P. -A. Larson, T. Snider, and F. W. Tompa, ization, 1986.
“Text/Relational Database Management ISystems:
[16] W. E. Kimber, “HyTime and SGML: Under-
Harmonizing SQL and SGML,” Proceedings of
standing the HyTime HYQ Query Language,”
the International Conference on Applications of
Available via anonymous ftp at ftp.ifi.uio.no
Databases, pp. 267-280, Vadstena, Sweden, June
/pub/SGML/HyTime, August 1993.
1994.
[17] Y, K, Lee, S. -J. Yoo, K. Yoon, and P. B.
[4] D. M. Choy, F. Barbie, R. H. Gueting, D. Ruland,
Berra, “Structured Document Management and
and R. Zicari, “Document Management and Han-
Handling,” Technical Report 9506, CASE Center,
dling; Proceedings of the IEEE ‘/?7 Ofjice Automa-
Syracuse University, June 1995.
tion Symposium, pp. 241-246, 1987.
[18] Y. K. Lee, S. -J, Yoo, K. Yoon, and P. B. Berra,
[5] V. Christophides, S. Abiteboul, S. Cluet, and
“Querying Structured Hyperdocuments,” Proceed-
M. Scholl, ‘From Structured Documents to Novel
ings of the $9th Hawaii International Conference
Query Facilities,” Proceedings of the 1994 ACM
on System Sciences, Maui, Hawaii, January 1996.
SIGikfOD Conference, pp. 313-324, Minneapolis,
Minnesota, May 1994. [19] Z. Lin and C. Faloutsos, pp. 28-43, “Frame-Sliced
Signature Files,” IEEE ‘lkansactions on Knowledge
[6] V. Christophides and A. Rizk, “Querying Struc-
and Data Engineering, vol. 4, no. 3, pp. 281-289,
tured Documents with Hypertext Links using
June 1992.
00DBMS,” Proceedings of the European Confer-
ence on Hypermedia Technology (ECHT ‘9./), PP. [20] I. A. Macleod, “Storage and Retrieval of Structured
186-197, Edinburgh, UK, September 1994. Documents,” Information Processing and Manage-
ment, vol. 26, no. 2, pp. 197-208, 1990.
[7] C. Clifton and H. Garcia-Molina, “Indexing a Hy-
pertext Database,” Proceedings of the 16th Inter- [21] I. A. Macleod, “A Query Language for Retriev-
national Conference on Very Large Data Bases, pp. ing Information from Hierarchic Text Structures,”
36-49, Brisbane, Australia, 1990. The Computer Journal, vol. 34, no. 3, pp. 254-264,
1991.
[8] Conner Periperials, Inc., “CP-30200 Specification
Summary,” 1995. [22] E. Oomoto and K. Tanaka, “OVID: Design and Im-
plementation of a Video-Object Database System,”
[9] T. H. Cormen, C. E. Leiserson, and R. L. Rivest,
IEEE Transactions on Knowledge and Data Engi-
Introduction to Algorithms, The MIT Press, 1990.
neering, vol. 5, no. 4, pp. 629-643, August 1993.
[10] C. Faloutsos, “Access Methods for Text ,“ ACM
[23] R. Sacks-Davis, T. Arnold-Moore, and J. Zobel,
Computing Surveys, vol. 17, no. 1, pp. 49-74,
“Database Systems for Structured Documents:
March 1985.
Proceedings of the International Symposium on Ad-
[11] C. Faloutsos, “Signature Files,” In W. B. IFrankes vanced Database Technologies and Their Integra-
and R. Baeza-Yates, Ed., Information Retm’eval: tion (ADTI ‘94), Nara, Japan, October 1994.
Data Structures and Algorithms, pp. 44-65, Pren-
[24] G. Salton and M. J. McGill, Introduction to Mod-
tice Hall, 1992.
ern Information Retrieval, McGraw-Hill, 1983.
[12] W. B. Frankes and R. Baeza-Yates, Ed., Ir$forma-
[25] G. Salton, Automatic Text Processing, Addison-
tion Retrieval: Data Structures and Algorithms,
Wesley, 1989.
Prentice Hall, 1992.
[26] R. Weiss, A. Duba, and D. K. Gifford, “Comp&
[13] C. F. Goldfarb, The SGML Handbook, Oxford Uni-
sition and Search with a Video Algebra,” IEEE
versity Press, Oxford, UK, 1990.
Multimedia, pp. 12-25, Spring 1995.
99