Table & Query Design For Hierarchical Data: Without CONNECT-BY - A Path Code Approach

Charles Yu
Database Architect
Elance Inc.
cyu@elance.com
charles.yu@acm.org
2005-08
Background
• Node-Uniform Hierarchical (NUH) data can be visualized as a tree or
forest in which every node has the same set of attributes.
• NUH data is naturally represented in an RDBMS by a recursive table:
if record x is a child of record y, then the value of x's parent_id
column equals the value of y's id column.
• Standard SQL does not support general queries on NUH data stored in
basic recursive tables.
• Oracle comes with a native mechanism for general queries on NUH data
and beyond, known as connect-by. For all its elegance and usefulness,
it falls short on two counts: it can be slow, because the parent-child
relationship cannot be directly indexed; and it is Oracle-dependent,
so SQL queries using connect-by cannot easily be adapted to the RDBMSs
of other vendors.
Basic Recursive Table Design
Basic Columns
• xid --system-assigned unique id
• parent_xid --xid of the parent of this entry
• entry_code --content-level unique identifier of the entry
• normal_stuff --one or more such columns for content values
A variant:
Use a separate table to store the hierarchical relationship,
consisting essentially of two columns, xid (or child_xid) and
parent_xid, and use a foreign key to link it to the main data table.
Basic Recursive Table Query Mechanisms
• Oracle-native connect-by
• K-way self outer join (for hierarchies up to k levels deep)
• Others?
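The k-way self outer join can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module rather than Oracle (an assumption made only so the example runs anywhere); the table name t and the sample rows are hypothetical, with k = 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        xid        INTEGER PRIMARY KEY,
        parent_xid INTEGER REFERENCES t(xid),
        entry_code TEXT NOT NULL UNIQUE
    )""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, None, "a"), (2, 1, "b2"), (3, 1, "b1"),
    (4, 2, "c3"), (5, 2, "c2"), (6, 2, "c1"),
])

# 3-way self left outer join: one table alias per level, so any node
# deeper than three levels would simply not appear in the result.
rows = conn.execute("""
    SELECT l0.entry_code, l1.entry_code, l2.entry_code
    FROM t l0
    LEFT OUTER JOIN t l1 ON l1.parent_xid = l0.xid
    LEFT OUTER JOIN t l2 ON l2.parent_xid = l1.xid
    WHERE l0.parent_xid IS NULL
    ORDER BY l1.entry_code, l2.entry_code
""").fetchall()

for row in rows:
    print(row)
```

Each extra level of depth needs another join and another alias, which is why this mechanism only suits hierarchies whose maximum depth is known in advance.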
Basic Idea of Path Code Approach
• A node of a tree is fully determined by the path from the root to
itself.
• A path code, as a full representation of that path, can be very
compact: its length is on the order of the logarithm of the total
size of the tree.
• Path codes can feasibly be maintained dynamically.
• Path codes permit direct indexing.
Path-code enhanced recursive table design
Basic Columns
• xid
• parent_xid
• path_code --code of the path to the node (details later)
• entry_level --level of the entry in the tree it belongs to
• sibling_no --sequence number of the entry among the children of its
parent
• is_leaf --1/0 for being a leaf/not a leaf
• entry_code --content-level unique identifier of the entry
• normal_stuff --one or more such columns for content values
Value Setting for H columns (I)
• parent_xid is set as usual.
• sibling_no can be set according to any ordering, e.g. by entry_code,
starting at 1 under each parent. The sibling_no of root entries is set
as if those roots were children of a super root.
• entry_level is set top-down: entry_level = 0 for all root entries,
and X.entry_level = k+1 if X has parent Y and Y.entry_level = k.
• is_leaf = 0 if the node has at least one child, 1 otherwise.
Value Setting for H columns (II)
• path_code
– for root entries X: X.path_code = to_char(X.sibling_no, 'FM00')
– for non-root entries X with X.parent_xid = Y.xid:
X.path_code = Y.path_code || to_char(X.sibling_no, 'FM00')
(The FM modifier suppresses the leading blank that Oracle's to_char
otherwise reserves for the sign.)
• The path_code of a node N at level k has k+1 sections; the level-j
section is the left-zero-padded string form of the sibling_no of N
itself (for j = k) or of N's ancestor at level j (for j < k).
• For convenience, the last (rightmost) section is called the base
section, and the concatenation of all the other sections is called the
ancestor section.
Example of H-value setting
Legend: EC for entry_code, EL for entry_level, XID for xid, PC for
path_code.
Format assumptions (the same hold for later code examples): node
uniform (see next), section length = 2, path_code expressed as a
string.

Org tree (a is the root; b1 and b2 are its children; c1, c2 and c3 are
children of b2, as their path codes show):
EC   EL  XID  PC
a    0   1    01
b1   1   3    0101
b2   1   2    0102
c1   2   6    010201
c2   2   5    010202
c3   2   4    010203

Explanation
• Path_code is in the node-uniform format.
• Path_code order follows entry_code order, not xid order. It could be
otherwise.
• The path_code of a child is the path_code of its parent plus its own
base section code.
• Sibling_no is not shown but is assumed to follow entry_code order.
• Entry_code and xid values can be set independently of each other.
• parent_xid, sibling_no, is_leaf and other fields are not shown.
Variants of path_code pattern (advanced topic)
• Node uniform: every section of every path_code has the same length
(the simplest variant, used in the previous example).
• Level uniform: all path_code sections at the same level have the
same length.
• Parent uniform: all children of the same parent have path_codes of
the same length.
• Dot (or delimiter) uniform: the same delimiter character (e.g. a
dot) separates all sections of all path_codes.
• Min uniform: the base section of each path_code is always kept at
minimum length.
• String/binary/hex encoding: the choice affects how path_codes are
expressed, interpreted and sorted.
• Sparse uniform: each path_code section allows more values than are
currently needed, to ease subsequent node insertions.
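To make two of these variants concrete, here is a small Python sketch comparing node-uniform and dot-delimited codes. The helper names are hypothetical, and the point about prefix matching and sort order is an observation about these two encodings, not something stated on the slides.

```python
def node_uniform(sibling_path, section_len=2):
    """Encode a list of sibling numbers with fixed-width, zero-padded sections."""
    return "".join(str(n).zfill(section_len) for n in sibling_path)

def dot_delimited(sibling_path):
    """Encode a list of sibling numbers with a dot between sections."""
    return ".".join(str(n) for n in sibling_path)

def in_subtree_uniform(code, ancestor_code):
    # With fixed-width sections a plain prefix test is enough.
    return code.startswith(ancestor_code)

def in_subtree_dotted(code, ancestor_code):
    # With delimiters the prefix must end on a section boundary,
    # otherwise "1.10" would wrongly count as being under "1.1".
    return code == ancestor_code or code.startswith(ancestor_code + ".")

print(node_uniform([1, 2, 3]), dot_delimited([1, 2, 3]))
print(in_subtree_dotted("1.10", "1.1"), "1.10".startswith("1.1"))
# Sort order also differs: as strings, "1.10" sorts before "1.2",
# while the node-uniform codes "0110" and "0102" sort numerically.
print(sorted(["1.2", "1.10"]), sorted(["0110", "0102"]))
```

Fixed-width sections keep both prefix matching and ordering simple, at the cost of a hard limit on the number of siblings per section.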
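Query Patterns
(P.path_code, C.path_code and N.parent_xid stand for the known column
values of the given node.)
• Get all descendants of a node P (the pattern also matches P itself)
select * from T where path_code like P.path_code||'%'
• Get all ancestors of a node C (the pattern also matches C itself)
select * from T where C.path_code like path_code||'%'
• Get all siblings of a node N (N itself included)
select * from T where parent_xid = N.parent_xid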
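The three query patterns can be tried on the example tree with the sketch below. It uses Python's sqlite3 module as a stand-in for Oracle (an assumption for runnability), passes the node's values as bind parameters, and adds a hypothetical extra condition that excludes the node itself from each result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        xid        INTEGER PRIMARY KEY,
        parent_xid INTEGER,
        path_code  TEXT NOT NULL UNIQUE,
        entry_code TEXT NOT NULL
    )""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, None, "01", "a"),
    (2, 1, "0102", "b2"), (3, 1, "0101", "b1"),
    (4, 2, "010203", "c3"), (5, 2, "010202", "c2"), (6, 2, "010201", "c1"),
])

def codes(sql, *params):
    """Run a query and return the first column of every row."""
    return [r[0] for r in conn.execute(sql, params)]

b2_path, c1_path = "0102", "010201"

# Descendants of b2: rows whose path_code starts with b2's path_code.
descendants = codes(
    "SELECT entry_code FROM t WHERE path_code LIKE ? || '%' "
    "AND path_code <> ? ORDER BY path_code", b2_path, b2_path)

# Ancestors of c1: rows whose path_code is a prefix of c1's path_code.
ancestors = codes(
    "SELECT entry_code FROM t WHERE ? LIKE path_code || '%' "
    "AND path_code <> ? ORDER BY path_code", c1_path, c1_path)

# Siblings of c1: rows sharing c1's parent_xid.
siblings = codes(
    "SELECT entry_code FROM t WHERE parent_xid = "
    "(SELECT parent_xid FROM t WHERE path_code = ?) "
    "AND path_code <> ? ORDER BY path_code", c1_path, c1_path)

print(descendants, ancestors, siblings)
```

Ordering by path_code returns each result in tree order, which is one practical benefit of the fixed-width encoding.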
DML Patterns (insert at end)
(c is the new child entry, p its parent.)
• (Insert record with path_code and sibling_no as null)
insert into T(xid, parent_xid, entry_level, entry_code, normal_stuff)
values (c.xid, p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff);
• (Update sibling_no; nvl covers the case where c is the first child)
update T set sibling_no = (select nvl(max(sibling_no), 0) + 1 from T
where parent_xid = p.xid) where xid = c.xid;
• (Update path_code)
update T set path_code = p.path_code || to_char(sibling_no, 'FM00')
where xid = c.xid;
• (Reset is_leaf for p; details omitted)
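The insert-at-end steps can be sketched end to end as follows. This is an illustration only, using Python's sqlite3 module in place of Oracle, SQLite's COALESCE and printf in place of nvl and to_char, and a hypothetical function name insert_at_end; the is_leaf handling shown is one plausible reading of the omitted detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        xid INTEGER PRIMARY KEY, parent_xid INTEGER, path_code TEXT,
        entry_level INTEGER, sibling_no INTEGER, is_leaf INTEGER,
        entry_code TEXT
    )""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, None, "01", 0, 1, 0, "a"),
    (3, 1, "0101", 1, 1, 1, "b1"),
    (2, 1, "0102", 1, 2, 0, "b2"),
    (6, 2, "010201", 2, 1, 1, "c1"),
    (5, 2, "010202", 2, 2, 1, "c2"),
    (4, 2, "010203", 2, 3, 1, "c3"),
])

def insert_at_end(conn, xid, parent_xid, entry_code):
    """Append a new last child under parent_xid (section length 2 assumed)."""
    with conn:  # run all steps in one transaction
        p_level, p_path = conn.execute(
            "SELECT entry_level, path_code FROM t WHERE xid = ?",
            (parent_xid,)).fetchone()
        # Step 1: insert with sibling_no and path_code still null.
        conn.execute(
            "INSERT INTO t (xid, parent_xid, entry_level, is_leaf, entry_code) "
            "VALUES (?, ?, ?, 1, ?)",
            (xid, parent_xid, p_level + 1, entry_code))
        # Step 2: next sibling_no; COALESCE handles a first child.
        conn.execute(
            "UPDATE t SET sibling_no = (SELECT COALESCE(MAX(sibling_no), 0) + 1 "
            "FROM t WHERE parent_xid = ?) WHERE xid = ?",
            (parent_xid, xid))
        # Step 3: path_code = parent's path_code plus the new base section.
        conn.execute(
            "UPDATE t SET path_code = ? || printf('%02d', sibling_no) "
            "WHERE xid = ?", (p_path, xid))
        # Step 4: the parent now has at least one child.
        conn.execute("UPDATE t SET is_leaf = 0 WHERE xid = ?", (parent_xid,))

insert_at_end(conn, 7, 2, "c4")   # fourth child of b2
insert_at_end(conn, 8, 3, "d1")   # first child of b1
for row in conn.execute("SELECT entry_code, sibling_no, path_code FROM t ORDER BY path_code"):
    print(row)
```

No existing row changes its path_code here, which is why appending at the end is the cheapest insert pattern.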

DML Patterns (insert in middle)
• (Insert record with path_code and sibling_no as null)
insert into T(xid, parent_xid, entry_level, entry_code, normal_stuff)
values (c.xid, p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff);
• (Update sibling_no for the siblings that sort after c)
update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and
entry_code > c.entry_code;
• (Update sibling_no for c)
update T set sibling_no = (select nvl(max(sibling_no), 0) + 1 from T
where parent_xid = p.xid and entry_code < c.entry_code)
where xid = c.xid;
• (Update path_code for c)
update T set path_code = p.path_code || to_char(sibling_no, 'FM00')
where xid = c.xid;
• (Update path_code for the siblings that sort after c and for all
descendants of those siblings. pcs_length stands for path_code section
length. The section at c's level is incremented in place, because the
sibling_no of a descendant row does not describe that section.)
update T set path_code = substr(path_code, 1, pcs_length*c.entry_level) ||
to_char(to_number(substr(path_code, pcs_length*c.entry_level + 1,
pcs_length)) + 1, 'FM00') ||
substr(path_code, pcs_length*(c.entry_level + 1) + 1)
where path_code like p.path_code||'%' and xid <> c.xid
and path_code >= (select path_code from T where xid = c.xid);
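The insert-in-middle steps can be exercised with the following sketch. It assumes Python's sqlite3 module instead of Oracle, a section length of 2, a hypothetical function name insert_in_middle, and a sample tree in which c3 has a child d1 so that the descendant shift is visible. The shift increments the section at the new node's level in place rather than rebuilding it from sibling_no.

```python
import sqlite3

SECTION_LEN = 2  # pcs_length on the slide

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        xid INTEGER PRIMARY KEY, parent_xid INTEGER, path_code TEXT,
        entry_level INTEGER, sibling_no INTEGER, is_leaf INTEGER,
        entry_code TEXT
    )""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, None, "01", 0, 1, 0, "a"),
    (3, 1, "0101", 1, 1, 1, "b1"),
    (2, 1, "0102", 1, 2, 0, "b2"),
    (6, 2, "010201", 2, 1, 1, "c1"),
    (5, 2, "010202", 2, 2, 1, "c2"),
    (4, 2, "010203", 2, 3, 0, "c3"),
    (9, 4, "01020301", 3, 1, 1, "d1"),
])

def insert_in_middle(conn, xid, parent_xid, entry_code):
    """Insert a child at its entry_code position and shift later siblings."""
    with conn:  # one transaction for all steps
        p_level, p_path = conn.execute(
            "SELECT entry_level, path_code FROM t WHERE xid = ?",
            (parent_xid,)).fetchone()
        pre = SECTION_LEN * (p_level + 1)  # characters before the new node's section
        conn.execute(
            "INSERT INTO t (xid, parent_xid, entry_level, is_leaf, entry_code) "
            "VALUES (?, ?, ?, 1, ?)", (xid, parent_xid, p_level + 1, entry_code))
        # Later siblings move up by one position.
        conn.execute(
            "UPDATE t SET sibling_no = sibling_no + 1 "
            "WHERE parent_xid = ? AND entry_code > ?", (parent_xid, entry_code))
        # The new node takes the slot right after the earlier siblings.
        conn.execute(
            "UPDATE t SET sibling_no = (SELECT COALESCE(MAX(sibling_no), 0) + 1 "
            "FROM t WHERE parent_xid = ? AND entry_code < ?) WHERE xid = ?",
            (parent_xid, entry_code, xid))
        conn.execute(
            "UPDATE t SET path_code = ? || printf('%02d', sibling_no) "
            "WHERE xid = ?", (p_path, xid))
        (sib,) = conn.execute(
            "SELECT sibling_no FROM t WHERE xid = ?", (xid,)).fetchone()
        # Shift the section at the new node's level for later siblings
        # and everything below them; the new node itself is skipped.
        conn.execute("""
            UPDATE t
            SET path_code = substr(path_code, 1, :pre)
                || printf('%02d', CAST(substr(path_code, :pre + 1, :len) AS INTEGER) + 1)
                || substr(path_code, :pre + :len + 1)
            WHERE path_code LIKE :p_path || '%'
              AND xid <> :xid
              AND CAST(substr(path_code, :pre + 1, :len) AS INTEGER) >= :sib
        """, {"pre": pre, "len": SECTION_LEN, "p_path": p_path, "xid": xid, "sib": sib})
        conn.execute("UPDATE t SET is_leaf = 0 WHERE xid = ?", (parent_xid,))

insert_in_middle(conn, 7, 2, "c2a")  # sorts between c2 and c3
paths = dict(conn.execute("SELECT entry_code, path_code FROM t").fetchall())
print(paths)
```

In a table with a unique index on path_code, the shift would have to be ordered or deferred so that intermediate values do not collide; the sketch avoids the issue by declaring no such constraint.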
DML patterns (delete)
(P is the parent of C; C.path_code, C.sibling_no and C.entry_level are
the values C had before the delete.)
• (Delete node C and all its descendants)
delete from T where path_code like C.path_code||'%';
• (Shift sibling_no for the siblings that sort after C)
update T set sibling_no = sibling_no - 1 where parent_xid = P.xid
and sibling_no > C.sibling_no;
• (Shift path_code for the siblings that sort after C and their
descendants; the section at C's level is decremented in place)
update T set path_code = substr(path_code, 1, pcs_length*C.entry_level) ||
to_char(to_number(substr(path_code, pcs_length*C.entry_level + 1,
pcs_length)) - 1, 'FM00') ||
substr(path_code, pcs_length*(C.entry_level + 1) + 1)
where path_code like P.path_code||'%' and path_code > C.path_code;
• (Reset is_leaf of the parent P of C according to whether P has other
children)
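The delete steps can be sketched in the same way. The example assumes Python's sqlite3 module instead of Oracle, a section length of 2, a non-root node C, and a hypothetical function name delete_subtree; the is_leaf reset shown is one straightforward reading of the last step.

```python
import sqlite3

SECTION_LEN = 2  # pcs_length on the slide

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (
        xid INTEGER PRIMARY KEY, parent_xid INTEGER, path_code TEXT,
        entry_level INTEGER, sibling_no INTEGER, is_leaf INTEGER,
        entry_code TEXT
    )""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, None, "01", 0, 1, 0, "a"),
    (3, 1, "0101", 1, 1, 1, "b1"),
    (2, 1, "0102", 1, 2, 0, "b2"),
    (6, 2, "010201", 2, 1, 1, "c1"),
    (5, 2, "010202", 2, 2, 1, "c2"),
    (4, 2, "010203", 2, 3, 0, "c3"),
    (9, 4, "01020301", 3, 1, 1, "d1"),
])

def delete_subtree(conn, xid):
    """Delete a non-root node with its descendants and close the gap."""
    with conn:  # one transaction for all steps
        parent_xid, level, sib, path = conn.execute(
            "SELECT parent_xid, entry_level, sibling_no, path_code "
            "FROM t WHERE xid = ?", (xid,)).fetchone()
        (p_path,) = conn.execute(
            "SELECT path_code FROM t WHERE xid = ?", (parent_xid,)).fetchone()
        pre = SECTION_LEN * level  # characters before the deleted node's section
        # Remove the node and its whole sub-tree.
        conn.execute("DELETE FROM t WHERE path_code LIKE ? || '%'", (path,))
        # Later siblings move down by one position.
        conn.execute(
            "UPDATE t SET sibling_no = sibling_no - 1 "
            "WHERE parent_xid = ? AND sibling_no > ?", (parent_xid, sib))
        # Decrement the section at the deleted node's level for later
        # siblings and all of their descendants.
        conn.execute("""
            UPDATE t
            SET path_code = substr(path_code, 1, :pre)
                || printf('%02d', CAST(substr(path_code, :pre + 1, :len) AS INTEGER) - 1)
                || substr(path_code, :pre + :len + 1)
            WHERE path_code LIKE :p_path || '%' AND path_code > :c_path
        """, {"pre": pre, "len": SECTION_LEN, "p_path": p_path, "c_path": path})
        # The parent becomes a leaf only if no child is left.
        conn.execute(
            "UPDATE t SET is_leaf = NOT EXISTS "
            "(SELECT 1 FROM t AS k WHERE k.parent_xid = :p) WHERE xid = :p",
            {"p": parent_xid})

delete_subtree(conn, 5)  # delete c2; c3 and d1 shift down
delete_subtree(conn, 9)  # delete d1; c3 becomes a leaf again
paths = dict(conn.execute("SELECT entry_code, path_code FROM t").fetchall())
print(paths)
```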
Complexity Analysis
• Space
– The length of path_code grows logarithmically with the total number
of rows in the table (for non-degenerate hierarchical data).
– E.g. a length of about c*20 for 1M rows, for a small constant c
(log2 of 1M is about 20).
• Time
– Queries are largely served by index range scans, usually the fastest
access path available.
– Inserts and deletes may involve updating or deleting a whole
sub-tree.
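The space estimate can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation in Python, assuming a single complete tree with a fixed fanout; the function name path_code_length and the sample fanouts are illustrative choices, not figures from the slides.

```python
def path_code_length(n_rows, fanout, section_len):
    """Longest path_code in a complete tree with the given fanout
    that holds n_rows nodes, using fixed-width sections."""
    levels, capacity, level_size = 0, 0, 1
    while capacity < n_rows:
        capacity += level_size   # nodes available with one more level
        level_size *= fanout
        levels += 1
    return levels * section_len

# Binary tree, one character per section: about log2(n) characters.
print(path_code_length(1_000_000, fanout=2, section_len=1))
# Fanout 10 with two-character sections.
print(path_code_length(1_000_000, fanout=10, section_len=2))
# Degenerate chain (fanout 1): the length grows linearly with the rows.
print(path_code_length(5, fanout=1, section_len=2))
```

The last call shows why the logarithmic bound is qualified with "non-degenerate": in a chain every node adds a full section to the path.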
Comparison with Connect-By
Category                    Connect-By                                Path code approach
Data pattern                General directed graph                    Strictly hierarchical
Table design                Recursive basic                           Recursive basic + extra H-columns
Disk space cost             Minimum                                   Overhead of up to log n
Time efficiency*            Incapable of direct indexing, or unknown  Capable of direct indexing
RDBMS vendor independence*  No                                        Yes
Query complexity*           It depends                                It depends
DML complexity              Minimum                                   Substantial, but in a small set of patterns
A stretched idea on RDBMS design
• Make the entry_id and parent_entry_id relationship declarative;
• Enforce a hierarchy constraint to the effect that each node has
either zero or one parent node;
• Have the RDBMS create and maintain path_code, entry_level, etc.,
much as it creates and maintains function-based indexes;
• Add syntax to SQL similar to Oracle's connect-by, with the added
benefit of taking advantage of the hidden indexes.
Additional Questions and References
• Whether and how to generalize the design for node non-uniform
hierarchical data?
• For recent alternative approaches, see e.g.
http://www.inconcept.com/JCM/May2005/David.html (Using ANSI SQL as a
Conceptual Hierarchical Data Modeling and Processing Language for XML,
by Michael M David)
