You are on page 1of 18

Table & Query Design for

Hierarchical Data
without CONNECT-BY
-- A Path Code Approach
Charles Yu
Database Architect
Elance Inc.
cyu@elance.com
charles.yu@acm.org

2005-08
Background
 Node-Uniform Hierarchical (NUH for short) data can be
visualized as a tree or forest graph where every node has
the same set of attributes.

 NUH data can be naturally represented in RDBMS by


recursive tables where the parent-child relationship is
implemented in a way that if record x is a child of record y,
then the value of x’s parent_id column is the same as the
value of the id column of y’s.

 Standard SQL does not support for general query on NUH


data in basic recursive tables.

 Oracle comes with a native mechanism for general query on


NUH data and beyond, known as connect-by. For all its
elegancy and usefulness, it is short in two accounts: it is
slow in cases due to the fact that the parent-child
relationship cannot be directly indexed; and it is Oracle
dependent in that SQL queries using connect-by cannot be
easily adapted to RDBMS of other vendors.
Basic Recursive Table Design
Basic Columns
 Xid --system assigned unique id
 Parent_xid --xid of parent of this entry
 Entry_code --content unique identifier of the entry
 Normal_stuff --one or more such columns for content
values

Some variant:
Use a separate table to store the hierarchical relationship
consisting essentially of two columns: xid/child_xid and
parent_xid; and use FK to link the table to the main data
table
Basic Recursive Table Query
Mechanisms
 Oracle-native Connect by
 K-way self outer join (for up to level
k depth)
 Other??
Basic Idea of Path Code Approach
A node of a tree is fully determined
by the path from the root to itself.
 Path code as full representation of
the path can be very compact in
length, in the order of logarithmic of
total size of the tree.
 Path Code can be maintained
dynamically feasibly.
 Path code permits direct indexing.
Path-code enhanced recursive
table design
Basic Columns

 xid
 parent_xid
 path_code --code of the path for the node (detail
later)
 entry_level --level of the record in the tree the entry
belongs to
 sibling_no --sequence no of the child entry_code
with respect to the parent
 is_leaf --1/0 for being a leaf/not a leaf
 Entry_code --content unique identifier of the entry
 normal_stuff --one or more
Value Setting for H columns (I)
 Parent_xid set as usual
 Sibling_no can be set according to any ordering,
e.g. according to entry_code, starting at 1 for
each parent; the sibling_no of root entries are set
as if those roots were children of a super root;
 Entry_level can be set from top down, having
entry_level=0 for all root entries; and
X.entry_level=k+1 if X has parent Y and
Y.entry_level=k;
 Is_leaf =0/1 if there is child of the node/not so
Value Setting for H columns (II)
 Path_code
– for root entries X: X.path_code = to_char(X.sibling_no,
‘00’)
– for non-root entries X with X.parent_xid=Y.xid:
X.path_code = Y.path_code||to_char(X.sibling_no,’00’)

Path_code of a node N at level k has k+1 sections;


level j section is left-zero padded string conversion of
sibling_no of N or N’s parent at level j;
For convenience, the last (rightmost) section is called the
base section, the concatenation of all the non-last
sections is called the ancestor section.
Example of H-value setting
ORG tree LOGO format assumption
EC for entry_code Node uniform (see next)
EL for entry_level Section length =2
EC=a
EL=0 XID for xid String expression in format
XID=1 PC for path_code (Same assumption for
PC=01 later code examples)

Explanation
b2 b1 •Path_code is in the uniform format
1 1
2 3 •Path_code order is based on entry_code order but not
0102 0101 on XID order. It could be otherwise.
•Path_code of a child is the path_code of its parent plus
its base section code.
c3 c2 c1
2 2 2 •Sibling_no is not shown but assumed to be in
4 5 6 accordance with entry_code.
010203 010202 010201
•Entry_code and xid value settings can be independent
of each other.
•parent_xid, sibling_no, is_leaf and other fields are not
shown.
Variants of path_code pattern
(advanced topic)
 node uniform: every section of all path codes has equal
length (a simplest; and it is used in the previous example)
 Level uniform: every section of the same level of all
path_codes has equal length
 Parent uniform: every child node of any parent node has
equal path_code length
 Dot (or delimiter) uniform: use the same delimiter
character (e.g. dot) to separate all sections of all
path_codes
 Min uniform: the length of base section of the path_code is
always maintained to be minimum
 String/Binary/hex/ in expression and interpretation, sorting
relevant, etc.
 Sparse uniform: path_code sections each allows more
values than actually and currently needed, for easing
subsequent node insertions.
Query Patterns
 Get all children of a parent P
select * from T where path_code like
P.path_code||’%’
 Get all ancestors if a child C
select * from T where C.path_code like
path_code||’%’
 Get all siblings of a node N
select * from T where parent_xid =
N.parent_xid
DML Patterns (insert at end)
 (Insert record with path_code and sibling_no as null)
insert into T(xid,parent_xid,entry_level, entry_code,
normal_stuff) values c.xid,p.xid, p.entry_level + 1,
c.entry_code, c.normal_stuff;

 (Update sibling_no)
update T set sibling_no = (select max(sibling_no)+1 from T
where parent_xid = p.xid) where xid = c.xid;

 (Update path_code)
update T set path_code = p.path_code ||
to_char(sibling_no, '00') where xid = c.xid;

 (reset is_leaf for p, detail omit)


DML Patterns (insert in middle)
 (Insert record with path_code and sibling_no as null)
insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values
c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff;

 (Update sibling_no for those siblings elder than c)


update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and
entry_code >c.entry_code;

 (Update sibling_no for c)


update T set sibling_no = (select max(sibling_no)+1 from T where
parent_xid = p.xid and entry_code < c.entry_code;

 (Update path_code for c)


update T set path_code = p.path_code || to_char(sibling_no, '00') where
xid = c.xid;

 (Update path_code for those siblings elder than c and all decendents of
those elder siblings, pcs_length stands for path_code section length)
update T set path_code = substr(path_code,1, pcs_length*entry_level) ||
to_char(sibling_no, '00')||substr(path_code,
pcs_length*(entry_level+1)+1) where path_code like p.path_code||’%’
and path_code > (select path_code from T where xid = c.xid)
DML patterns (delete)
 (Delete node C and all its decendents)
delete from T where path_code like C.path_code||’%’;

 (Shifting sibling_no for those siblings elder than C)


Update T set sibling_no = sibling_no -1 where parent_xid =
P.parent_id and sibling_no >C.sibling_no;

 (Shifting path_code for those siblings elder than C


and their decendents)
Update T set path_code = substr(path_code,1,
pcs_length*C.entry_level) || to_char(sibling_no,
'00')||substr(path_code, pcs_length*(C.entry_level+1)+1)
where path_code like P.path_code||’%’ and path_code >
C.path_code;
 (reset is_leaf of the parent P of C according to
whether P has other children)
Complexity Analysis
 Space
– length of path_code increases in order
logarithmic of the total number of rows in the
table (for non-degenerated hierarchical data).
– e.g. length of c*20 vs 1M rows
 Time
– queries execute very much based on index
range scan, usually the fastest available.
– Inserts/deletes may involve sub-tree
processing for delete or update.
Comparison with Connect-By
category Connect-By Path code approach

Data pattern General directed Strictly hierarchical


graph
Table design Recursive basic Recursive basic +
extra H-columns
Disk space cost Minimum Overhead of up to
\log n
Time Efficiency* Incapable of direct Capable of direct
indexing or unknown indexing
RDBMS Vendor No Yes
independence*
Query complexity* It depends It depends

DML complexity Minimum Substantial but in a


small set of patterns
A stretched idea on RDBMS design
 Make entry_id and parent_entry_id
relationship declarative;
 Enforce hierarchy constraint to the effect
that each node can only have zero or one
parent node;
 Create and maintain path_code,
entry_level, etc by RDBMS like creating
and maintaining functional indexes;
 Add syntax to SQL similar to Oracle’s
connect-by, but with the extra of taking
advantage of the hidden indexes.
Additional Questions and
References

 Whether and how to generalize the design


for node non-uniform hierarchical data?

 To see latest alternative approaches, e.g.


http://www.inconcept.com/JCM/May2005/
David.html (Using ANSI SQL as a Conceptual
Hierarchical Data Modeling and Processing Language for
XML, by Michael M David)

You might also like