
Foundations of XML Types: Tree Automata

Pierre Genevès
CNRS
Ecole Polytechnique Fédérale de Lausanne
April, 2008

Why Tree Automata?

• Foundations of XML type languages (DTD, XML Schema, Relax NG...)
• Provide a general framework for XML type languages
• A tool to define regular tree languages with an operational semantics
• Provide algorithms for efficient validation
• Basic tool for static analysis (proofs, decision procedures in logic)

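Prelude: Word Automata

[Figure: word automaton with two states, even (start) and odd; each state loops on b, and a switches between even and odd]

a: even → odd
a: odd → even
...

From Words to Trees: Binary Trees

Binary trees with an even number of a’s

[Figure: binary tree over a and b in which every node is annotated with the state even or odd]

How to write transitions?
a: (even, odd) → even
a: (even, even) → odd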
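The word automaton of the prelude can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the table name `DELTA`, the function `accepts_word`, and the choice of `even` as the accepting state are assumptions.

```python
# Hypothetical encoding of the two-state word automaton:
# DELTA[(state, letter)] gives the next state.
DELTA = {
    ("even", "a"): "odd",
    ("odd", "a"): "even",
    ("even", "b"): "even",
    ("odd", "b"): "odd",
}


def accepts_word(word: str) -> bool:
    """Return True iff `word` over {a, b} contains an even number of a's."""
    state = "even"  # start state
    for letter in word:
        state = DELTA[(state, letter)]
    return state == "even"  # assumption: "even" is the accepting state
```

A word is read left to right with one current state. A tree has no single reading order, which is why the tree transitions take a tuple of child states on the left-hand side.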
Ranked Trees?

They come from parse trees of data (or programs)...

A function call f(a, b) is a ranked tree
f(g(a, b, c), h(i)) is a ranked tree

[Figure: the tree f with children a, b; the tree f with children g, h, where g has children a, b, c and h has the child i]

Ranked Alphabet

A ranked alphabet symbol is:
• a formalisation of a function call
• a symbol a with an integer arity(a)
• arity(a) indicates the number of children of a

a(k) : symbol a with arity(a) = k
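One possible way to represent ranked trees in code is sketched below, using the term f(g(a, b, c), h(i)) from the slide. The `Tree` class, the `ARITY` table, and `is_well_ranked` are illustrative names, not part of the original material.

```python
from dataclasses import dataclass

# Arity table for the symbols of the term f(g(a, b, c), h(i)).
ARITY = {"f": 2, "g": 3, "h": 1, "a": 0, "b": 0, "c": 0, "i": 0}


@dataclass(frozen=True)
class Tree:
    symbol: str
    children: tuple = ()


def is_well_ranked(tree: Tree, arity: dict) -> bool:
    """Check that every node has exactly arity(symbol) children."""
    if arity.get(tree.symbol) != len(tree.children):
        return False
    return all(is_well_ranked(child, arity) for child in tree.children)


# The term f(g(a, b, c), h(i)).
TERM = Tree("f", (
    Tree("g", (Tree("a"), Tree("b"), Tree("c"))),
    Tree("h", (Tree("i"),)),
))
```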

Example

Alphabet: {a(2), b(2), c(3), #(0)}

Possible tree:
[Figure: a tree over this alphabet in which every a and b node has two children, every c node has three children, and every # is a leaf]

Ranked Tree Automata

A ranked tree automaton A consists of:

Alphabet(A): finite alphabet of symbols
States(A): finite set of states
Rules(A): finite set of transition rules
Final(A): finite set of final states (⊆ States(A))

where:
Rules(A) are of the form a(k): (q1, ..., qk) → q (the symbol a(k) labels the arrow)
if k = 0, we will write a(0): ε → q
How do they work?

Example: Boolean Expressions

[Figure: run of the automaton on a tree; each node is annotated with its state]
                ∧ q1
        ∨ q1            ∧ q1
    ∧ q0    ∨ q1    ∨ q1    ∧ q1
  0 q0 1 q1 1 q1 0 q0 0 q0 1 q1 1 q1 1 q1

Principle
• Alphabet(A) = {∧, ∨, 0, 1}
• States(A) = {q0, q1}
• 1 accepting state at the root: Final(A) = {q1}

Rules(A):
0: ε → q0               1: ε → q1
∧: (q1, q1) → q1        ∨: (q0, q1) → q1
∧: (q0, q1) → q0        ∨: (q1, q0) → q1
∧: (q1, q0) → q0        ∨: (q1, q1) → q1
∧: (q0, q0) → q0        ∨: (q0, q0) → q0

Terminology

• Language(A) : set of trees accepted by A
• For a tree automaton A, Language(A) is a regular tree language
[Thatcher and Wright, 1968, Doner, 1970]
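The bottom-up run on the Boolean expression example can be sketched as follows. Trees are encoded as nested tuples `(symbol, child_1, ..., child_k)`; this encoding and the names `RULES`, `run`, `accepts`, and `EXAMPLE` are assumptions made for the illustration.

```python
# Rules of the Boolean-expression automaton: (symbol, child states) -> state.
RULES = {
    ("0", ()): "q0",
    ("1", ()): "q1",
    ("∧", ("q1", "q1")): "q1",
    ("∧", ("q0", "q1")): "q0",
    ("∧", ("q1", "q0")): "q0",
    ("∧", ("q0", "q0")): "q0",
    ("∨", ("q0", "q1")): "q1",
    ("∨", ("q1", "q0")): "q1",
    ("∨", ("q1", "q1")): "q1",
    ("∨", ("q0", "q0")): "q0",
}
FINAL = {"q1"}


def run(tree):
    """Return the state reached at the root, or None if no rule applies."""
    symbol, *children = tree
    child_states = tuple(run(child) for child in children)
    return RULES.get((symbol, child_states))


def accepts(tree):
    """A tree is accepted iff the state at its root is final."""
    return run(tree) in FINAL


# The tree from the slide: (∧(0,1) ∨ ∨(1,0)) ∧ (∨(0,1) ∧ ∧(1,1)).
ZERO, ONE = ("0",), ("1",)
EXAMPLE = (
    "∧",
    ("∨", ("∧", ZERO, ONE), ("∨", ONE, ZERO)),
    ("∧", ("∨", ZERO, ONE), ("∧", ONE, ONE)),
)
```

The state of a node depends only on its symbol and on the states of its children, so the states are computed from the leaves up to the root.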

Example

Tree automaton A over {a(2), b(2), #(0)} which recognizes trees with an even
number of a’s

Alphabet(A) : {a, b, #}
States(A) : {even, odd}
Final(A) : {even}
Rules(A) :
a: (even, even) → odd       b: (even, even) → even
a: (even, odd) → even       b: (even, odd) → odd
a: (odd, even) → even       b: (odd, even) → odd
a: (odd, odd) → odd         b: (odd, odd) → even
#: ε → even

Outline

• Can we implement a tree automaton efficiently? (notion of determinism)
• Are tree automata closed under set-theoretic operations?
• Can we check type inclusion?
• Nice theory. But... what should I do with my unranked XML trees?
• Can we apply this for XSLT type-checking?
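The nine rules of the "even number of a's" automaton follow one pattern: a b node keeps the combined parity of its children, and an a node flips it. The sketch below generates the rule table from that observation and cross-checks it against a direct count. All names (`build_rules`, `run`, `count_a`) and the nested-tuple tree encoding are assumptions.

```python
from itertools import product

STATES = ("even", "odd")


def flip(state):
    return "odd" if state == "even" else "even"


def build_rules():
    """Build the rule table: (symbol, child states) -> state."""
    rules = {("#", ()): "even"}
    for left, right in product(STATES, repeat=2):
        # Parity of the number of a's below the node.
        below = "even" if left == right else "odd"
        rules[("b", (left, right))] = below
        rules[("a", (left, right))] = flip(below)
    return rules


RULES = build_rules()
FINAL = {"even"}


def run(tree):
    """Bottom-up run on a tree given as (symbol, child_1, ..., child_k)."""
    symbol, *children = tree
    return RULES[(symbol, tuple(run(child) for child in children))]


def count_a(tree):
    """Count the a-labelled nodes directly, for cross-checking."""
    symbol, *children = tree
    return (symbol == "a") + sum(count_a(child) for child in children)
```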
Deterministic Tree Automata

Deterministic
does not have two rules of the form:
a(k): (q1, ..., qk) → q
a(k): (q1, ..., qk) → q′
for two different states q and q′

Intuition
At most one possible transition at a given node → implementation...

Can we Make a Tree Automaton Deterministic?

Theorem (determinisation)
From a given non-deterministic (bottom-up) tree automaton we can build a
deterministic tree automaton

Corollary
Non-deterministic and deterministic (bottom-up) tree automata recognize the
same languages.

Complexity
EXPTIME (|States(Adet)| = 2^|States(A)|)
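The determinisation theorem is usually proved with a subset construction: each state of the deterministic automaton is a set of states of the original one. The sketch below builds only the reachable subsets. The slides defer the details to [Comon et al., 1997], so this is an illustration under assumed data structures (`arity`, `rules`, `final`), not the construction from the lecture.

```python
from itertools import product


def determinise(arity, rules, final):
    """Subset construction for a bottom-up tree automaton.

    arity: symbol -> number of children
    rules: (symbol, (q1, ..., qk)) -> set of target states
    final: set of final states
    Returns (det_states, det_rules, det_final); states are frozensets.
    """
    det_states = set()
    det_rules = {}
    changed = True
    while changed:
        changed = False
        for symbol, k in arity.items():
            # Snapshot the current states: the set grows inside the loop.
            for subsets in product(tuple(det_states), repeat=k):
                if (symbol, subsets) in det_rules:
                    continue
                target = frozenset(
                    q
                    for combo in product(*subsets)
                    for q in rules.get((symbol, combo), ())
                )
                det_rules[(symbol, subsets)] = target
                det_states.add(target)
                changed = True
    det_final = {s for s in det_states if s & final}
    return det_states, det_rules, det_final


# Hypothetical non-deterministic automaton over c (arity 0), b (arity 1), a (arity 2).
ARITY = {"c": 0, "b": 1, "a": 2}
RULES = {
    ("c", ()): {"q"},
    ("b", ("q",)): {"q", "qb"},  # two possible targets: non-determinism
    ("b", ("qb",)): {"qf"},
    ("a", ("q", "q")): {"q"},
}
DET_STATES, DET_RULES, DET_FINAL = determinise(ARITY, RULES, {"qf"})
```

Only three subsets are reachable in this example, but in the worst case the construction produces exponentially many, which is the source of the EXPTIME bound.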

Implementing Validation

Membership Checking
Given a tree automaton A and a tree t, is t ∈ Language(A)?

Remark
We can implement even if A is non-deterministic...

Example
Automaton with Final(A) = {qf} and:
c: ε → q        b: q → qb       b: q → q
b: qb → qf      a: (q, q) → q

[Figure: a tree in which every node is annotated with the set of states it can reach]
b {q, qb, qf}
b {q, qb, qf}
b {q, qb}
a {q}
b {q, qb}    b {q, qb}
c {q}        c {q}

Membership-Checking is in PTIME (time linear in the size of the tree)

Set-Theoretic Operations

Recall
• We have seen that neither local tree grammars nor single-type tree
grammars are closed under boolean operations (e.g. union)
• What about tree automata?
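Membership for a non-deterministic automaton can be checked without determinising first: compute, for each node, the set of all states it can reach. The sketch below uses the example automaton with Final(A) = {qf}. The nested-tuple tree encoding and the function names are assumptions.

```python
from itertools import product

# Rules of the example automaton: (symbol, child states) -> set of targets.
RULES = {
    ("c", ()): {"q"},
    ("b", ("q",)): {"q", "qb"},
    ("b", ("qb",)): {"qf"},
    ("a", ("q", "q")): {"q"},
}
FINAL = {"qf"}


def reachable_states(tree):
    """Return the set of states the automaton can reach at the root of `tree`."""
    symbol, *children = tree
    child_sets = [reachable_states(child) for child in children]
    return {
        q
        for combo in product(*child_sets)
        for q in RULES.get((symbol, combo), ())
    }


def member(tree):
    """t is in Language(A) iff some reachable state at the root is final."""
    return bool(reachable_states(tree) & FINAL)


# The tree b(b(b(a(b(c), b(c))))) from the example.
C = ("c",)
EXAMPLE = ("b", ("b", ("b", ("a", ("b", C), ("b", C)))))
```

Each node is visited once, so for a fixed automaton the running time is linear in the size of the tree. The cost per node depends on the number of states and rules.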
Closure under Set-Theoretic Operations: Overview

Operation         Notation             Cost
Union             A1 ∪ A2              quadratic
Intersection      A1 ∩ A2              quadratic
Complement        Ā                    exponential
Emptiness-Test    Language(A) = ∅      polynomial

Tree automata are closed under set-theoretic operations (see
[Comon et al., 1997] for details).

Application for Checking Type Inclusion

Type Inclusion
Given two tree automata A1 and A2, is Language(A1) ⊆ Language(A2)?

Theorem
Containment for non-deterministic tree automata can be decided in exponential
time

• Check whether Language(A1 ∩ Ā2) = ∅, where Ā2 is the complement of A2
• For this purpose, we must make A2 deterministic (size: O(2^|A2|))
→ EXPTIME
• Essentially no better solution [Seidl, 1990]
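Two of the operations in the overview are short enough to sketch: the emptiness test (a least fixpoint over the states reachable by some tree) and intersection (a product automaton). The rule encoding and all function names are assumptions. Complementation, which requires determinisation, is left out.

```python
from itertools import product


def inhabited_states(rules):
    """States reached by at least one tree, computed as a least fixpoint."""
    reached = set()
    changed = True
    while changed:
        changed = False
        for (_symbol, sources), targets in rules.items():
            if all(q in reached for q in sources) and not targets <= reached:
                reached |= targets
                changed = True
    return reached


def is_empty(rules, final):
    """Language(A) is empty iff no final state is inhabited."""
    return not (inhabited_states(rules) & final)


def intersect(rules1, final1, rules2, final2):
    """Product automaton: its states are pairs (p, q)."""
    rules = {}
    for (s1, src1), tgt1 in rules1.items():
        for (s2, src2), tgt2 in rules2.items():
            if s1 == s2 and len(src1) == len(src2):
                key = (s1, tuple(zip(src1, src2)))
                rules.setdefault(key, set()).update(product(tgt1, tgt2))
    return rules, set(product(final1, final2))


def parity_rules():
    """Rules of the "even number of a's" automaton, with sets of targets."""
    rules = {("#", ()): {"even"}}
    for left, right in product(("even", "odd"), repeat=2):
        same = "even" if left == right else "odd"
        other = "odd" if same == "even" else "even"
        rules[("b", (left, right))] = {same}
        rules[("a", (left, right))] = {other}
    return rules
```

For instance, intersecting the automaton accepting an even number of a's with the same automaton accepting an odd number yields an empty language. The nested loop over both rule sets is where the quadratic cost of the product comes from.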

Expressive Power of Tree Automata: Summary

Theorem
The following properties are equivalent for a tree language L:
(a) L is recognized by a bottom-up non-deterministic tree automaton
(b) L is recognized by a bottom-up deterministic tree automaton
(c) L is generated by a regular tree grammar

Proof Idea
(a) ⇒ (b): determinisation (see [Comon et al., 1997])
(a) ⇔ (c) : ? (horizontal recursion a∗ ?)

Unranked Trees

[Figure: a ranked tree, in which every node has a fixed number of children, next to an unranked tree, in which a node may have any number of children (b e c d a ... c d)]

Unranked Tree Automata?
1. either we adapt ranked tree automata
2. or we encode unranked trees as ranked trees...
Unranked Tree Automata

Ranked Trees
[Figure: a node σ in state q with two children σ1, σ2 in states q1, q2]
Transitions can be described by finite sets:
δ(σ, q) = {(q1, q2), (q3, q4), ...}

Unranked Trees
[Figure: a node σ in state q with children σ1 σ2 ... σn in states q1 q2 ... qn; is q1 q2 ... qn ∈ δ(σ, q)?]
• For unranked trees, δ(σ, q) is a regular language of words over the states
• δ(σ, q) may be specified by a regular expression or by a finite word
automaton [Murata, 1999]

Second Option

Can we encode unranked trees as ranked trees?

[Figure: an unranked tree whose root has the children b e c d a ... c d]
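The idea of specifying δ(σ, q) by a regular expression over states can be sketched with an unranked variant of the "even number of a's" automaton. Everything here is an assumption made for illustration: states are the single characters `E` (even) and `O` (odd), so that a sequence of child states is a plain string, and trees are tuples `(symbol, child_1, ..., child_n)` with any number of children.

```python
import re
from itertools import product

# Words over {E, O} containing an even / odd number of O's.
EVEN_ODDS = re.compile(r"E*(OE*OE*)*")
ODD_ODDS = re.compile(r"E*OE*(OE*OE*)*")

# DELTA[(symbol, state)] is the horizontal language of allowed child-state words.
DELTA = {
    ("b", "E"): EVEN_ODDS,
    ("b", "O"): ODD_ODDS,
    ("a", "E"): ODD_ODDS,   # an a node adds one a to the count
    ("a", "O"): EVEN_ODDS,
}


def states(tree):
    """Return the set of states reachable at the root of an unranked tree."""
    symbol, *children = tree
    child_sets = [states(child) for child in children]
    result = set()
    for (sym, q), pattern in DELTA.items():
        if sym != symbol:
            continue
        # Naive check: try every word of child states (exponential in general).
        if any(pattern.fullmatch("".join(word)) for word in product(*child_sets)):
            result.add(q)
    return result
```

A practical implementation would run a word automaton over the sequence of child state sets instead of enumerating all combinations. The enumeration is kept here only for brevity.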

Encoding Unranked Trees As Binary Trees

[Figure: an unranked tree with numbered nodes and its binary encoding]

Bijective Encoding
• “first child; next sibling” encoding
• Allows us to focus on binary trees without loss of generality
• Results for ranked trees hold for unranked trees as well

Tree Automata: Summary

Definition
A tree language is regular iff it is recognized by a non-deterministic tree
automaton

Advantages
• Closure, decidable operations
• General tool (theoretical and algorithmic)

Limitations
• a^n b^n
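The “first child; next sibling” encoding can be sketched as a pair of mutually inverse functions. The representations are assumptions: an unranked tree is `(label, [children])`, a binary tree is `(label, left, right)`, and `NIL` is a hypothetical leaf marking a missing child or sibling.

```python
NIL = ("#",)  # hypothetical marker for "no first child" / "no next sibling"


def encode(forest):
    """Encode a list of sibling trees as one binary tree.

    The left child of a node encodes its first child (with that child's
    siblings), and the right child encodes its next sibling.
    """
    if not forest:
        return NIL
    (label, children), *siblings = forest
    return (label, encode(children), encode(siblings))


def decode(binary):
    """Inverse of `encode`: rebuild the list of sibling trees."""
    if binary == NIL:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)


# An unranked tree a(b, c(d), e) and its binary encoding.
TREE = ("a", [("b", []), ("c", [("d", [])]), ("e", [])])
ENCODED = encode([TREE])
```

Because `decode(encode(f)) == f` for every forest `f`, a binary tree automaton running on the encoding can stand in for an automaton on the unranked tree.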
Application for Type-Checking

The XSLT Type-Checking Problem
Given an input type Tin, an output type Tout and an XSLT stylesheet f, does f(t) ∈ Tout hold for all t ∈ Tin?

Application for XSLT Type-Checking

[Figure: the stylesheet f (ending in </xsl:stylesheet>) maps an input tree t to f(t); the check is Tinf ⊆ Tout]

• Compute Tinf = {f(t) | t ∈ Tin}
• Check whether Tinf ⊆ Tout holds
• In case Tinf ⊆ Tout holds, then we know that for any t ∈ Tin, f(t) ∈ Tout
Limitation of the Approach

Tinf may not be regular:
[Figure: a tree c with children a a ... a (n times) is transformed into a tree c with children a a ... a (n times) followed by b b ... b (n times)]

Problem
• Approximation is required, e.g.: a^n b^n approximated by a∗ b∗
• Approximation is not contained in Tout (whereas the real type is)
• There is no “good” approximation...
• Consequence: this approach yields static type-checkers which are not
complete: some correct transformations might be rejected.

Backward Type Inference for XSLT

Modified Approach
• Compute Tinf = {f⁻¹(t) | t ∈ Tout}
• Check whether Tin ⊆ Tinf holds

Theorem and Research Prototype
Static type-checking is decidable for an XSLT fragment: “XSLT0”
[Tozawa, 2001]
• Inference of the input tree automaton (PTIME)
• Containment of tree automata (EXPTIME)
• Only basic transformations are supported (no real XPath)
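The incompleteness caused by forward approximation can be made concrete on the a^n b^n example, viewing the children of c as a word. In this sketch the names are assumptions: `in_real_type` tests the exact (non-regular) output type and `in_approximation` tests its regular over-approximation a∗ b∗.

```python
import re

APPROXIMATION = re.compile(r"a*b*")


def in_real_type(word: str) -> bool:
    """Exact output type: n a's followed by exactly n b's."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


def in_approximation(word: str) -> bool:
    """Regular over-approximation a*b* of the output type."""
    return APPROXIMATION.fullmatch(word) is not None
```

The word `aab` lies in the approximation but not in the real type. If Tout excludes such words, the forward checker rejects a transformation that is in fact correct.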
Concluding Remarks

• Tree automata are part of the theoretical tools that provide the underlying
guiding principles for XML (like the relational algebra provides the
underlying principles for relational databases)
• Still a lot of research ongoing on the topic, important challenges remain

</session>

A few pointers for the curious who want to learn more...

• Sheaves automata [Dal-Zilio and Lugiez, 2003]
(how to model efficiently unordered content, e.g. XML attributes, or
interleaving/shuffle operator)
• Visibly pushdown automata [Alur and Madhusudan, 2004]
(beyond regular tree languages)
• A powerful and efficient modal tree logic [Genevès et al., 2007]
(how to support regular tree languages and XPath too)

Questions / discussions...?

Alur, R. and Madhusudan, P. (2004).
Visibly pushdown languages.
In STOC ’04: Proceedings of the thirty-sixth annual ACM symposium on
Theory of computing, pages 202–211, New York, NY, USA. ACM.

Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison,
S., and Tommasi, M. (1997).
Tree automata techniques and applications.
Available on:
release October, 1st 2002.

Dal-Zilio, S. and Lugiez, D. (2003).
XML schema, tree logic and sheaves automata.
In Nieuwenhuis, R., editor, RTA’03: Proceedings of the 14th International
Conference on Rewriting Techniques and Applications, volume 2706 of
Lecture Notes in Computer Science, pages 246–263. Springer.

Doner, J. (1970).
Tree acceptors and some of their applications.
Journal of Computer and System Sciences, 4:406–451.

Genevès, P., Layaïda, N., and Schmitt, A. (2007).
Efficient static analysis of XML paths and types.
In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on
Programming Language Design and Implementation, pages 342–351, New
York, NY, USA. ACM Press.

Murata, M. (1999).
Hedge automata: a formal model for XML schemata.

Seidl, H. (1990).
Deciding equivalence of finite tree automata.
SIAM J. Comput., 19(3):424–437.

Thatcher, J. W. and Wright, J. B. (1968).
Generalized finite automata theory with an application to a decision
problem of second-order logic.
Mathematical Systems Theory, 2(1):57–81.

Tozawa, A. (2001).
Towards static type checking for XSLT.
In DocEng ’01: Proceedings of the 2001 ACM Symposium on Document
Engineering, pages 18–27, New York, NY, USA. ACM Press.