You are on page 1of 8

Why Tree Automata?

Foundations of XML Types: Tree Automata


• Foundations of XML type languages (DTD, XML Schema, Relax NG...)
Pierre Genevès • Provide a general framework for XML type languages
• A tool to define regular tree languages with an operational semantics
CNRS
pierre.geneves@inria.fr • Provide algorithms for efficient validation
• Basic tool for static analysis (proofs, decision procedures in logic)
Ecole Polytechnique Fédérale de Lausanne
April, 2008

Prelude: Word Automata From Words to Trees: Binary Trees

Binary trees with a even number of a’s


b b
a
a even
start even odd
a even a odd
a
b even a odd b even b even

Transitions
a
even → odd How to write transitions?
a
odd → even a
(even, odd) → even
... a
(even, even) → odd
etc.
Ranked Trees? Ranked Alphabet

They come from parse trees of data (or programs)...

A ranked alphabet symbol is:


A function call f (a, b) f (g (a, b, c), h(i)) is a ranked tree
• a formalisation of a function call
• a symbol a with an integer arity(a)
f • arity(a) indicates the number of children of a

ga b
h
Notation
a b c i
a(k) : symbol a avec arity(a) = k

Example Ranked Tree Automata

Alphabet: {a(2) , b (2) , c (3) , #(0) }


Possible tree:
A ranked tree automaton A consists in:

a
Alphabet(A): finite alphabet of symbols
States(A): finite set of states
b a Rules(A): finite set of transition rules
Final(A): finite set of final states (⊆ States(A))
a c # #
where :
# # a # b a(k)
Rules(A) are of the form (q1 , ..., qk ) → q
a(0)
# # a # if k = 0, we will write  → q

# #
How do they work? Terminology
Example: Boolean Expressions

∧ q1

∨ q1 ∧ q1
• Language(A) : set of trees accepted by A
∧ q0 ∨ q1 ∨ q1 ∧ q1
• For a tree automaton A, Language(A) is a regular tree language
[Thatcher and Wright, 1968, Doner, 1970]
0 q0 1 q1 1 q1 0 q0 0 q0 1 q1 1 q1 1 q1

Rules(A)
Principle 0 1
 → q0  → q1
• Alphabet(A) = {∧, ∨, 0, 1} ∧ ∨
(q1 , q1 ) → q1 (q0 , q1 ) → q1
∧ ∨
• States(A) = {q0 , q1 } (q0 , q1 ) → q0 (q1 , q0 ) → q1
∧ ∨
• 1 accepting state at the root: (q1 , q0 ) → q0 (q1 , q1 ) → q1
∧ ∨
Final(A) = {q1 } (q0 , q0 ) → q0 (q0 , q0 ) → q0

Example Outline

Tree automaton A over {a(2) , b (2) , #(0) } which recognizes trees with a even
number of a’s
• Can we implement a tree automaton efficiently? (notion of determinism)
Alphabet(A) : {a, b, #}
States(A) : {even, odd} • Are tree automata closed under set-theoretic operations?
Final(A) : {even} • Can we check type inclusion?
Rules(A) : • Nice theory. But... what should I do with my unranked XML trees?
a b
(even, even) → odd (even, even) → even
a b
• Can we apply this for XSLT type-checking?
(even, odd) → even (even, odd) → odd
a b
(odd, even) → even (odd, even) → odd
a b
(odd, odd) → odd (odd, odd) → even
#
 → even
Deterministic Tree Automata Can we Make a Tree Automaton Deterministic?

Deterministic Theorem (determinisation)


does not have two rules of the form: From a given non-deterministic (bottom-up) tree automaton we can build a
a (k) deterministic tree automaton
(q1 , ..., qk ) → q
a(k)
(q1 , ..., qk ) → q 0 Corollary
Non-deterministic and deterministic (bottom-up) tree automata recognize the
for two different states q and q 0 same languages.

Intuition Complexity
At most one possible transition at a given node → implementation... EXPTIME (|States(Adet )| = 2|States(A)| )

Implementing Validation Set-Theoretic Operations

Membership Checking
Given a tree automaton A and a tree t, is t ∈ Language(A)?

Remark b {q, qb , qf }
We can implement even if A is Recall
non-deterministic... b {q, qb , qf }
b {q, qb } • We have seen that neither local tree grammars nor single-type tree
Example grammars are closed under boolean operations (e.g. union)
a {q}
Automaton with Final(A) = {qf } and : • What about tree automata?
c b b
→q q → qb q→q {q, qb } b b {q, qb }
b a
qb → qf (q, q) → q {q} c c {q}

Complexity
Membership-Checking is in PTIME (time linear in the size of the tree)
Closure under Set-Theoretic Operations: Overview Application for Checking Type Inclusion

Type Inclusion
Given two tree automata A1 and A2 , is Language(A1 ) ⊆ Language(A2 ) ?
Operation Cost
Union A1 ∪ A2 quadratic Theorem
Intersection A1 ∩ A2 quadratic Containment for non-deterministic tree automata can be decided in exponential
Complement A exponential time
?
Emptiness-Test Language(A) = ∅ polynomial
Principle
?
• Language(A1 ∩ A2 ) = ∅
Tree automata are closed under set-theoretic operations (see • For this purpose, we must make A2 deterministic (size: O(2|A2 | ))
[Comon et al., 1997] for details). → EXPTIME
• Essentially no better solution [Seidl, 1990]

Expressive Power of Tree Automata: Summary Unranked Trees

String
Ranked Tree Unranked Tree
as Tree a a
Theorem a
The following properties are equivalent for a tree language L: b b e c d a ... c d
b
(a) L is recognized by a bottom-up non-deterministic tree automaton
c b a a
(b) Lis recognized by a bottom-up deterministic tree automaton c
(c) L is generated by a regular tree grammar a a c b a ... c c a e
a
b a b c e b c b e
Proof Idea b
(a) ⇒ (b): determinisation (see [Comon et al., 1997])
(a) ⇔ (c) : ? (horizontal recursion a∗ ?) Unranked Tree Automata?
1. either we adapt ranked tree automata
2. or we encode unranked trees are ranked trees...
Unranked Tree Automata Second Option
Ranked Trees
q σ
Transitions can be described by finite sets:
δ(σ, q) = {(q1 , q2 ), (q3 , q4 ), ...} σ1 σ2
Can we encode unranked trees as ranked trees?
q1 q2
a
Unranked Trees b e c d a ... c d
q σ

σ1 σ2 ............ σn
a

...
a
?
b a c c a e
q1 q2 ............ qn ∈ δ(σ, q)?
e b c b e
δ(σ, q)
• For unranked trees, δ(σ, q) is a regular tree language
• δ(σ, q) may be specified by a regular expression or by a finite word
automaton [Murata, 1999]

Encoding Unranked Trees As Binary Trees Tree Automata: Summary

0
0 1
2 Definition
1 A tree language is regular iff it is recognized by a non-deterministic tree
2 3 automaton
3 Advantages
• Closure, decidable operations
• General tool (theoretical and algorithmic)

Bijective Encoding Limitations


• an b n
• “first child; next sibling” encoding
• Allows to focus on binary trees without loss of generality
• Results for ranked trees hold for unranked trees as well
Application for Type-Checking Application for XSLT Type-Checking

<xsl:stylesheet>...
...
<xsl:template>
...
<xsl:value-of
select="a/b"/>
...
</xsl:template>
...
...
The XSLT Type-Checking Problem </xsl:stylesheet>
f ()
Given a type Tin and an XSLT stylesheet f , does f (t) ∈ Tout for all t ∈ Tin ?
?
t Tinf ⊆ Tout
f (t)
Tin

Approach
• Compute Tinf = {f (t)|t ∈ Tin }
• Check whether Tinf ⊆ Tout holds
• In case Tinf ⊆ Tout holds, then we know that for any t ∈ Tin , f (t) ∈ Tout

Limitation of the Approach Backward Type Inference for XSLT

Modified Approach
Tinf may not be regular:
• Compute Tinf = {f −1 (t)|t ∈ Tout )
c c
Transform into • Check whether Tin ⊆ Tinf holds
a a ... a a a ... a b b {z... b}
| {z } | {z }|
n n n
Theorem and Research Prototype
Problem Static type-checking is decidable for an XSLT fragment: “XSLT0”
[Tozawa, 2001]
• Approximation is required, e.g.: an b n approximated by a∗ b ∗
• Inference of the input tree automaton (PTIME)
• Approximation is not contained in Tout (whereas the real type is)
• Containment of tree automata (EXPTIME)
• There is no “good” approximation...
• Consequence: this approach yields static type-checkers which are not
Limitation
complete: some correct transformations might be rejected.
• Only basic transformations are supported (no real XPath)
Concluding Remarks </session>

• Tree automata are part of the theoretical tools that provide the underlying
guiding principles for XML (like the relational algebra provide the A few pointers for the curious who want to learn more...
underlying principles for relational databases)
• Still a lot of research ongoing on the topic, important challenges remain • Sheaves automata [Dal-Zilio and Lugiez, 2003]
(how to model efficiently unordered content, e.g. XML attributes, or
interleaving/shuffle operator)
• Visibly pushdown automata [Alur and Madhusudan, 2004]
(beyond regular tree languages)
• A powerful and efficient modal tree logic [Genevès et al., 2007]
(how to support regular tree languages and XPath too)

Questions / discussions...?

Alur, R. and Madhusudan, P. (2004). In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on
Programming Language Design and Implementation, pages 342–351, New
Visibly pushdown languages.
York, NY, USA. ACM Press.
In STOC ’04: Proceedings of the thirty-sixth annual ACM symposium on
Theory of computing, pages 202–211, New York, NY, USA. ACM. Murata, M. (1999).
Hedge automata: a formal model for XML schemata.
Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison,
S., and Tommasi, M. (1997). http://www.xml.gr.jp/relax/hedge_nice.html.
Tree automata techniques and applications. Seidl, H. (1990).
Available on: http://www.grappa.univ-lille3.fr/tata. Deciding equivalence of finite tree automata.
release October, 1st 2002. SIAM J. Comput., 19(3):424–437.
Dal-Zilio, S. and Lugiez, D. (2003). Thatcher, J. W. and Wright, J. B. (1968).
XML schema, tree logic and sheaves automata. Generalized finite automata theory with an application to a decision
In Nieuwenhuis, R., editor, RTA’03: Proceedings of the 14th International problem of second-order logic.
Conference on Rewriting Techniques and Applications, volume 2706 of Mathematical Systems Theory, 2(1):57–81.
Lecture Notes in Computer Science, pages 246–263. Springer.
Tozawa, A. (2001).
Doner, J. (1970).
Towards static type checking for XSLT.
Tree acceptors and some of their applications.
In DocEng ’01: Proceedings of the 2001 ACM Symposium on Document
Journal of Computer and System Sciences, 4:406–451.
Engineering, pages 18–27, New York, NY, USA. ACM Press.
Genevès, P., Layaïda, N., and Schmitt, A. (2007).
Efficient static analysis of XML paths and types.