DOCTORAATSPROEFSCHRIFT
2009  School voor Informatietechnologie
Kennistechnologie, Informatica, Wiskunde, ICT
Foundations of XML: Regular Expressions Revisited
Proefschrift voorgelegd tot het behalen van de graad van
Doctor in de Wetenschappen: richting informatica, te verdedigen door:
Wouter GELADE
Promotor: prof. dr. Frank Neven
Preface
This is the ideal opportunity to thank the people who have contributed either
directly or indirectly to this thesis.
First and foremost, I would like to thank my advisor, Frank Neven, for
his continuing guidance and support. Next to his excellent vision on scientiﬁc
work, he also is a great person whose students always come ﬁrst. Needless to
say, this dissertation would not have been possible without him.
During the last years I met many very interesting persons, too many to
thank all of them personally. But thank you Thomas, Bart, Wim, Henrik,
Walied, Geert Jan, Tim, Marcel, Kurt, Jan, Volker, Natalia, Goele, Stijn,
Marc, Wenfei, Floris, . . . for pleasant collaborations (scientiﬁc as well as
teaching), interesting conversations, and good company. Among them, special
thanks go out to Wim for many fun collaborations, and Thomas for inviting
me to his group in Dortmund, which really felt like a second research group
to me.
On a more personal note, I would also like to thank my friends for their
continuing support. Last, but deﬁnitely not least, I would like to thank my
parents, brothers, and sister, for being, each in their own way, simply the best
people I know. I can not imagine growing up in a better family as I did.
Thank you all!
Diepenbeek, September 2009
i
Contents
Preface i
1 Introduction 1
2 Deﬁnitions and preliminaries 7
2.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Deterministic Regular Expressions . . . . . . . . . . . . . . . . 9
2.3 Finite Automata . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Schema Languages for XML . . . . . . . . . . . . . . . . . . . . 11
2.5 Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . 12
I Foundations of Regular Expressions 15
3 Succinctness of Regular Expressions 17
3.1 A Generalization of a Theorem by Ehrenfeucht and Zeiger to a
Fixed Alphabet . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Complementing Regular Expressions . . . . . . . . . . . . . . . 27
3.3 Intersecting Regular Expressions . . . . . . . . . . . . . . . . . 32
4 Succinctness of Extended Regular Expressions 37
4.1 Deﬁnitions and Basic Results . . . . . . . . . . . . . . . . . . . 40
4.2 Succinctness w.r.t. NFAs . . . . . . . . . . . . . . . . . . . . . 43
4.3 Succinctness w.r.t. DFAs . . . . . . . . . . . . . . . . . . . . . 44
4.4 Succinctness w.r.t. Regular Expressions . . . . . . . . . . . . . 47
4.4.1 K
n
: The Basic Language . . . . . . . . . . . . . . . . . 48
4.4.2 L
n
: Binary Encoding of K
n
. . . . . . . . . . . . . . . . 49
4.4.3 M
n
: Succinctness of RE(&) . . . . . . . . . . . . . . . . 52
5 Complexity of Extended Regular Expressions 57
5.1 Automata for RE(#, &) . . . . . . . . . . . . . . . . . . . . . . 57
iii
iv CONTENTS
5.2 Complexity of Regular Expressions . . . . . . . . . . . . . . . . 66
6 Deterministic Regular Expressions with Counting 75
6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Expressive Power . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3 Succinctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4 Counter Automata . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.5 From RE(#) to CNFA . . . . . . . . . . . . . . . . . . . . . . . 97
6.6 Testing Strong Determinism . . . . . . . . . . . . . . . . . . . . 104
6.7 Decision Problems for RE(#) . . . . . . . . . . . . . . . . . . . 109
6.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
II Applications to XML Schema Languages 111
7 Succinctness of PatternBased Schema Languages 113
7.1 Preliminaries and Basic Results . . . . . . . . . . . . . . . . . . 115
7.1.1 PatternBased Schema Languages . . . . . . . . . . . . 116
7.1.2 Characterizations of (Extended) DTDs . . . . . . . . . . 117
7.1.3 Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.1.4 Succinctness . . . . . . . . . . . . . . . . . . . . . . . . 120
7.2 Regular PatternBased Schemas . . . . . . . . . . . . . . . . . . 120
7.3 Linear PatternBased Schemas . . . . . . . . . . . . . . . . . . 126
7.4 Strongly Linear PatternBased Schemas . . . . . . . . . . . . . 133
8 Optimizing XML Schema Languages 141
8.1 Complexity of DTDs and SingleType EDTDs . . . . . . . . . . 143
8.2 Complexity of Extended DTDs . . . . . . . . . . . . . . . . . . 144
8.3 Simpliﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9 Handling NonDeterministic Regular Expressions 151
9.1 Deciding Determinism . . . . . . . . . . . . . . . . . . . . . . . 155
9.2 Constructing Deterministic Expressions . . . . . . . . . . . . . 157
9.2.1 Growing automata . . . . . . . . . . . . . . . . . . . . . 158
9.2.2 Enumerating Automata . . . . . . . . . . . . . . . . . . 158
9.2.3 Optimizing the BKWAlgorithm . . . . . . . . . . . . . 161
9.3 Approximating Deterministic Regular Expressions . . . . . . . 165
9.3.1 Optimal Approximations . . . . . . . . . . . . . . . . . 165
9.3.2 Quality of the approximation . . . . . . . . . . . . . . . 166
9.3.3 Ahonen’s Algorithm . . . . . . . . . . . . . . . . . . . . 167
9.3.4 Ahonen’s Algorithm Followed by Grow . . . . . . . . . . 171
9.3.5 Shrink . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4.1 Deciding Determinism . . . . . . . . . . . . . . . . . . . 173
9.4.2 Constructing Deterministic Regular Expressions . . . . 174
9.4.3 Approximating Deterministic Regular Expressions . . . 175
9.5 SUPAC: Supportive UPA Checker . . . . . . . . . . . . . . . . 177
10 Conclusion 179
11 Publications 181
Bibliography 183
Samenvatting 193
1
Introduction
The regular languages constitute a highly robust class of languages. They can
be represented by many diﬀerent formalisms such as ﬁnite automata [92], reg
ular expressions [67], ﬁnite semigroups [97], monadic secondorder logic [18],
rightlinear grammars [21], quantiﬁerfree ﬁrstorder updates [39], ... Reg
ular languages are closed under a wide range of operations and, even more
importantly, almost any interesting problem concerning regular languages is
decidable. All kinds of aspects of the regular languages have been studied
over the past 50 years. From a practical perspective, the most widespread
way to specify regular languages is by using regular expressions (REs). They
are used in applications in many diﬀerent areas of computer science, including
bioinformatics [84], programming languages [108], model checking [105], and
XML schema language [100].
XML is the lingua franca and the de facto standard for data exchange
on the Web. When two parties exchange data in XML, the XML documents
usually adhere to a certain format. Thereto, XML schema languages are used
to describe the structure an XML document can have. The most common
XML schema languages are DTD, XML Schema [100], both W3C standards,
and Relax NG [22]. From a formal language theory point of view each of these
is a grammarbased formalism with regular expressions at their righthand
sides. These expression, however, diﬀer from the standard regular expressions
in that they are extended by additional operators but are also restricted by the
requirement to be deterministic. Although these requirements are recorded in
W3C and ISO standards, it is not clear what their impact on the diﬀerent
2 Introduction
schema languages is. The goal of this thesis therefore is a study of these
consequences; in particular, we study the complexity of optimization in the
presence of additional operators, illustrate the diﬃculties of migrating from
one schema language to another, and study the implications of the determinism
constraint (also in the presence of additional operators) and try to make it
accessible in practice.
Although the questions we ask are mainly inspired by questions about
XML, we believe that the answers are also interesting for the general the
oretical computer science community as they answer fundamental questions
concerning regular expressions. Therefore, this thesis consists of two parts.
In the ﬁrst part, we study fundamental aspects of regular expression. In the
second, we apply the former results to applications concerning XML schema
languages. Although most of the work is of a foundational nature, we also
developed software to bring these theoretical insights to practice (cf. Chap
ter 9).
Foundations of Regular Expressions
In Chapter 3, we address succinctness of regular expressions. In particular,
we consider the following questions. As usual L(r) denotes the language de
ﬁned by regular expression r. Given regular expressions r, r
1
, . . . , r
k
over an
alphabet Σ,
1. what is the complexity of constructing a regular expression r
¬
deﬁning
Σ
∗
\ L(r), that is, the complement of r?
2. what is the complexity of constructing a regular expression r
∩
deﬁning
L(r
1
) ∩ · · · ∩ L(r
k
)?
In both cases, the naive algorithm takes time double exponential in the size
of the input. Indeed, for the complement, transform r to an NFA and de
terminize it (ﬁrst exponential step), complement it and translate back to a
regular expression (second exponential step). For the intersection there is a
similar algorithm through a translation to NFAs, taking the crossproduct and
a retranslation to a regular expression. Note that both algorithms do not
only take double exponential time but also result in a regular expression of
double exponential size. We show that these naive constructions can not be
substantially improved by exhibiting classes of regular expressions for which
this double exponential size increase can not be avoided. The main technical
contribution of this chapter is a generalization of a result by Ehrenfeucht and
Zeiger [30]. In particular, we construct a family of languages over a twoletter
alphabet which can not be deﬁned by small regular expressions. The latter
3
also allows to show that in a translation from automata to regular expression
an exponential sizeincrease can in general not be avoided, even over a binary
alphabet.
In addition, we consider the same questions for two strict subclasses:
singleoccurrence and deterministic regular expressions. A regular expres
sion is singleoccurrence when every alphabet symbol occurs at most once.
For instance, (a +b)
∗
c is a singleoccurrence regular expression (SORE) while
a
∗
(a + b)
∗
is not. Despite their apparent simplicity, SOREs nonetheless cap
ture the majority of XML schemas on the Web [9]. Determinism (also called
oneunambiguity [17]) intuitively requires that, when matching a string from
left to right against an expression, it is always clear against which position in
the expression the next symbol must be matched. For example, the expression
(a +b)
∗
a is not deterministic, but the equivalent expression b
∗
a(b
∗
a)
∗
is. The
XML schema languages DTD and XML Schema [100] require expressions to
be deterministic. In short, for both classes complementation becomes easier,
while intersection remains diﬃcult.
While Chapter 3 investigates the complexity of applying language opera
tions to regular expressions, Chapter 4 investigates the succinctness of ex
tended regular expressions, i.e. expressions extended with additional opera
tors. In particular, we study the succinctness of regular expressions extended
with counting (RE(#)), intersection (RE(∩)), and interleaving (RE(&)) op
erators. The counting operator allows for expressions such as a
2,5
, specifying
that there must occur at least two and at most ﬁve a’s. These RE(#)s are
used in egrep [58] and Perl [108] patterns and XML Schema [100]. The class
RE(∩) is a well studied extension of the regular expressions, and is often re
ferred to as the semiextended regular expressions. The interleaving operator
allows for expressions such as a &b &c, specifying that a, b, and c may occur
in any order, and is used in the XML schema language Relax NG [22] and, in
a very restricted form, in XML Schema [100]. We give a complete overview of
the succinctness of these classes of extended regular expressions with respect
to regular expressions, NFAs, and DFAs.
The main reason to study the complexity of a translation from extended
regular expressions to standard expressions is to gain more insight in the power
of the diﬀerent operators. However, these results also have important conse
quences concerning the complexity of translating from one schema language to
another. For instance, we show that in a translation from RE(&) to standard
RE a double exponential size increase can in general not be avoided. As Relax
NG allows interleaving, and XML Schema only allows a very restricted form
of interleaving, this implies that also in a translation from Relax NG to XML
Schema a double exponential size increase can not be avoided. Nevertheless,
as XML Schema is a widespread W3C standard, and Relax NG is a more ﬂex
4 Introduction
ible alternative, such a translation would be more than desirable. Second, we
study the succinctness of extended regular expressions to ﬁnite automata as,
when considering algorithmic problems, the easiest solution is often through a
translation to ﬁnite automata. However, our negative results indicate that it
is often more desirable to develop dedicated algorithms for extended regular
expressions which avoid such translations.
In Chapter 5, we investigate the complexity of the equivalence, inclusion,
and intersection nonemptiness problem for various classes of regular expres
sions. These classes are general regular expressions extended with counting
and interleaving operators, and CHAin Regular Expressions (CHAREs) ex
tended with counting. The latter is another simple subclass of the regular
expressions [75]. The main motivation for studying these classes lies in their
application to the complexity of the diﬀerent XML schema languages and,
consequently, these results will further be used in Chapter 8, when studying
the complexity of the same problems for XML schema languages.
In Chapter 6, we study deterministic regular expressions with counting,
as these are essentially the regular expressions allowed in XML Schema. The
commonly accepted notion of determinism is as given before: an expression
is deterministic when matching a string against it, it is always clear against
which position in the expression the next symbol must be matched. However,
one can also consider a stronger notion of determinism in which it additionally
must also always be clear how to go from one position to the next. We refer to
the former notion as weak and the latter as strong determinism. For example,
(a
∗
)
∗
is weakly deterministic, but not strongly deterministic since it is not
clear over which star one should iterate when going from one a to the next.
For standard regular expressions this distinction is usually not made as
the two notions almost coincide for them. That is, a weak deterministic ex
pression can be translated in linear time into an equivalent strong determin
istic one [15].
1
This situation changes completely when counting is involved.
Firstly, the algorithm for deciding whether an expression is weakly determin
istic is nontrivial [66]. For instance, (a
2,3
+b)
2,2
b is weakly deterministic, but
the very similar (a
2,3
+ b)
3,3
b is not. So, the amount of nondeterminism in
troduced depends on the concrete values of the counters. Second, as we show,
weakly deterministic expressions with counting are strictly more expressive
than strongly deterministic ones. Therefore, the aim of this chapter is an in
depth study of the notions of weak and strong determinism in the presence of
counting with respect to expressiveness, succinctness, and complexity.
1
Br¨ uggemannKlein [15] did not study strong determinism explicitly. However, she does
give a procedure to transform expressions into star normal form which rewrites weakly
determinisistic expressions into equivalent strongly deterministic ones in linear time.
5
Applications to XML Schema Languages
In Chapter 7, we study patternbased schema languages. This is a schema
language introduced by Martens et al. [77] and equivalent in expressive power
to singletype EDTDs, the commonly used abstraction of XML Schema. An
advantage of this language is that it makes the expressiveness of XML Schema
more apparent: the content model of an element can only depend on regu
lar string properties of the string formed by the ancestors of that element.
Patternbased schemas can therefore be used as a typefree frontend for XML
Schema. As they can be interpreted both in an existential and universal way,
we study in this chapter the complexity of translating between the two seman
tics and into the formalisms of DTDs, EDTDs, and singletype EDTDs, the
common abstractions of DTD, Relax NG, and XML Schema, respectively
Here, we make extensive use of the results in Chapter 3 and show that
in general translations from patternbased schemas to the other schema for
malisms requires exponential or even double exponential time, thereby reduc
ing much of the hope of using patternbased schemas in its most general form
as a useful frontend for XML Schema. Therefore, we also study more re
stricted classes of schemas: linear and strongly linear patternbased schemas.
Interestingly, strongly linear schemas, the most restricted class, allow for ef
ﬁcient translations to other languages, eﬃcient algorithms for basic decision
problems, and yet are expressive enough to capture the far majority of real
world XML Schema Deﬁnitions (XSDs). From a practical point of view, this
is hence a very useful class of schemas.
In Chapter 8, we study the impact of adding counting and interleaving
to regular expressions for the diﬀerent schema languages. We consider the
equivalence, inclusion, and intersection nonemptiness problem as these con
stitute the basic building blocks in algorithms for optimizing XML schemas.
We also consider the simpliﬁcation problem: Given an EDTD, is it equivalent
to a singletype EDTD or a DTD?
Finally, in Chapter 9, we provide algorithms for dealing with the require
ment in DTD and XML Schema that all regular expressions are deterministic,
i.e. the Unique Particle Attribution constraint in XML Schema. In most
books (c.f. [103]), the UPA constraint is usually explained in terms of a simple
example rather than by means of a clear syntactical deﬁnition. So, when after
the schema design process, one or several regular expressions are rejected by
the schema checker on account of being nondeterministic, it is very diﬃcult
for nonexpert
2
users to grasp the source of the error and almost impossible
to rewrite the expression into an admissible one. The purpose of the present
chapter is to investigate methods for transforming nondeterministic expres
2
In formal language theory.
6 Introduction
sions into concise and readable deterministic ones deﬁning either the same
language or constituting good approximations. We propose the algorithm
supac (Supportive UPA Checker) which can be incorporated in a responsive
XSD tester which in addition to rejecting XSDs violating UPA also suggests
plausible alternatives. Consequently, the task of designing an XSD is relieved
from the burden of the UPA restriction and the user can focus on designing
an accurate schema.
2
Deﬁnitions and preliminaries
In this section, we deﬁne the necessary deﬁnitions and notions concerning
regular expressions, ﬁnite automata and XML schema languages.
2.1 Regular Expressions
By N we denote the natural numbers without zero. For the rest of this thesis,
Σ always denotes a ﬁnite alphabet. A Σstring (or simply string) is a ﬁnite
sequence w = a
1
· · · a
n
of Σsymbols. We deﬁne the length of w, denoted by
w, to be n. We denote the empty string by ε. The set of positions of w is
{1, . . . , n} and the symbol of w at position i is a
i
. By w
1
· w
2
we denote the
concatenation of two strings w
1
and w
2
. As usual, for readability, we denote
the concatenation of w
1
and w
2
by w
1
w
2
. The set of all strings is denoted by
Σ
∗
. A string language is a subset of Σ
∗
. For two string languages L, L
′
⊆ Σ
∗
,
we deﬁne their concatenation L · L
′
to be the set {ww
′
 w ∈ L, w
′
∈ L
′
}. We
abbreviate L· L· · · L (i times) by L
i
. By w
1
&w
2
we denote the set of strings
that is obtained by interleaving w
1
and w
2
in every possible way. That is, for
w ∈ Σ
∗
, w&ε = ε &w = {w}, and aw
1
&bw
2
= ({a}(w
1
&bw
2
)) ∪({b}(aw
1
&
w
2
)). The operator & is then extended to languages in the canonical way.
The set of regular expressions over Σ, denoted by RE, is deﬁned in the
usual way: ∅, ε, and every Σsymbol is a regular expression; and when r
1
and
r
2
are regular expressions, then so are r
1
·r
2
, r
1
+r
2
, and r
∗
1
. By RE(&, ∩, ¬, #)
we denote the class of extended regular expressions, that is, REs extended with
interleaving, intersection, complementation, and counting operators. So, when
8 Deﬁnitions and preliminaries
r
1
and r
2
are RE(&, ∩, ¬, #) expressions then so are r
1
&r
2
, r
1
∩r
2
, ¬r
1
and r
k,ℓ
1
for k ∈ N∪{0}, ℓ ∈ N∪{∞} and k ≤ ℓ. Here, k < ∞ for any k ∈ N∪{0}. For
any subset S of {&, ∩, ¬, #}, we denote by RE(S) the class of extended regular
expressions using only operators in S. For instance, RE(∩, ¬) denotes the
set of all expressions using, next to the standard operators, only intersection
and complementation. To distinguish from extended regular expressions, we
often refer to the normal regular expressions as standard regular expressions.
All notions deﬁned for extended regular expressions in the remainder of this
section also apply to standard regular expressions, in the canonical manner.
The language deﬁned by an extended regular expression r, denoted by
L(r), is inductively deﬁned as follows:
• L(∅) = ∅;
• L(ε) = {ε};
• L(a) = {a};
• L(r
1
r
2
) = L(r
1
) · L(r
2
);
• L(r
1
+r
2
) = L(r
1
) ∪ L(r
2
);
• L(r
∗
) = {ε} ∪
¸
∞
i=1
L(r)
i
;
• L(r
1
&r
2
) = L(r
1
) &L(r
2
)
• L(r
1
∩ r
2
) = L(r
1
) ∩ L(r
2
);
• L(¬r
1
) = Σ
∗
\ L(r
1
); and
• L(r
k,ℓ
) =
¸
ℓ
i=k
L(r)
i
.
By r
+
,
¸
k
i=1
r
i
, r?, and r
k
, with k ∈ N, we abbreviate the expressions
rr
∗
, r
1
+ · · · + r
k
, r + ε, and rr · · · r (ktimes), respectively. For a set S =
{a
1
, . . . , a
n
} ⊆ Σ, we abbreviate by S the regular expression a
1
+ · · · + a
n
.
When r
k,l
is used in a standard regular expression, this is an abbreviation
for r
k
(r + ε)
l−k
, unless mentioned otherwise. Note also that in the context
of RE(#), r
∗
is an abbreviation of r
0,∞
. Therefore, we sometimes omit ∗ as
a basic operator when counting is allowed. Note also that, in the absence of
complementation, the operator ∅ is only necessary to deﬁne the language ∅ and
can be removed in linear time from any extended regular expression deﬁning
a language diﬀerent from ∅. Therefore, we assume in the remainder of this
thesis that the operator ∅ is only used in the expression r = ∅, and not in any
other expression not containing complementation.
2.2. Deterministic Regular Expressions 9
For an extended regular expression r, we denote by Char(r) the set of Σ
symbols occurring in r. We deﬁne the size of an extended regular expression r
over Σ, denoted by r, as the number of Σsymbols and operators occurring in
r plus the sizes of the binary representations of the integers. This is equivalent
to the length of its (parenthesisfree) reverse Polish form [112]. Formally,
∅ = ε = a = 1, for a ∈ Σ, r
1
r
2
 = r
1
∩ r
2
 = r
1
+ r
2
 = r
1
& r
2
 =
r
1
∩r
2
 = r
1
 +r
2
 +1, r
∗
 = ¬r = r +1, and r
k,ℓ
 = r +⌈log k⌉+⌈log ℓ⌉.
Other possibilities considered in the literature for deﬁning the size of a
regular expression are: (1) counting all symbols, operators, and parentheses
[2, 59]; or, (2) counting only the Σsymbols. However, it is known (see, for
instance [31]) that for standard regular expressions, provided they are prepro
cessed by syntactically eliminating superﬂuous ∅ and εsymbols, and nested
stars, the three length measures are identical up to a constant multiplicative
factor. For extended regular expressions, counting only the Σsymbols is not
suﬃcient, since for instance the expression (¬ε)(¬ε)(¬ε) does not contain any
Σsymbols. Therefore, we deﬁne the size of an expression as the length of its
reverse Polish form.
For an RE(#) expression r, the set ﬁrst(r) (respectively, last(r)) consists
of all symbols which are the ﬁrst (respectively, last) symbols in some string
deﬁned by r. These sets are inductively deﬁned as follows:
• ﬁrst(ε) = last(ε) = ∅ and ∀a ∈ Char(r), ﬁrst(a) = last(a) = {a};
• If r = r
1
+ r
2
: ﬁrst(r) = ﬁrst(r
1
) ∪ ﬁrst(r
2
) and last(r) = last(r
1
) ∪
last(r
2
);
• If r = r
1
· r
2
:
– If ε ∈ L(r
1
), ﬁrst(r) = ﬁrst(r
1
) ∪ ﬁrst(r
2
), else ﬁrst(r) = ﬁrst(r
1
);
– If ε ∈ L(r
2
), last(r) = last(r
1
) ∪ last(r
2
), else last(r) = last(r
2
);
• If r = r
k,ℓ
1
: ﬁrst(r) = ﬁrst(r
1
) and last(r) = last(r
1
).
2.2 Deterministic Regular Expressions
As mentioned in the introduction, several XML schema languages restrict reg
ular expressions occurring in rules to be deterministic (also sometimes called
oneunambiguous [17]). We introduce this notion in this section.
A marked regular expression Σ is a regular expression over Σ×N in which
every (Σ × N)symbol occurs at most once. We denote the set of all these
expressions by MRE. Formally, r ∈ MRE if Char(r) ⊂ Σ × N and, for every
subexpression ss
′
or s +s
′
of r, Char(s) ∩ Char(s
′
) = ∅. A marked string is a
10 Deﬁnitions and preliminaries
string over Σ×N (in which (Σ×N)symbols can occur more than once). When
r is a marked regular expression, L(r) is therefore a set of marked strings.
The demarking of a marked expression is obtained by deleting these in
tegers. Formally, the demarking of r is dm(r), where dm : MRE → RE is
deﬁned as dm(ε) := ε, dm((a, i)) := a, dm(rs) := dm(r)dm(s), dm(r +s) :=
dm(r) +dm(s), and dm(r
k,ℓ
) := dm(r)
k,ℓ
. Any function m : RE → MRE such
that for every r ∈ RE it holds that dm(m(r)) = r is a valid marking func
tion. For conciseness and readability, we will from now on write a
i
instead of
(a, i) in marked regular expressions. For instance, a marking of (a + b)a + bc
is (a
1
+ b
1
)a
2
+ b
2
c
1
. The markings and demarkings of strings are deﬁned
analogously. For the rest of the thesis, we usually leave the actual marking
and demarking functions m and dm implicit and denote by r a marking of the
expression r and, conversely, by r the demarking of r. Likewise w will denote
a marking of a string w. We always use overlined letters to denote marked
expressions, symbols, and strings.
Deﬁnition 1. An expression r ∈ RE is deterministic (also called oneunambi
guous) if, for all strings u, v, w ∈ Char(r)
∗
and all symbols a, b ∈ Char(r), the
conditions uav, ubw ∈ L(r) and a = b imply that a = b.
Intuitively, an expression is deterministic if, when matching a string against
the expression from left to right, we always know against which symbol in the
expression we must match the next symbol, without looking ahead in the
string. For instance, (a +b)
∗
a is not deterministic, while (b
∗
a)(b
∗
a)
∗
is.
2.3 Finite Automata
A nondeterministic ﬁnite automaton (NFA) A is a 4tuple (Q, q
0
, δ, F) where
Q is the set of states, q
0
is the initial state, F is the set of ﬁnal states and
δ ⊆ Q × Σ × Q is the transition relation. By δ
∗
we denote the reﬂexive
transitive closure of δ, i.e. δ
∗
is the smallest relation such that for all q ∈ Q,
(q, ε, q) ∈ δ
∗
, δ ⊂ δ
∗
and when (q
1
, u, q
2
) and (q
2
, v, q
3
) are in δ
∗
then so is
(q
1
, uv, q
3
). So, w is accepted by A if (q
0
, w, q
′
) ∈ δ
∗
for some q
′
∈ F. We also
sometimes write q ⇒
A,w
q
′
when (q, w, q
′
) ∈ δ
∗
. The set of strings accepted
by A is denoted by L(A). The size of an NFA is δ. An NFA is deterministic
(or a DFA) if for all a ∈ Σ, q ∈ Q, it holds that {(q, a, q
′
) ∈ δ  q
′
∈ Q} ≤ 1.
When A is a DFA we sometimes write δ and δ
∗
as a function instead of a
relation, i.e. δ(q, a) = p when (q, a, p) ∈ δ. For a transition (q, a, p) ∈ δ, we
say that q is its source state and p its target state.
A state q ∈ Q is useful if there exist strings w, w
′
∈ Σ
∗
such that (q
0
, w, q) ∈
δ
∗
, and (q, w
′
, q
f
) ∈ δ
∗
, for some q
f
∈ F. An NFA is trimmed if it only contains
2.4. Schema Languages for XML 11
useful states. For q ∈ Q, let symbols(q) = {a  ∃p ∈ Q, (p, a, q) ∈ δ}. Then,
A is statelabeled if for any q ∈ Q, symbols(q) ≤ 1, i.e., all transitions to a
single state are labeled with the same symbol. In this case, we also denote this
symbol by symbol(q). Further, A is nonreturning if symbols(q
0
) = ∅, i.e., q
0
has no incoming transitions.
We will make use of the following wellknown results concerning ﬁnite
automata and regular expressions.
Theorem 2. Let A be an NFA over Σ with m states, and let A = n and
Σ = k.
1. A regular expression r, with L(r) = L(A), of size O(mk4
m
), can be
constructed in time 2
O(n)
[82, 31].
2. A DFA B with 2
m
states, such that L(B) = L(A), can be constructed
in time O(2
n
) [110].
3. A DFA B with 2
m
states, such that L(B) = Σ
∗
\L(A), can be constructed
in time O(2
n
) [110].
4. Let r ∈ RE. An NFA B with r + 1 states, such that L(B) = L(r), can
be constructed in time O(r · k) [15].
2.4 Schema Languages for XML
We use unranked trees as an abstraction for XML documents. The set of
unranked Σtrees, denoted by T
Σ
, is the smallest set of strings over Σ and
the parenthesis symbols “(” and “)” containing ε such that, for a ∈ Σ and
w ∈ (T
Σ
)
∗
, a(w) is in T
Σ
. So, a tree is either ε (empty) or is of the form
a(t
1
· · · t
n
) where each t
i
is a tree. In the tree a(t
1
· · · t
n
), the subtrees t
1
, . . . , t
n
are attached to the root labeled a. We write a rather than a(). Notice that
there is no a priori bound on the number of children of a node in a Σtree;
such trees are therefore unranked. For every t ∈ T
Σ
, the set of nodes of t,
denoted by Dom(t), is the set deﬁned as follows: (i) if t = ε, then Dom(t) = ∅;
and (ii) if t = a(t
1
· · · t
n
), where each t
i
∈ T
Σ
, then Dom(t) = {ε} ∪
¸
n
i=1
{iu 
u ∈ Dom(t
i
)}. For a node u ∈ Dom(t), we denote the label of u by lab
t
(u).
In the sequel, whenever we say tree, we always mean Σtree. We denote by
ancstr
t
(u) the sequence of labels on the path from the root to u including
both the root and u itself, and chstr
t
(u) denotes the string formed by the
labels of the children of u, i.e., lab
t
(u1) · · · lab
t
(un). Denote by t
1
[u ← t
2
] the
tree obtained from a tree t
1
by replacing the subtree rooted at node u of t
1
by
t
2
. By subtree
t
(u) we denote the subtree of t rooted at u. A tree language is
a set of trees.
12 Deﬁnitions and preliminaries
We make use of the following deﬁnitions to abstract from the commonly
used schema languages [77]:
Deﬁnition 3. Let R be a class of representations of regular string languages
over Σ.
1. A DTD(R) over Σ is a tuple (Σ, d, s
d
) where d is a function that maps Σ
symbols to elements of R and s
d
∈ Σ is the start symbol. For notational
convenience, we sometimes denote (Σ, d, s
d
) by d and leave the start
symbol s
d
and alphabet Σ implicit.
A tree t satisﬁes d if (i) lab
t
(ε) = s
d
and, (ii) for every u ∈ Dom(t) with
n children, lab
t
(u1) · · · lab
t
(un) ∈ L(d(lab
t
(u))). By L(d) we denote the
set of trees satisfying d.
2. An extended DTD (EDTD(R)) over Σ is a 5tuple D = (Σ, Σ
′
, d, s, µ),
where Σ
′
is an alphabet of types, (Σ
′
, d, s) is a DTD(R) over Σ
′
, and µ
is a mapping from Σ
′
to Σ.
A tree t then satisﬁes an extended DTD if t = µ(t
′
) for some t
′
∈ L(d).
Here we abuse notation and let µ also denote its extension to deﬁne a
homomorphism on trees. Again, we denote by L(D) the set of trees
satisfying D. For ease of exposition, we always take Σ
′
= {a
i
 1 ≤ i ≤
k
a
, a ∈ Σ, i ∈ N} for some natural numbers k
a
, and we set µ(a
i
) = a.
3. A singletype EDTD (EDTD
st
(R)) over Σ is an EDTD(R) D = (Σ, Σ
′
, d,
s, µ) with the property that for every a ∈ Σ
′
, in the regular expression
d(a) no two types b
i
and b
j
with i = j occur.
We denote by EDTD (respectively, EDTD
st
) the class EDTD(RE) (respec
tively, EDTD
st
(RE)). Similarly, for a subset S of {&, ∩, ¬, #} we denote by
EDTD(S) and EDTD
st
(S) the classes EDTD(RE(S)) and EDTD
st
(RE(S)),
respectively. As explained in [77, 85], EDTDs and singletype EDTDs cor
respond to Relax NG and XML Schema, respectively. Furthermore, EDTDs
correspond to the unranked regular languages [16], while singletype EDTDs
form a strict subset thereof [77].
2.5 Decision Problems
The following decision problems will be studied for various classes of regular
expressions, automata, and XML schema languages.
Deﬁnition 4. Let M be a class of representations of regular string or tree
languages:
2.5. Decision Problems 13
• inclusion for M: Given two elements e, e
′
∈ M, is L(e) ⊆ L(e
′
)?
• equivalence for M: Given two elements e, e
′
∈ M, is L(e) = L(e
′
)?
• intersection for M: Given an arbitrary number of elements e
1
, . . . , e
n
∈
M, is
¸
n
i=1
L(e
i
) = ∅?
• membership for M: Given an element e ∈ M and a string or a tree f,
is f ∈ L(e)?
• satisfiability for M: Given an element e ∈ M, does there exist a
nonempty string or tree f such that f ∈ L(e)?
• simplification for M: Given an element e ∈ M, does there exist a
DTD D such that L(e) = L(D)?
Part I
Foundations of Regular
Expressions
15
3
Succinctness of Regular
Expressions
The two central questions addressed in this chapter are the following. Given
regular expressions (REs) r, r
1
, . . . , r
k
over an alphabet Σ,
1. what is the complexity of constructing a regular expression r
¬
deﬁning
Σ
∗
\ L(r), that is, the complement of r?
2. what is the complexity of constructing a regular expression r
∩
deﬁning
L(r
1
) ∩ · · · ∩ L(r
k
)?
In both cases, the naive algorithm takes time double exponential in the size
of the input. Indeed, for the complement, transform r to an NFA and de
terminize it (ﬁrst exponential step), complement it and translate back to a
regular expression (second exponential step). For the intersection there is a
similar algorithm through a translation to NFAs, taking the crossproduct and
a retranslation to a regular expression. Note that both algorithms do not only
take double exponential time but also result in a regular expression of double
exponential size. In this chapter, we exhibit classes of regular expressions for
which this double exponential size increase cannot be avoided. Furthermore,
when the number k of regular expressions is ﬁxed, r
∩
can be constructed in
exponential time and we prove a matching lower bound for the size increase.
In addition, we consider the fragments of deterministic and singleoccurrence
regular expressions relevant to XML schema languages [9, 11, 43, 77]. Our
main results are summarized in Table 3.1.
18 Succinctness of Regular Expressions
complement intersection intersection
(ﬁxed) (arbitrary)
regular expression 2exp exp 2exp
deterministic poly exp 2exp
singleoccurrence poly exp exp
Table 3.1: Overview of the size increase for the various operators and sub
classes.
The main technical part of the chapter is centered around the generaliza
tion of a result in [30]. They exhibit a class of languages (K
n
)
n∈N
each of
which can be accepted by a DFA of size O(n
2
) but cannot be deﬁned by a
regular expression of size smaller than 2
n−1
. The most direct way to deﬁne
K
n
is by the DFA that accepts it: it consists of n states, labeled 0 to n − 1,
where 0 is the initial and n − 1 the ﬁnal state, and which is fully connected
with the edge between state i and j carrying the label a
i,j
. Note that the
alphabet over which K
n
is deﬁned grows quadratically with n. We generalize
their result to a ﬁxed alphabet. In particular, we deﬁne H
n
as the binary
encoding of K
n
using a suitable encoding for a
i,j
and prove that every regular
expression deﬁning H
n
should be at least of size 2
n
. As integers are encoded
in binary the complement and intersection of regular expressions can now be
used to separately encode H
2
n (and slight variations thereof) leading to the
desired results.
Although the succinctness of various automata models have been investi
gated in depth [46] and more recently those of logics over (unary alphabet)
strings [47], the succinctness of regular expressions had, up to recently, hardly
been addressed. For the complement of a regular expression an exponential
lower bound is given in [31]. For the intersection of an arbitrary number of
regular expressions Petersen gave an exponential lower bound [90], while [31]
mentions a quadratic lower bound for the intersection of two regular expres
sions. In fact, in [31], it is explicitly asked what the maximum achievable size
increase is for the complement of one and the intersection of two regular ex
pressions (Open Problems 4 and 5), and whether an exponential size increase
in the translation from DFA to RE is also unavoidable when the alphabet is
ﬁxed (Open Problem 3).
More recently, there have been a number of papers concerning succinctness
of regular expressions and related matters [53, 48, 49, 50, 51, 52]. Most related
is [48], where, independently, a number of problems similar to the problems
in this chapter are studied. They also show that in constructing a regular
expression for the intersection of two expressions, an exponential sizeincrease
can not be avoided. However, we only give a 2
Ω(
√
n)
lower bound whereas
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to
a Fixed Alphabet 19
they obtain 2
Ω(n)
, which in [49] is shown to be almost optimal. For the com
plementation of an RE we both obtain a double exponential lower bound.
Here, however, they obtain 2
2
Ω(
√
nlog n)
whereas we prove a (tight) 2
2
Ω(n)
lower
bound. However, in [51] they also prove a 2
2
Ω(n)
lower bound. Finally, as
a corollary of our results we obtain that in a translation from a DFA to an
RE an exponential size increase can not be avoided, also when the alphabet
is ﬁxed. This yields a lower bound of 2
Ω(
√
n/ log n)
which in [48] is improved
to the (tight) bound of 2
Ω(n)
. All results mentioned here involve ﬁxed binary
alphabets, and, together, these results settle open problems 3, 4, and 5 of [31].
As already mentioned in the introduction, the main motivation for this
research stems from its application in the emerging area of XMLtheory [72,
86, 98, 106]. The lower bounds presented here are utilized in Chapter 7 to
prove, among other things, lower bounds on the succinctness of existential
and universal patternbased schemas on the one hand, and singletype EDTDs
(a formalization of XSDs) and DTDs, on the other hand. As the DTD and
XML Schema speciﬁcation require regular expressions occurring in rules to be
deterministic, formalized by Br¨ uggemannKlein and Wood in terms of one
unambiguous regular expressions [17], we also investigate the complement and
intersection of those. In particular, we show that a deterministic regular ex
pressions can be complemented in polynomial time, whereas the lower bounds
concerning intersection carry over from unrestricted regular expressions. A
study in [9] reveals that most of the deterministic regular expression used in
practice take a very simple form: every alphabet symbol occurs at most once.
We refer to those as singleoccurrence regular expressions (SOREs) and show
a tight exponential lower bound for intersection.
Outline. In Section 3.1 we extend the results of Ehrenfeucht and Zeiger to
languages over a ﬁxed alphabet. Then, in Sections 3.2 and 3.3 we investigate
the succinctness of the complement and intersection of regular expressions,
respectively.
3.1 A Generalization of a Theorem by Ehrenfeucht
and Zeiger to a Fixed Alphabet
We ﬁrst introduce the family (K
n
)
n∈N
of string languages deﬁned by Ehren
feucht and Zeiger [30] over an alphabet whose size grows quadratically with
the parameter n:
Deﬁnition 5. Let n ∈ N and Σ
n
= {a
i,j
 0 ≤ i, j ≤ n − 1}. Then, K
n
contains exactly all strings of the form a
0,i
1
a
i
1
,i
2
· · · a
i
k−1
,n−1
where k ∈ N.
20 Succinctness of Regular Expressions
q
0
q
1
q
2
a
1,2
a
0,0
a
1,1
a
2,2
a
0,2
a
1,0
a
2,1
a
0,1
a
2,0
Figure 3.1: The DFA A
K
3
, accepting K
3
.
A way to interpret K
n
is to consider the DFA with states {q
0
, . . . , q
n−1
}
which is fully connected, whose start state is state q
0
and ﬁnal state is q
n−1
and where the edge between state q
i
and q
j
is labeled with a
i,j
. Formally, let
A
K
n
= (Q, q
0
, δ, F) be deﬁned as Q = {q
0
, . . . , q
n−1
}, F = {q
n−1
}, and for all
i, j ∈ [0, n −1], (q
i
, a
i,j
, q
j
) ∈ δ. Figure 3.1 shows A
K
3
.
Ehrenfeucht and Zeiger obtained the succinctness of DFAs with respect to
regular expressions through the following theorem:
Theorem 6 ([30]). For n ∈ N, any regular expression deﬁning K
n
must be of
size at least 2
n−1
. Furthermore, there is a DFA of size O(n
2
) accepting K
n
.
We will now deﬁne the language H
n
as a binary encoding of K
n
over a
fourletter alphabet, and generalize Theorem 6 to H
n
. Later, we will deﬁne
H
2
n
as an encoding of H
n
over a binary alphabet.
The language H
n
will be deﬁned as the straightforward binary encoding of
K
n
that additionally swaps the pair of indices in every symbol a
i,j
. Thereto,
for a
i,j
∈ Σ
n
, deﬁne the function ρ
n
as
ρ
n
(a
i,j
) = enc(j)$enc(i)#,
where enc(i) and enc(j) denote the ⌈log(n)⌉bit binary encodings of i and j,
respectively. Note that since i, j < n, i and j can be encoded using only
⌈log(n)⌉bits. We extend the deﬁnition of ρ
n
to strings in the usual way:
ρ
n
(a
i
0
,i
1
· · · a
i
k−1
,i
k
) = ρ
n
(a
i
0
,i
1
) · · · ρ
n
(a
i
k−1
,i
k
).
We are now ready to deﬁne H
n
.
Deﬁnition 7. Let Σ
H
= {0, 1, $, #}. For n ∈ N, let H
n
= {ρ
n
(w)  w ∈ K
n
}.
For instance, for n = 5, w = a
0,2
a
2,1
a
1,4
a
4,4
∈ K
5
and thus
ρ
n
(w) = 010$000#001$010#100$001#100$100# ∈ H
5
.
Remark 8. Although it might seem a bit artiﬁcial to swap the indices in
the binary encoding of K
n
, it is deﬁnitely necessary. Indeed, consider the
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to
a Fixed Alphabet 21
q
2
f
2,0
f
2,1
f
2,2
f
2,3
0
1
0
1
0
1
1 0
1 0
1 0
1 0
$
$
$
$
Figure 3.2: The automaton B
2
, for n = 4.
language H
′
n
obtained from K
n
like H
n
but without swapping the indices of
the symbols in K
n
. This language can be deﬁned by a regular expressions of
size O(nlog(n)):
enc(0)$
¸
i<n
enc(i)#enc(i)$
∗
enc(n −1)#.
Therefore, H
′
n
is not suited to show exponential or double exponential lower
bounds on the size of regular expressions.
We now show that, by using the encoding that swaps the indices, it is
possible to generalize Theorem 6 to a fourletter alphabet as follows:
Theorem 9. For any n ∈ N, with n ≥ 2,
1. there is a DFA A
n
of size O(n
2
log n) deﬁning H
n
; and,
2. any regular expression deﬁning H
n
is of size at least 2
n
.
We ﬁrst show (1). Let n ∈ N and n ≥ 2. We compose A
n
from a number
of subautomata which each will be able to read one block of a string. That
is, strings deﬁned by the regular expression (0 + 1)
⌈log n⌉
$(0 + 1)
⌈log n⌉
. For
any i < n, let B
i
= (Q
i
, q
i
, δ
i
, {f
i,0
, . . . , f
i,n−1
}), with L(B
i
) = {w$w
′
 w, w
′
∈
(0 +1)
⌈log n⌉
∧enc(w
′
) = i}, such that for any j < n, q ⇒
B
i
,w$w
′ f
i,j
whenever
enc(w) = j. That is, B
i
reads strings of the form w$w
′
, checks whether w
′
en
codes i and remembers the value of w in its accepting state. The construction
of these automata is straightforward. We illustrate the construction of B
2
, for
n = 4, in Figure 3.2.
Now, let A
n
= (Q, q
0
, δ, q
n−1
) where Q =
¸
i<n
Q
i
and δ contains
¸
i<n
δ
i
plus for every i, j < n, (f
i,j
, #, q
j
) ∈ δ. So, A
n
works as follows: ﬁrst A
0
22 Succinctness of Regular Expressions
reads the ﬁrst block of the string, checks whether the second integer is 0 and
remembers the ﬁrst integer of the block, say j. Then it passes control to B
j
which reads the next block in which it remembers the ﬁrst integer, say k, and
checks whether the second integer is j. If so, it passes control to B
k
, and so
on. Whenever A
n
reaches the initial state of B
n−1
, a valid string has been
read and A
n
can accept.
Finally, for the size of A
n
, consider the subautomata B
i
as illustrated in
Figure 3.2. The ﬁrst, treelike, part consists of at most n + n/2 + n/4 +
· · · + 1 nodes and transitions, which is bounded by 2n. Further, the linear
automata following this part are each of size ⌈log n⌉ and there are n of these.
So, the total size of any B
i
is O(nlog n). Since A
n
consists of a linear number
of B
i
subautomata, A
n
is of size O(n
2
log n). This concludes the proof of
Theorem 9(1).
We now prove Theorem 9(2). It follows the structure of the proof of
Ehrenfeucht and Zeiger but is technically more involved as it deals with binary
encodings of integers.
We start by introducing some terminology. We say that a language L
covers a string w if there exist strings u, u
′
∈ Σ
∗
such that uwu
′
∈ L. A regular
expression r covers w when L(r) covers w. Let w = a
i
0
,i
1
a
i
1
,i
2
· · · a
i
k−1
,i
k
be
a string covered by K
n
. We say that i
0
is the startpoint of w and i
k
is its
endpoint. Furthermore, we say that w contains i or i occurs in w if i occurs
as an index of some symbol in w. That is, a
i,j
or a
j,i
occurs in w for some j.
For instance, a
0,2
a
2,2
a
2,1
has startpoint 0, endpoint 1, and contains 0, 1 and
2.
The notions of contains, occurs, start and endpoint of a string w are also
extended to H
n
. For w covered by K
n
, the start and endpoints of ρ
n
(w)
are the start and endpoints of w. Hence, the startpoint of ρ
n
(w) is the
integer occurring between the ﬁrst $ and # signs, and the notions of start
and endpoints is only extended to images of strings covered by K
n
. The
notion of containing an integer, on the other hand, is extended to every string
v which is a covered by H
n
. That is, v contains any integer encoded by
⌈log n⌉ consecutive bits in v. Clearly, for any w covered by K
n
, ρ
n
(w) contains
exactly the integers contained in w, but the notion is a bit more general. For
instance, for n = 4, the string 0$01#11$10 contains 1, 2, and 3 but its start
and endpoint are undeﬁned as it is not the image of a string covered by K
n
.
For a regular expression r, we say that i is a sidekick of r when it occurs
in every nonempty string deﬁned by r. A regular expression s is a starred
subexpression of a regular expression r when s is a subexpression of r and is
of the form t
∗
. We say that a regular expression r is proper if every starred
subexpression of r has a sidekick.
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to
a Fixed Alphabet 23
Lemma 10. Any starred subexpression s of a regular expression r deﬁning
H
n
is proper.
Proof. We prove this lemma by a detailed examination of the structure of
strings deﬁned by s. We start by making the following observation. For
any string w ∈ L(s), there exist strings u, u
′
∈ Σ
∗
H
such that uwu
′
∈ L(r).
Furthermore, w can be pumped in uwu
′
and still form a string deﬁned by r.
That is, for every j ∈ N, uw
j
u
′
∈ L(r). In addition, for every other w
′
∈ L(s),
uw
′
u
′
∈ L(r).
Let w be a nonempty string in L(s) and let u, u
′
be such that uwu
′
∈ L(r).
Then w must contain at least one $symbol. Towards a contradiction, suppose
it does not. If w contains a # then uwwu
′
∈ L(r) but uwwu
′
∈ H
n
which leads
to the desired contradiction. If w contains no # and therefore only consists of
0’s and 1’s, then uw
n
u
′
∈ L(r) but uw
n
u
′
∈ H
n
which again is a contradiction.
In a similar way, one can show that (i ) w contains at least one #symbol; (ii )
w contains an equal number of $ and #symbols; (iii ) the $ and #symbols
must alternate; and (iv) between any consecutive $ and #symbol there is a
string of length ⌈log(n)⌉ containing only 0’s and 1’s.
From the above it follows that w matches one of the following expressions:
1. α
1
= (0 + 1)
∗
$(0 + 1)
⌈log n⌉
#Σ
∗
H
2. α
2
= Σ
∗
H
#(0 + 1)
⌈log n⌉
$(0 + 1)
∗
.
We refer to the strings deﬁned by α
1
and α
2
as strings of type1 and type2,
respectively. We next show that all strings deﬁned by s are either all of type
1 or all of type2. Towards a contradiction assume there is a type1 string
w
1
and a type2 string w
2
in L(s). Then, w
2
w
1
∈ L(s) and thus there exist
u, u
′
∈ Σ
∗
H
such that uw
2
w
1
u
′
∈ L(r). However, because of the concatenation
of w
2
and w
1
, there are two $symbols without an intermediate #symbol and
therefore uw
2
w
1
u
′
∈ H
n
.
Assume that all strings deﬁned by s are of type1. We next argue that the
substring of length ⌈log n⌉, that is, the integer i, between the ﬁrst $ and #
symbol is the same for every w ∈ L(s) which gives us our sidekick. For a type1
string, we refer to this integer as the start block. Towards a contradiction,
suppose that w, w
1
and w
2
are nonempty strings in L(s) such that w
1
and w
2
have diﬀerent start blocks. Let u, u
′
be such that uww
1
u
′
∈ L(r) and therefore
uww
1
u
′
∈ H
n
. Now, uw contains at least one $ and #symbol. Therefore, by
deﬁnition of H
n
, the value of the start block of w
1
is uniquely determined by
uw. That is, it must be equal to the integer preceding the last $symbol in uw.
Now also uww
2
u
′
∈ L(r) as s is a starred subexpression, but uww
2
u
′
∈ H
n
as
w
2
has a diﬀerent start block, which yields the desired contradiction.
24 Succinctness of Regular Expressions
The same kind of reasoning can be used to show that s has a sidekick when
all deﬁned strings are of type2.
If there is a greatest integer m for which r covers w
m
, we call m the index of
w in r and denote it by I
w
(r). In this case we say that r is wﬁnite. Otherwise,
we say that r is winﬁnite. The index of a regular expression can be used to
give a lower bound on its size according to the following lemma.
Lemma 11 ([30]).
1
For any regular expression r and string w, if r is wﬁnite,
then I
w
(r) < 2r.
Now, we can state the most important property of H
n
.
Lemma 12. Let n ≥ 2. For any C ⊆ {0, . . . , n − 1} of cardinality k and
i ∈ C, there exists a string w covered by H
n
with start and endpoint i and
only containing integers in C, such that any proper regular expression r which
covers w is of size at least 2
k
.
Proof. The proof is by induction on k. For k = 1, C = {i}. Then, deﬁne
w = enc(i)$enc(i)#, which satisﬁes all conditions and any expression covering
w must deﬁnitely have size at least 2.
For the inductive step, let C = {j
1
, . . . , j
k
} and i ∈ C. Deﬁne C
ℓ
=
C \ {j
(ℓ mod k)+1
} and let w
ℓ
be the string given by the induction hypothesis
with respect to C
ℓ
(of size k − 1) and j
ℓ
. Note that j
ℓ
∈ C
ℓ
. Further, deﬁne
m = 2
k+1
and set w equal to the string
enc(j
1
)$enc(i)#w
m
1
enc(j
2
)$enc(j
1
)#w
m
2
enc(j
3
)$enc(j
2
)#· · · w
m
k
enc(i)$enc(j
k
)#.
Then, w is covered by H
n
, has i as start and endpoint and only contains
integers in C. It only remains to show that any expression r which is proper
and covers w is of size at least 2
k
.
Fix such a regular expression r. If r is w
ℓ
ﬁnite for some ℓ ≤ k then,
I
w
ℓ
(r
k
) ≥ m = 2
k+1
, by construction of w. By Lemma 11, r ≥ 2
k
and we are
done.
Therefore, assume that r is w
ℓ
inﬁnite for every ℓ ≤ k. For every ℓ ≤ k,
consider all subexpressions of r which are w
ℓ
inﬁnite. We will see that all
minimal elements in this set of subexpressions must be starred subexpressions.
Here and in the following, we say that an expression is minimal with respect to
a set simply when no other expression in the set is a subexpression. Indeed, a
subexpression of the form a or ε can never be w
ℓ
inﬁnite and a subexpression
1
In fact, in [30] the length of an expression is deﬁned as the number of Σsymbols occurring
in it. However, since our length measure also contains these Σsymbols, this lemma still holds
in our setting.
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to
a Fixed Alphabet 25
of the form r
1
r
2
or r
1
+ r
2
can only be w
ℓ
inﬁnite if r
1
and/or r
2
are w
ℓ

inﬁnite and is thus not minimal with respect to w
ℓ
inﬁnity. Among these
minimal starred subexpressions for w
ℓ
, choose one and denote it by s
ℓ
. Let
E = {s
1
, . . . , s
k
}. Note that since r is proper, all its subexpressions are also
proper. As in addition each s
ℓ
covers w
ℓ
, by the induction hypothesis the size
of each s
ℓ
is at least 2
k−1
. Now, choose from E some expression s
ℓ
such that
s
ℓ
is minimal with respect to the other elements in E.
As r is proper and s
ℓ
is a starred subexpression of r, there is an integer
j such that every nonempty string in L(s
ℓ
) contains j. As, by the induction
hypothesis, w
ℓ
only contains integers in C, s
ℓ
is w
ℓ
inﬁnite, and s
ℓ
is chosen
to be minimal with respect to w
ℓ
inﬁnity, it follows that j ∈ C = {j
1
, . . . , j
k
}
must hold. Let k
′
be such that j = j
k
′ . By deﬁnition of the inductively
obtained strings w
1
, . . . , w
k
, we have that for p = k
′
−1 (if k
′
≥ 2, and p = k,
otherwise), w
p
does not contain j. Denote by s
p
the starred subexpression
from E which is w
p
inﬁnite. In particular, as j is a sidekick of s
ℓ
, and s
p
deﬁnes strings not containing j (recall that it is w
p
inﬁnite, and minimal in
this respect), s
ℓ
and s
p
cannot be the same subexpression of r.
Now, there are three possibilities:
• s
ℓ
and s
p
are completely disjoint subexpressions of r. That is, they are
both not a subexpression of one another. By induction they must both
be of size 2
k−1
and thus r ≥ 2
k−1
+ 2
k−1
= 2
k
.
• s
p
is a strict subexpression of s
ℓ
. This is not possible since s
ℓ
is chosen
to be a minimal element from E.
• s
ℓ
is a strict subexpression of s
p
. We show that if we replace s
ℓ
by ε
in s
p
, then s
p
is still w
p
inﬁnite. It then follows that s
p
still covers w
p
,
and thus s
p
without s
ℓ
is of size at least 2
k−1
. As s
ℓ
 ≥ 2
k−1
as well it
follows that r ≥ 2
k
.
To see that s
p
without s
ℓ
is still w
p
inﬁnite, recall that any nonempty
string deﬁned by s
ℓ
contains j and j does not occur in w
p
. Therefore, a
full iteration of s
ℓ
can never contribute to the matching of any number
of repetitions of w
p
. Or, more speciﬁcally, no nonempty word w ∈ L(s
ℓ
)
can ever be a substring of a word w
i
p
, for any i. So, s
p
can only lose its
w
p
inﬁnity by this replacement if s
ℓ
contains a subexpression which is
itself w
p
inﬁnite. However, this then also is a subexpression of s
p
and s
p
is chosen to be minimal with respect to w
p
inﬁnity, a contradiction. We
can only conclude that s
p
without s
ℓ
is still w
p
inﬁnite.
Since by Lemma 10 any expression deﬁning H
n
is proper, Theorem 9(2)
directly follows from Lemma 12 by choosing i = 0, k = n. This concludes the
proof of Theorem 9(2).
26 Succinctness of Regular Expressions
We now extend the results on H
n
, which is deﬁned over the fourletter
alphabet Σ
H
= {0, 1, $, #}, to a binary alphabet. Thereto, let Σ
2
= {a, b},
and deﬁne ρ
2
as ρ
2
(0) = aa, ρ
2
(1) = ab, ρ
2
($) = ba, and ρ
2
(#) = bb. Then, as
usual, for w = a
1
· · · a
n
∈ Σ
∗
H
, ρ
2
(w) = ρ
2
(a
1
) · · · ρ
2
(a
n
). Then, we can deﬁne
H
2
n
as follows.
Deﬁnition 13. For n ∈ N, let H
2
n
= {ρ
2
(w)  w ∈ H
n
}.
We can now easily extend Theorem 9 to H
2
n
, thus showing that in a trans
lation from a DFA to an RE an exponential blowup can not be avoided, even
when the alphabet is ﬁxed and binary.
Theorem 14. For any n ∈ N, with n ≥ 2,
1. there is a DFA A
n
of size O(n
2
log n) deﬁning H
2
n
; and,
2. any regular expression deﬁning H
2
n
is of size at least 2
n
.
Proof. The proof of Theorem 9 carries over almost literally to the current
setting. For (1), the automaton A
n
can be constructed very similarly as the
DFA A
n
in Theorem 9. The only diﬀerence is that for every transition, we
have to add an additional state to the automaton. For instance, if there is a
transition from state q to q
′
, reading a 0, we have to insert a new state, say
q
0
, with transitions from q to q
0
, and q
0
to q
′
, each reading an a. Further, if
after applying this transformation, we obtain a state which has two outgoing
transitions with the same label, we can simply merge the two states to which
the transitions point. The latter happens, for instance, when a state had an
outgoing 0 and 1 transition, as ρ
2
(0) = aa and ρ
2
(1) = ab. The obtained DFA
accepts H
2
n
and is of size O(n
2
log n).
For (2), the proof of Theorem 9(2) also carries over almost literally. In
particular, all starred subexpressions of an expression deﬁning H
2
n
must still
have a sidekick. And, in the last and crucial step of the proof, if we replace the
starred subexpression s
ℓ
by ε in s
p
, then s
p
is still w
p
inﬁnite. Hence, it can
be proved that any regular expression deﬁning H
2
n
is of size at least 2
n
.
We conclude the section by noting that Waizenegger [107] already claimed
a result similar to Theorem 9 using a diﬀerent binary encoding of K
n
. Unfor
tunately, we believe that a more sophisticated encoding as presented here is
necessary, and hence the proof in [107] to be incorrect, as we discuss in some
detail next.
He takes an approach similar to ours and deﬁnes a binary encoding of K
n
as follows. For n ∈ N, Σ
n
consists of n
2
symbols. If we list these, in some
unspeciﬁed order, and associate a natural number from 0 to n
2
− 1 to each
of them, we can encode K
n
using log(n
2
) bits. For instance, if n = 2, then
3.2. Complementing Regular Expressions 27
Σ
n
= {a
0,0
, a
0,1
, a
1,0
, a
1,1
} and we can encode these by the numbers 0, 1, 2, 3,
respectively. Now, a string in K
n
is encoded by replacing every symbol by its
binary encoding and additionally adding an a and b to the start and end of
each symbol. For instance, a
0,0
a
0,1
becomes a00ba01b, which gives a language,
say W
n
, over the fourletter alphabet {0, 1, a, b}.
Then, Waizenegger shows that W
n
can be described by an NFA of size
O(k
2
· log k
2
), but every regular expression deﬁning it must be of size at least
exponential in n. The latter is done by using the same proof techniques as in
[30]. However, in this proof it is claimed that if r
n
is an expression deﬁning
W
n
, then every string deﬁned by a starred subexpression of r
n
must start with
an a. This statement is incorrect. For instance, if r
2
is an expression deﬁning
W
2
, and we still use the same encoding as above, then ((a0(0ba0)
∗
0b) + ε)r
2
still deﬁnes W
2
but does not have this property. Albeit small, the mistake has
serious consequences. It follows from this claim that every starred subexpres
sion has a sidekick (using our terminology), and the rest of the proof builds
upon this fact. To see the importance, consider the language H
′
n
obtained
from K
n
like H
n
but without swapping the indices of the symbols in K
n
. As
illustrated in Remark 8, H
′
n
can be deﬁned by a regular expression of size
O(nlog(n)). However, if we assume that every string deﬁned by a starred
subexpression starts with a #, we can reuse our proof for H
n
and show that
any regular expression deﬁning H
′
n
must be of at least exponential size which
is clearly false.
To summarize, Waizenegger’s claim that the speciﬁc encoding of the n
2
symbols in Σ does not matter in the proof for Theorem 9 is incorrect. Our
encoding used in deﬁning H
n
allows to prove the sidekick lemma (Lemma 10)
in a correct way.
3.2 Complementing Regular Expressions
It is known that extended regular expressions are nonelementary more suc
cinct than classical ones [27, 102]. Intuitively, each exponent in the tower
requires nesting of an additional complement. In this section, we show that in
deﬁning the complement of a single regular expression, a double exponential
size increase cannot be avoided in general. In contrast, when the expression
is deterministic its complement can be computed in polynomial time.
Theorem 15. 1. For every regular expression r over Σ, a regular expres
sion s with L(s) = Σ
∗
\ L(r) can be constructed in time 2
2
O(n)
.
2. Let Σ be a binary alphabet. For every n ∈ N, there is a regular expression
r
n
of size O(n) such that any regular expression r deﬁning Σ
∗
\ L(r
n
) is
of size at least 2
2
n
.
28 Succinctness of Regular Expressions
Proof. (1) Let r ∈ RE. We ﬁrst construct a DFA A, with L(A) = Σ
∗
\ L(r)
and then construct the regular expression s equivalent to A. According to
Theorem 2(3) and (4) A contains at most 2
r+1
states and can be constructed
in time exponential in the size of r. Then, by Theorem 2(1), the total algorithm
is in time 2
2
O(n)
.
(2) Take Σ as Σ
H
, that is, {0, 1, $, #}. Let n ∈ N. We deﬁne an expression
r
′
n
of size O(n), such that Σ
∗
\ L(r
′
n
) = H
2
n. By Theorem 9, any regular
expression deﬁning H
2
n is of size exponential in 2
n
, that is, of size 2
2
n
. This
hence already proves that the theorem holds for languages over an alphabet
of size four. Afterwards, we show that it also holds for alphabets of size two.
The expression r
′
n
is then deﬁned as the disjunction of the following ex
pressions:
• all strings that do not start with a preﬁx in (0 + 1)
n
$0
n
:
Σ
0,2n
+ Σ
0,n−1
($ + #)Σ
∗
+ Σ
n
(0 + 1 + #)Σ
∗
+ Σ
n+1,2n
(1 + $ + #)Σ
∗
• all strings that do not end with a suﬃx in 1
n
$(0 + 1)
n
#:
Σ
0,2n+1
+ Σ
∗
(0 + 1 + $) + Σ
∗
($ + #)Σ
1,n
+
Σ
∗
(0 + 1 + #)Σ
n+1
+ Σ
∗
(0 + $ + #)Σ
n+2,2n+1
• all strings where a $ is not followed by a string in (0 + 1)
n
#:
Σ
∗
$
Σ
0,n−1
(# + $) + Σ
n
(0 + 1 + $)
Σ
∗
• all strings where a nonﬁnal # is not followed by a string in (0 + 1)
n
$:
Σ
∗
#
Σ
0,n−1
(# + $) + Σ
n
(0 + 1 + #)
Σ
∗
• all strings where the corresponding bits of corresponding blocks are dif
ferent:
((0+1)
∗
+Σ
∗
#(0+1)
∗
)0Σ
3n+2
1Σ
∗
+((0+1)
∗
+Σ
∗
#(0+1)
∗
)1Σ
3n+2
0Σ
∗
.
Notice here that in constructing each of the above regular expressions, we
only need to consider words which are not yet deﬁned by one of the foregoing
expressions. For instance, the last expression above is only properly deﬁned
over words of the form ((0 + 1)
n
$(0 + 1)
n
#)
∗
, as all other strings are already
deﬁned by the previous expressions. Then, it should be clear that a string
over {0, 1, $, #} is matched by none of the above expressions if and only if it
belongs to H
2
n. So, the complement of r
′
n
deﬁnes exactly H
2
n.
3.2. Complementing Regular Expressions 29
We now extend this result to a binary alphabet Σ = {a, b}. Thereto, let
ρ
2
(r
′
n
) be the expression obtained by replacing every symbol σ occurring in
r
′
n
by ρ
2
(σ). Then, the expression r
n
= ((a + b)(a + b))
∗
(a + b) + ρ
2
(r
′
n
) is
the desired expression as its complement is exactly H
2
2
n, which by Theorem 14
must be of size at least 2
2
n
. To see that the complement of r
n
is indeed H
2
2
n,
notice that the expression ((a + b)(a + b))
∗
(a + b) deﬁnes exactly all strings
which are not the encoding of some string w ∈ Σ
∗
H
. Then, ρ
2
(r
′
n
) captures
exactly all other strings which do represent a string w ∈ Σ
∗
H
, but are not in
H
2
2
n. Hence, the complement of r
n
is exactly H
2
2
n.
The previous theorem essentially shows that in complementing a regular
expression, there is no better algorithm than translating to a DFA, computing
the complement and translating back to a regular expression which includes
two exponential steps. However, when the given regular expression is deter
ministic, a corresponding DFA can be computed in quadratic time through the
Glushkov construction [17] eliminating already one exponential step. It should
be noted that this construction, which we refer to as Glushkov construction,
was introduced by [14], and often goes by the name position automata [57].
However, as in the context of deterministic expressions the name Glushkov
construction is the most common, we will consistently use this naming.
The proof of the next theorem shows that the complement of the Glushkov
automaton of a deterministic expression can directly be deﬁned by a regular
expression of polynomial size.
Theorem 16. For any deterministic regular expression r over an alphabet Σ,
a regular expression s deﬁning Σ
∗
\ L(r) can be constructed in time O(n
3
),
where n is the size of r.
Proof. Let r be a deterministic expression over Σ and ﬁx a marking r of r.
We introduce some notation.
• The set notﬁrst(r) contains all Σsymbols which are not the ﬁrst symbol
in any word deﬁned by r, that is, notﬁrst(r) = Σ \ ﬁrst(r).
• For any symbol x ∈ Char(r), the set notfollow(r, x) contains all Σ
symbols of which no marked version can follow x in any word deﬁned
by r. That is, notfollow(r, x) = Σ \ {dm(y)  y ∈ Char(r) ∧ ∃w, w
′
∈
Char(r)
∗
, wxyw
′
∈ L(r)}.
2
• Recall that last(r) contains all marked symbols which are the last symbol
of some word deﬁned by r.
2
Recall that dm(y) denotes the demarking of y.
30 Succinctness of Regular Expressions
We deﬁne the following regular expressions:
• init(r) =
notﬁrst(r)Σ
∗
if ε ∈ L(r); and
ε + notﬁrst(r)Σ
∗
if ε / ∈ L(r).
• For every x ∈ Char(r), r
x
will denote an expression deﬁning {wx  w ∈
Char(r)
∗
∧ ∃u ∈ Char(r)
∗
, wxu ∈ L(r)}. That is, all preﬁxes of strings
in r ending in x. Then, let r
x
be dm(r
x
).
We are now ready to deﬁne s:
init(r) +
¸
x/ ∈last(r)
r
x
(ε + notfollow(r, x)Σ
∗
) +
¸
x∈last(r)
r
x
notfollow(r, x)Σ
∗
.
We conclude by showing that s can be constructed in time cubic in the size
of r and that s deﬁnes the complement of r. We prove that L(s) = Σ
∗
\
L(r) by highlighting the correspondence between s and the complement of
the Glushkov automaton G
r
of r. The Glushkov automaton G
r
is the DFA
(Q, q
0
, δ, F), where
• Q = {q
0
} ∪ {q
x
 x ∈ Char(r)};
• F = {q
x
 x ∈ last(r)} (plus q
0
if ε ∈ L(r)); and
• for x, y ∈ Char(r), there is
– a transition (q
0
, dm(x), q
x
) ∈ δ, if x ∈ ﬁrst(r); and,
– a transition (q
x
, dm(y), q
y
) ∈ δ, if y follows x in some word deﬁned
by r.
It is known that L(r) = L(G
r
) and that G
r
is deterministic whenever r is
deterministic [17].
The complement automaton G
c
r
= (Q
c
, q
0
, δ
c
, F
c
) is obtained from G
r
by
making it complete and interchanging ﬁnal and nonﬁnal states. Formally,
Q
c
= Q ∪ {q } (with q ∈ Q) and δ
c
contains δ plus the triples (q , a, q ), for
every a ∈ Σ, and (q, a, q ) for every state q ∈ Q and symbol a ∈ Σ for which
there is no q
′
∈ Q with (q, a, q
′
) ∈ δ. Finally, F
c
= {q } ∪ (Q \ F). Clearly,
L(G
c
r
) = Σ
∗
\ L(G
r
).
Now, we show that L(s) = L(G
c
r
). First, by deﬁnition of s, ε ∈ L(s) if and
only if ε / ∈ L(r) if and only if ε ∈ L(G
c
r
). We prove that for any nonempty
word w, w ∈ L(s) if and only if w ∈ L(G
c
r
), from which the lemma follows.
Thereto, we show that the nonempty words deﬁned by the diﬀerent disjuncts
of s correspond exactly to subsets of the language G
c
r
.
3.2. Complementing Regular Expressions 31
• init(r) deﬁnes exactly the nonempty words for which G
c
r
immediately
goes from q
0
to q and reads the rest of the word while in q .
• For any x ∈ last(r), r
x
(notfollow(r, x)Σ
∗
) deﬁnes exactly all strings w
for which G
c
r
arrives in q
x
after reading a part of w, goes to q and reads
the rest of w there.
• For any x / ∈ last(r), r
x
(ε+notfollow(r, x)Σ
∗
) deﬁnes exactly all strings w
for which G
c
r
either (1) arrives in q
x
after reading w and accepts because
q
x
is an accepting state; or (2) arrives in q
x
after reading a part of w,
goes to q and reads the rest of w there.
Note that because there are no incoming transitions in q
0
, and q only
has transitions to itself, we have described exactly all accepting runs of G
c
r
.
Therefore, any nonempty string w ∈ L(s) if and only if w ∈ G
c
r
.
We now show that s can be computed in time cubic in the size of r. By a
result of Br¨ uggemannKlein [15] the Glushkov automaton G
r
corresponding to
r as deﬁned above can be computed in time quadratic in the size of r. Using
G
r
, the sets notﬁrst(r), notfollow(r, x) and last(r) can be computed in time
quadratic in the size of r. So, all sets can be computed in time cubic in the
size of r. The expression init(r) can be constructed in linear time. We next
show that for any x ∈ Char(r), the expression r
x
can be constructed in time
quadratic in the size of r. As r
x
= dm(r
x
), it follows that s can be constructed
in cubic time. The expression r
x
is inductively deﬁned as follows:
• For r = ε or r = ∅, r
x
= ∅.
• For r = y ∈ Char(r),
r
x
=
x if y = x
∅ otherwise.
• For r = αβ,
r
x
=
α
x
if x occurs in α
αβ
x
otherwise.
• For r = α +β,
r
x
=
α
x
if x occurs in α
β
x
otherwise.
• For r = α
∗
, r
x
= α
∗
α
x
32 Succinctness of Regular Expressions
The correctness is easily proved by induction on the structure of r. Note that
there are no ambiguities in the inductive deﬁnition of concatenation and dis
junction since the expressions are marked and therefore every marked symbol
occurs only once in the expression. For the time complexity, notice that all
steps in the above inductive deﬁnition, except for the case r = α
∗
, are linear.
Further, for r = α
∗
, r
x
= α
∗
α
x
, and hence α is doubled. However, as the
inductive construction continues on only one of the two operands it follows
that the complete construction is at most quadratic.
We illustrate the construction in the previous proof by means of an exam
ple. Let r = a(ab
∗
c)
∗
, and r = a
1
(a
2
b
∗
3
c
4
)
∗
. Then,
• r
a
1
= a
1
,
• r
a
2
= a
1
(a
2
b
∗
3
c
4
)
∗
a
2
,
• r
b
3
= a
1
(a
2
b
∗
3
c
4
)
∗
a
2
b
∗
3
b
3
,
• r
c
4
= a
1
(a
2
b
∗
3
c
4
)
∗
a
2
b
∗
3
c
4
, and
• init(r) = ε + (b +c)Σ
∗
.
Then the complement of r is deﬁned by
ε + (b +c)Σ
∗
+a(ab
∗
c)
∗
a(ε +aΣ
∗
) +a(ab
∗
c)
∗
ab
∗
b(ε +aΣ
∗
)
+a((b +c)Σ
∗
) +a(ab
∗
c)
∗
ab
∗
c((b +c)Σ
∗
).
We conclude this section by remarking that deterministic regular expres
sions are not closed under complement and that the constructed expression s
is therefore not necessarily deterministic.
3.3 Intersecting Regular Expressions
In this section, we study the succinctness of intersection. In particular, we
show that the intersection of two (or any ﬁxed number) and an arbitrary
number of regular expressions are exponentially and double exponentially
more succinct than regular expressions, respectively. Actually, the exponential
bound for a ﬁxed number of expressions already holds for singleoccurrence
regular expressions
3
, whereas the double exponential bound for an arbitrary
3
Recall that an expression is singleoccurrence if every alphabet symbol occurs at most
once in it.
3.3. Intersecting Regular Expressions 33
number of expressions only carries over to deterministic expressions. For
singleoccurrence expressions this can again be done in exponential time.
In this respect, we introduce a slightly altered version of H
n
.
Deﬁnition 17. Let Σ
D
= {0, 1, $, #, △}. For all n ∈ N, D
n
= {ρ
n
(w)△ 
K
n
covers w ∧ w is even}.
Here, we require that K
n
covers w, instead of requiring w ∈ K
n
, to allow
also for strings with a diﬀerent start and endpoint than 0 and n −1, respec
tively. This will prove technically convenient later on, but does not change
the structure of the language.
We also deﬁne a variant of K
n
which only slightly alters the a
i,j
symbols in
K
n
. Thereto, let Σ
◦
n
= {a
i
◦
,j
, a
i,j
◦  0 ≤ i, j < n} , ˆ ρ(a
i,j
a
j,k
) = £
i
a
i,j
◦a
j
◦
,k
and
ˆ ρ(a
i
0
,i
1
a
i
1
,i
2
· · · a
i
k−2
,i
k−1
a
i
k−1
,i
k
) = ˆ ρ(a
i
0
,i
1
a
i
1
,i
2
) · · · ˆ ρ(a
i
k−2
,i
k−1
a
i
k−1
,i
k
), where
k is even.
Deﬁnition 18. Let n ∈ N and Σ
n
E
= Σ
◦
n
∪ {£
0
, △
0
, . . . , £
n−1
, △
n−2
}. Then,
E
n
= {ˆ ρ(w)△
i
 K
n
covers w ∧ w is even ∧ i is the endpoint of w} ∪ {△
i

0 ≤ i < n}.
Note that words in E
n
are those in K
n
where every odd position is pro
moted to a circled one (
◦
), and triangles labeled with the noncircled positions
are added. For instance, the string a
2,4
a
4,3
a
3,3
a
3,0
which is covered by K
5
is
mapped to the string £
2
a
2,4
◦a
4
◦
,3
£
3
a
3,3
◦a
3
◦
,0
△
0
∈ E
5
.
We make use of the following property:
Lemma 19. Let n ∈ N.
1. Any regular expression deﬁning D
n
is of size at least 2
n
.
2. Any regular expression deﬁning E
n
is of size at least 2
n−1
.
Proof. (1) Analogous to Lemma 10, any regular expression r deﬁning D
n
must
also be proper and must furthermore cover any string w ∈ H
n
. By choosing
i = 0 and k = n in Lemma 12, we see that r must be of size at least 2
n
.
(2) Let n ∈ N and r
E
be a regular expression deﬁning E
n
. Let r
2
K
be the
regular expression obtained from r
E
by replacing £
i
and △
i
, with i < n, by
ε and any a
i
◦
,j
or a
i,j
◦, with 0 ≤ i, j < n, by a
i,j
. Then, r
2
K
 ≤ r
E
 and r
2
K
deﬁnes exactly all strings of even length in K
n
plus ε, and thus also covers
every string in K
n
. Since the proof in [30] also constructs a string w covered by
K
n
such that any proper expression covering w must be of size at least 2
n−1
,
it immediately follows that r
2
K
and thus r
E
must be of size at least 2
n−1
.
The next theorem shows the succinctness of the intersection operator.
34 Succinctness of Regular Expressions
Theorem 20. 1. For any k ∈ N and regular expressions r
1
, . . . , r
k
, a regu
lar expression deﬁning
¸
i≤k
L(r
k
) can be constructed in time 2
O((m+1)
k
)
,
where m = max {r
i
  1 ≤ i ≤ k}.
2. For every n ∈ N, there are SOREs r
n
and s
n
of size O(n
2
) such that any
regular expression deﬁning L(r
n
) ∩ L(s
n
) is of size at least 2
n−1
.
3. For every n ∈ N, there are deterministic regular expressions r
1
, . . . , r
m
,
with m = 2n +1, of size O(n) such that any regular expression deﬁning
¸
i≤m
L(r
i
) is of size at least 2
2
n
.
4. Let r
1
, . . . , r
n
be SOREs. A regular expression deﬁning
¸
i≤n
L(r
n
) can
be constructed in time 2
O(m)
, where m =
¸
i≤n
r
i
.
Proof. (1) First, construct NFAs A
1
, . . . , A
k
such that L(A
i
) = L(r
i
), for any
i ≤ k. If m = max {r
i
  1 ≤ i ≤ k}, then by Theorem 2(4) any A
i
has at
most m+ 1 states and can be constructed in time O(m· Σ). Then, an NFA
A with (m + 1)
k
states, such that L(A) =
¸
i≤k
L(A
i
), can be constructed in
time O((m + 1)
k
) by means of a product construction. By Theorem 2(1), a
regular expression deﬁning L(A) can then be constructed in time 2
O((m+1)
k
)
.
(2) Let n ∈ N. By Lemma 19(2), any regular expression deﬁning M
n
is of size
at least 2
n−1
. We deﬁne SOREs r
n
and s
n
of size quadratic in n, such that
L(r
n
) ∩ L(s
n
) = E
n
. We start by partitioning Σ
n
E
in two diﬀerent ways. To
this end, for every i < n, deﬁne Out
i
= {a
i,j
◦  0 ≤ j < n}, In
i
= {a
j
◦
,i
 0 ≤
j < n}, Out
i
◦ = {a
i
◦
,j
 0 ≤ j < n}, and, In
i
◦ = {a
j,i
◦  0 ≤ j < n}. Then,
Σ
n
E
=
¸
i
In
i
∪ Out
i
∪ {£
i
, △
i
} =
¸
i
◦
In
i
◦ ∪ Out
i
◦ ∪ {£
i
, △
i
}.
Further, deﬁne
r
n
=
(£
0
+· · · +£
n−1
)
¸
i
◦
In
i
◦Out
i
◦
∗
(△
0
+· · · +△
n−1
)
and
s
n
=
¸
i
(In
i
+ε)(£
i
+△
i
)(Out
i
+ε)
∗
.
Now, r
n
checks that every string consists of a sequence of blocks of the
form £
i
a
j,k
◦a
k
◦
,ℓ
, for i, j, k, ℓ < n, ending with a △
i
, for i < n. It thus sets
the format of the strings and checks whether the circled indices are equal.
Further, s
n
checks whether the noncircled indices are equal and whether the
triangles have the correct indices. Since the alphabet of E
n
is of size O(n
2
),
also r
n
and s
n
are of size O(n
2
).
3.3. Intersecting Regular Expressions 35
(3) Let n ∈ N. We deﬁne m = 2n +1 deterministic regular expressions of size
O(n), such that their intersection deﬁnes D
2
n. By Lemma 19(1), any regular
expression deﬁning D
2
n is of size at least 2
2
n
and the theorem follows. For
ease of readability, we denote Σ
D
simply by Σ. The expressions are as follows.
There should be an even length sequence of blocks:
(0 + 1)
n
$(0 + 1)
n
#(0 + 1)
n
$(0 + 1)
n
#
∗
△.
For all i ∈ {0, . . . , n−1}, the (i +1)th bit of the two numbers surrounding an
odd # should be equal:
Σ
i
(0Σ
3n+2
0 + 1Σ
3n+2
1)Σ
n−i−1
#
∗
△.
For all i ∈ {0, . . . , n−1}, the (i +1)th bit of the two numbers surrounding an
even # should be equal:
Σ
2n+2
Σ
i
(0Σ
2n−i+1
(△+ Σ
n+i+1
0Σ
n−i−1
#)+
(1Σ
2n−i+1
(△+ Σ
n+i+1
1Σ
n−i−1
#)))
∗
.
Clearly, the intersection of the above expressions deﬁnes D
2
n. Furthermore,
every expression is of size O(n) and is deterministic as the Glushkov construc
tion translates them into a DFA [17].
(4) We show that given SOREs r
1
, . . . , r
n
, we can construct an NFA A with
Σ + 1 states deﬁning
¸
i≤n
L(r
i
) in time cubic in the sizes of r
1
, . . . , r
n
. It
then follows from Theorem 2(1) that an expression deﬁning
¸
i≤n
L(r
i
) can be
constructed in time 2
O(m)
, where m =
¸
i≤n
r
i
 since m ≥ Σ.
We ﬁrst construct NFAs A
1
, . . . , A
n
such that L(A
i
) = L(r
i
), for any
i ≤ n, by using the Glushkov construction [15]. This construction creates
an automaton which has an initial state, with only outgoing transitions, and
additionally one state for each symbol in the regular expressions. Furthermore,
all incoming transitions for that state are labeled with that symbol. So, we
could also say that each state is labeled with a symbol, and that all incoming
transitions carry the label of that state. Since r
1
, . . . , r
n
are SOREs, for every
symbol there exists at most one state labeled with that symbol in any A
i
.
Now, let A
i
= (Q
i
, q
i
0
, δ
i
, F
i
) then we say that Q
i
= {q
i
0
} ∪{q
i
a
 a ∈ Σ}, where
q
i
a
is the state labeled with a. For ease of exposition, if a symbol a does not
occur in an expression r
i
, we add a state q
i
a
to Q
i
which does not have any
incoming or outgoing transitions.
Now, we are ready to construct the NFA A = (Q, q
0
, δ, F) deﬁning the
intersection of A
1
, . . . , A
n
. First, Q has again an initial state and one state
for each symbol: Q = {q
0
} ∪ {q
a
 a ∈ Σ}. A state is accepting if all its
36 Succinctness of Regular Expressions
corresponding states are accepting: F = {q
a
 ∀i ≤ n, q
i
a
∈ F
i
}. Here, a can
denote 0 or an alphabet symbol. Finally, there is a transition between q
a
and
q
b
if there is a transition between q
i
a
and q
i
b
, in every A
i
: δ = {(q
a
, b, q
b
)  ∀i ≤
n, (q
i
a
, b, q
i
b
) ∈ δ
i
}. Now, L(A) =
¸
i≤n
L(A
i
). Since the Glushkov construction
takes quadratic time [15], and we have to construct n automata, the total
construction can be done in cubic time.
We note that the lower bounds in Theorem 20(2) and (3) do not make
use of a ﬁxed size alphabet. Notice that for the results in Theorem 20(2)
it is impossible to avoid this, as the number of SOREs over an alphabet of
ﬁxed size is ﬁnite. However, if we weaken the statements and allow to use
unrestricted regular expressions in Theorem 20(2) and (3) (instead of SOREs
and deterministic expressions) it is easy to adapt the proofs to use the language
K
2
n
, and hence only use a binary alphabet.
We can thus obtain the following corollary.
Corollary 21. Let Σ be an alphabet with Σ = 2:
1. For every n ∈ N, there are regular expressions r
n
and s
n
over Σ, of size
O(n
2
), such that any regular expression deﬁning L(r
n
) ∩L(s
n
) is of size
at least 2
n−1
.
2. For every n ∈ N, there are regular expressions r
1
, . . . , r
m
over Σ, with
m = 2n + 1, of size O(n), such that any regular expression deﬁning
¸
i≤m
L(r
i
) is of size at least 2
2
n
.
4
Succinctness of Extended
Regular Expressions
Next to XML schema languages, regular expressions are used in many ap
plications such as text processors and programming languages [108]. These
applications, however, usually do not restrict themselves to the standard reg
ular expression using disjunction (+), concatenation (·) and star (
∗
), but also
allow the use of additional operators. Although these operators mostly do not
increase the expressive power of the regular expressions, they can have a dras
tic impact on succinctness, thus making them harder to handle. For instance,
it is wellknown that expressions extended with the complement operator can
describe certain languages nonelementary more succinct than standard regu
lar expressions or ﬁnite automata [102].
In this chapter, we study the succinctness of regular expressions extended
with counting (RE(#)), intersection (RE(∩)), and interleaving (RE(&)) op
erators. The counting operator allows for expressions such as a
2,5
, specifying
that there must occur at least two and at most ﬁve a’s. These RE(#)s are
used in egrep [58] and Perl [108] patterns and in the XML schema language
XML Schema [100]. The class RE(∩) is a well studied extension of the regular
expressions, and is often referred to as the semiextended regular expressions.
The interleaving operator allows for expressions such as a & b & c, specifying
that a, b, and c may occur in any order, and is used, for instance, in the XML
schema language Relax NG [22].
A problem we consider, is the translation of extended regular expressions
38 Succinctness of Extended Regular Expressions
into (standard) regular expressions. For RE(#) the complexity of this transla
tion is exponential [64] while it follows from the results in the previous chapter
that for RE(∩) it is double exponential.
1
We show that also in constructing
an expression for the interleaving of a set of expressions (an hence also for
an RE(&)) a double exponential size increase can not be avoided. This is
the main technical result of the chapter. Apart from a pure mathematical
interest, the latter result has two important consequences. First, it prohibits
an eﬃcient translation from Relax NG (which allows interleaving) to XML
Schema Deﬁnitions (which only allows a very limited form of interleaving).
However, as XML Schema is the widespread W3C standard, and Relax NG is
a more ﬂexible alternative, such a translation would be more than desirable.
A second consequence concerns the automatic discovery of regular expressions
describing a set of given strings. The latter problem occurs in the learning
of XML schema languages [8, 9, 11]. At present these algorithms do not take
into account the interleaving operator, but for Relax NG this would be wise
as this would allow to learn signiﬁcantly smaller expressions.
It should be noted here that Gruber and Holzer [51] obtained similar re
sults. They show that any regular expression deﬁning the language (a
1
b
1
)
∗
&
· · · & (a
n
b
n
)
∗
must be of size at least double exponential in n. Compared to
the result in this chapter, this gives a tighter bound (2
2
Ω(n)
instead of 2
2
Ω(
√
n)
),
and shows that the double exponential size increase already occurs for very
simple expressions. On the other hand, the alphabet of the counterexamples
grows linear with n, whereas the alphabet size is constant for the languages in
this chapter. Over a constant size alphabet, they improved our 2
2
Ω(
√
n)
bound
to 2
2
Ω(n/ log n)
.
We also consider the translation of extended regular expressions to NFAs.
For the standard regular expressions, it is wellknown that such a transla
tion can be done eﬃciently [15]. Therefore, when considering problems such
as membership, equivalence, and inclusion testing for regular expressions the
ﬁrst step is almost invariantly a translation to a ﬁnite automaton. For ex
tended regular expressions, such an approach is less fruitful. We show that
an RE(&, ∩, #) can be translated in exponential time into an NFA. However,
it has already been shown by Kilpelainen and Tuhkanen [64] and Mayer and
Stockmeyer [79] that such an exponential size increase can not be avoided for
RE(#) and RE(&), respectively. For the translation from RE(∩) to NFAs, a
2
Ω(
√
n)
lower bound is reported in [90], which we here improve to 2
Ω(n)
.
As the translation of extended regular expressions to NFAs already involves
an exponential size increase, it is natural to ask what the size increase for DFAs
1
The bound following from the results in the previous chapter is 2
2
Ω(
√
n)
. This has been
improved in [51] to 2
2
Ω(n)
.
39
NFA DFA RE
RE(#) 2
Ω(n)
[64] 2
2
Ω(n)
(Prop. 28) 2
θ(n)
[64]
RE(∩) 2
Ω(n)
(Prop. 26) 2
2
Ω(n)
(Thm. 29) 2
2
θ(n)
[51]
RE(&) 2
Ω(n)
[79] 2
2
Ω(
√
n)
(Thm. 31) 2
2
Ω(
√
n)
(Thm. 39)
RE(&, ∩, #) 2
O(n)
(Prop. 25) 2
2
O(n)
(Prop. 27) 2
2
O(n)
(Prop. 32)
(a)
RE
RE ∩ RE 2
Ω(n)
[48]
¸
RE 2
2
Ω(
√
n)
(Ch. 3)
RE & RE 2
Ω(n)
[48]
(b)
Figure 4.1: Table (a) gives an overview of the results in this chapter con
cerning the complexity of translating extended regular expressions into NFAs,
DFAs, and regular expressions. Proposition and theorem numbers are given in
brackets. Table (b) lists some related results obtained in the previous chapter
and [48].
is. Of course, we can translate any NFA into a DFA in exponential time, thus
giving a double exponential translation, but can we do better? For instance,
from the results in the previous chapter, we can conclude that given a set of
regular expressions, constructing an NFA for their intersection can not avoid
an exponential size increase. However, it is not too hard to see that also a
DFA of exponential size accepting their intersection can be constructed. In
the present chapter, we show that this is not possible for the classes RE(#),
RE(∩), and RE(&). For each class we show that in a translation to a DFA, a
double exponential size increase can not be avoided. An overview of all results
is given in Figure 4.1(a).
Related work. The diﬀerent classes of regular expressions considered
here have been well studied. In particular, the RE(∩) and its membership [90,
61, 71] and equivalence and emptiness [35, 89, 94] problems, but also the
classes RE(#) [64, 83] and RE(&) [79] have received interest. Succinctness
of regular expressions has been studied by Ehrenfeucht and Zeiger [30] and,
more recently, by Ellul et. al [31], Gruber and Holzer [48, 50, 49], and Gruber
and Johannsen [53]. Some relevant results are listed in Figure 4.1(b). Schott
and Spehner give lower bounds for the translation of the interleaving of words
to DFAs [96]. Also related, but diﬀerent in nature, are the results on state
complexity [111], in which the impact of the application of diﬀerent operations
on ﬁnite automata is studied.
40 Succinctness of Extended Regular Expressions
Outline. In Section 4.1 we give some additional necessary deﬁnitions and
present some basic results. In Sections 4.2, 4.3, and 4.4 we study the trans
lation of extended regular expressions to NFAs, DFAs, and regular expressions,
respectively.
4.1 Deﬁnitions and Basic Results
In this section we present some additional deﬁnitions and basic results used
in the remainder of this chapter.
Intuitively, the star height of a regular expression r, denoted by sh(r),
equals the number of nested stars in r. Formally, sh(∅) = sh(ε) = sh(a) = 0,
for a ∈ Σ, sh(r
1
r
2
) = sh(r
1
+r
2
) = max {sh(r
1
), sh(r
2
)}, and sh(r
∗
) = sh(r)+1.
The star height of a regular language L, denoted by sh(L), is the minimal star
height among all regular expressions deﬁning L.
The latter two concepts are related through the following theorem due to
Gruber and Holzer [48], which will allow us to reduce our questions about
the size of regular expressions to questions about the star height of regular
languages.
Theorem 22 ([48]). Let L be a regular language. Then any regular expression
deﬁning L is of size at least 2
1
3
(sh(L)−1)
−1.
A language L is bideterministic if there exists a DFA A, accepting L, such
that the inverse of A is again deterministic. That is, A may have at most one
ﬁnal state and the automaton obtained by inverting every transition in A, and
exchanging the initial and ﬁnal state, is again deterministic.
A (directed) graph G is a tuple (V, E), where V is the set of vertices and
E ⊆ V ×V is the set of edges. A graph (U, F) is a subgraph of G if U ⊆ V and
F ⊆ E. For a set of vertices U ⊆ V , the subgraph of G induced by U, denoted
by G[U], is the graph (U, F), where F = {(u, v)  u, v ∈ U ∧ (u, v) ∈ E}.
A graph G = (V, E) is strongly connected if for every pair of vertices
u, v ∈ V , both u is reachable from v, and v is reachable from u. A set
of edges V
′
⊆ V is a strongly connected component (SCC) of G if G[V
′
] is
strongly connected and for every set V
′′
, with V
′
V
′′
, G[V
′′
] is not strongly
connected. Let SCC(G) denote the set of strongly connected components of
G.
We now introduce the cycle rank of a graph G = (V, E), denoted by cr(G),
which is a measure for the structural complexity of G. It is inductively deﬁned
as follows: (1) if G is acyclic or empty, then cr(G) = 0, otherwise (2) if G is
strongly connected, then cr(G) = min
v∈V
cr(G[V \{v}]) +1, and otherwise (3)
cr(G) = max
V
′
∈SCC(G)
cr(G[V
′
]).
4.1. Deﬁnitions and Basic Results 41
We say that a graph H is a minor of a graph G if H can be obtained
from G by removing edges, removing vertices, and contracting edges. Here,
contracting an edge between two vertices u and v, means replacing u and v by
a new vertex, which inherits all incoming and outgoing edges of u and v. Now,
it has been shown by Cohen [23] that removing edges or nodes from a graph
does not increase the cycle rank of a graph, while McNaughton [80] essentially
showed that the same holds when contracting edges.
Lemma 23 ([23, 80]). If a graph H is a minor of a graph G, then cr(H) ≤
cr(G).
Let A = (Q, q
0
, δ, F) be an NFA. The underlying graph G of A is the
graph obtained by removing the labels from the transition edges of A, or
more formally G = (Q, E), with E = {(q, q
′
)  ∃a ∈ Σ, (q, a, q
′
) ∈ δ}. In the
following, we often abuse notation and for instance say the cycle rank of A,
referring to the cycle rank of its underlying graph.
There is a strong connection between the star height of a regular language,
and the cycle rank of the NFAs accepting it, as witnessed by the following
theorem. Theorem 24(1) is known as Eggan’s Theorem [29] and proved in its
present form by Cohen [23]. Theorem 24(3) is due to McNaughton [80].
Theorem 24. For any regular language L,
1. sh(L) = min {cr(A)  A is an NFA accepting L} [29, 23].
2. sh(L) · Σ ≥ min{cr(A)  A is a nonreturning statelabeled NFA such
that L = L(A)}.
3. if L is bideterministic, then sh(L) = cr(A), where A is the minimal
trimmed DFA accepting L. [80]
Proof. We only have to prove (2). Thereto, let L be a regular language over
an alphabet Σ. By Eggan’s Theorem (Theorem 24(1)), we know that there
exists an NFA A, with L(A) = L and cr(A) = sh(L). We show that, given A,
we can construct a nonreturning statelabeled NFA B
sl
equivalent to A such
that cr(A) · Σ ≥ cr(B
sl
) from which the theorem follows.
Let A = (Q, q
0
, δ, F) be an NFA over Σ, we construct B
sl
in two steps.
First, we construct a nonreturning NFA B = (Q
B
, q
B
, δ
B
, F
B
), with L(B) =
L(A), as follows: Q
B
= Q⊎{q
B
}, δ
B
= δ∪{(q
B
, a, q)  q ∈ Q, a ∈ Σ, (q
0
, a, q) ∈
δ}, and F
B
= F if q
0
/ ∈ F, and F
B
= F ∪ {q
B
}, otherwise. Intuitively,
B is A extended with a new initial state which only inherits the outgoing
transitions of the old initial state. It should be clear that B is nonreturning
and L(B) = L(A). Furthermore, cr(B) = cr(A) because q
B
has no incoming
42 Succinctness of Extended Regular Expressions
transitions and it thus forms a separate strongly connected component in B
whose cycle rank is 0. From the deﬁnitions it then follows that cr(B) = cr(A).
From B, we now construct the nonreturning statelabeled NFA B
sl
such
that cr(B
sl
) ≤ cr(B) · Σ = cr(A) · Σ. Let B
sl
= (Q
sl
, q
sl
0
, δ
sl
, F
sl
) be deﬁned
as
• Q
sl
= {q
a
 q ∈ Q
B
, a ∈ Σ};
• q
sl
0
= q
a
B
, for some a ∈ Σ;
• δ
sl
= {(q
a
, b, p
b
)  q, p ∈ Q
B
, a, b ∈ Σ, (q, b, p) ∈ δ
B
}; and
• F
sl
= {q
a
 q ∈ F
B
, a ∈ Σ}
That is, B
sl
contains Σ copies of every state q of B, each of which captures
all incoming transitions of q for one alphabet symbol. Obviously, B
sl
is a non
returning statelabeled NFA with L(B) = L(B
sl
).
We conclude by showing that cr(B) · Σ ≥ cr(B
sl
). In the following, we
abuse notation and, for a set of states P, write B[P] for the subautomaton of
B induced by P, deﬁned in the obvious way. Now, for a set of states P of B,
let P
sl
= {q
a
 q ∈ P, a ∈ Σ}, and observe that (B[Q \ P])
sl
= B
sl
[Q
sl
\ P
sl
]
always holds. We now show cr(B) · Σ ≥ cr(B
sl
) by induction on the number
of states of B. If Q
B
 = 1 then either the single state does not contain a
loop, such that cr(B) = cr(B
sl
) = 0, or the single state contains a self loop,
in which case cr(B) = 1, and as Q
sl
 = Σ, cr(B
sl
) ≤ Σ holds, and hence
cr(B) · Σ ≥ cr(B
sl
).
For the induction step, if B is acyclic, then, again, cr(B) = cr(B
sl
) = 0.
Otherwise, if B is strongly connected, then cr(B) = cr(B[Q\{q}])+1, for some
q ∈ Q
B
. Then, cr(B) · Σ = cr(B[Q\ {q}]) · Σ +Σ, which by the induction
hypothesis gives us cr(B)· Σ ≥ cr(B
sl
[Q
sl
\{q}
sl
])+Σ. Finally, it is shown in
[48] that removing a set of states P from a graph can never decrease its cycle
rank by more than P. Therefore, as {q}
sl
 = Σ, cr(B
sl
[Q \ {q}
sl
]) + Σ ≥
cr(B
sl
), and hence cr(B) · Σ ≥ cr(B
sl
).
Otherwise, if B consists of strongly connected components Q
1
, . . . , Q
k
,
with k ≥ 2, then cr(B) = max
i≤k
cr(B[Q
i
]). But now, by construction, every
strongly connected component V of B
sl
is contained in Q
sl
i
, for some i ∈ [1, k].
Then, B
sl
[V ] is a minor of B
sl
[Q
sl
i
], and hence it follows from Lemma 23 that
cr(B
sl
) = max
V ∈SCC(B
sl
)
cr(B
sl
[V ]) ≤ max
i≤k
cr(B
sl
[Q
sl
i
]). Now, by induction,
cr(B[Q
i
]) · Σ ≥ cr(B
sl
[Q
sl
i
]) for all i ∈ [1, k], from which it follows that
cr(B) · Σ ≥ cr(B
sl
).
4.2. Succinctness w.r.t. NFAs 43
4.2 Succinctness w.r.t. NFAs
In this section, we study the complexity of translating extended regular expres
sions into NFAs. We show that such a translation can be done in exponential
time, by constructing the NFA by induction on the structure of the expression.
Proposition 25. Let r be a RE(&, ∩, #). An NFA A with at most 2
r
states,
such that L(r) = L(A), can be constructed in time 2
O(r)
.
Proof. We construct A by induction on the structure of the formula. For the
base cases, r = ε, r = ∅, and r = a, for a ∈ Σ, and the induction cases r = r
1
r
2
,
r = r
1
+ r
2
, and r
∗
1
this can easily be done using standard constructions. We
give the full construction for the three special operators:
• If r = r
1
∩ r
2
, for i ∈ [1, 2], let A
i
= (Q
i
, q
i
0
, δ
i
, F
i
) accept L(r
i
). Then,
A = (Q, q
0
, δ, F) is deﬁned as Q = Q
1
×Q
2
, q
0
= (q
1
0
, q
2
0
), F = F
1
×F
2
,
and δ = {((q
1
, q
2
), a, (p
1
, p
2
))  (q
1
, a, p
1
) ∈ δ
1
∧ (q
2
, a, p
2
) ∈ δ
2
}.
• If r = r
1
&r
2
, then A is deﬁned exactly as for r
1
∩r
2
, except for δ which
now equals {((q
1
, q
2
), a, (p
1
, q
2
))  (q
1
, a, p
1
) ∈ δ
1
}∪{((q
1
, q
2
), a, (q
1
, p
2
)) 
(q
2
, a, p
2
) ∈ δ
2
}.
• If r = r
k,ℓ
1
, let A
1
accept L(r
1
). Then, let B
1
to B
ℓ
be ℓ identical copies
of A
1
with disjoint sets of states. For i ∈ [1, ℓ], let B
i
= (Q
i
, q
i
0
, δ
i
, F
i
).
Now, deﬁne A = (Q, q
0
0
, δ, F) accepting r
k,ℓ
1
as follows: Q =
¸
i≤ℓ
Q
i
,
F =
¸
k≤i≤ℓ
F
i
, and δ =
¸
i≤ℓ
δ
i
∪ {(q
i
, a, q
i+1
0
)  q
i
∈ Q
i
∧ ∃p
i
∈
F
i
such that (q
i
, a, p
i
) ∈ δ
i
}.
We note that the construction for the interleaving operator is introduced
by Mayer and Stockmeyer [79], who already used it for a translation from
RE(&) to NFAs. We argue that A contains at most 2
r
states. For r = r
1
∩r
2
or r = r
1
& r
2
, by induction A
1
and A
2
contain at most 2
r
1

and 2
r
2

states
and, hence, A contains at most 2
r
1

· 2
r
2

= 2
r
1
+r
2

≤ 2
r
states. For r = r
k,ℓ
1
,
similarly, A contains at most ℓ · 2
r
1

= 2
r
1
+log ℓ
≤ 2
r
states. Furthermore,
as the intermediate automata never have more than 2
r
states and we have to
do at most r such constructions, the total construction can be done in time
2
O(r)
.
This exponential size increase can not be avoided for any of the classes.
For RE(#) this is witnessed by the expression a
2
n
,2
n
and for RE(&) by the
expression a
1
&· · · &a
n
, as already observed by Kilpelainen and Tuhkanen [64]
and Mayer and Stockmeyer [79], respectively. For RE(∩), a 2
Ω(
√
n)
lower bound
has already been reported in [90]. The present tighter statement, however,
44 Succinctness of Extended Regular Expressions
will follow from Theorem 29 and the fact that any NFA with n states can be
translated into a DFA with 2
n
states (Theorem 2(2)).
Proposition 26. For any n ∈ N, there exist an RE(#) r
#
, an RE(∩) r
∩
, and
an RE(&) r
&
, each of size O(n), such that any NFA accepting r
#
, r
∩
, or r
&
contains at least 2
n
states.
4.3 Succinctness w.r.t. DFAs
In this section, we study the complexity of translating extended regular ex
pressions into DFAs. First, from Proposition 25 and the fact that any NFA
with n states can be translated into a DFA with 2
n
states in exponential time
(Theorem 2(2)), we can immediately conclude the following.
Proposition 27. Let r be a RE(&, ∩, #). A DFA A with at most 2
2
r
states,
such that L(r) = L(A), can be constructed in time 2
2
O(r)
.
We show that, for each of the classes RE(#), RE(∩), or RE(&), this double
exponential size increase can not be avoided.
Proposition 28. For any n ∈ N there exists an RE(#) r
n
of size O(n) such
that any DFA accepting L(r
n
) contains at least 2
2
n
states.
Proof. Let n ∈ N and deﬁne r
n
= (a+b)
∗
a(a+b)
2
n
,2
n
. Here, r
n
is of size O(n)
since the integers in the numerical predicate are stored in binary. We show that
any DFA A = (Q, q
0
, δ, F) accepting L(r
n
) has at least 2
2
n
states. Towards a
contradiction, suppose that A has less than 2
2
n
states and consider all strings
of length 2
n
containing only a’s and b’s. As there are exactly 2
2
n
such strings,
and A contains less than 2
2
n
states, there must be two diﬀerent such strings
w, w
′
and a state q of A such that both (q
0
, w, q) ∈ δ
∗
and (q
0
, w
′
, q) ∈ δ
∗
.
But now, as w = w
′
, there exists some i ∈ [1, 2
n
] such that the ith position
of w contains an a, and the ith position of w
′
contains a b (or the other way
around, but that is identical). Therefore, wa
i
∈ L(r
n
), and w
′
a
i
/ ∈ L(r
n
) but
wa
i
and w
′
a
i
are either both accepted or both not accepted by A, and hence
L(r
n
) = L(A), which gives us the desired contradiction.
We now move to regular expressions extended with the intersection op
erator. The succinctness of RE(∩) with respect to DFAs can be obtained
along the same lines as the simulation of exponential space turing machines
by RE(∩) by F¨ urer [35].
Theorem 29. For any n ∈ N there exists an RE(∩) r
∩
n
of size O(n) such that
any DFA accepting L(r
∩
n
) contains at least 2
2
n
states.
4.3. Succinctness w.r.t. DFAs 45
Proof. Let n ∈ N. We start by describing the language G
n
which will be used
to establish the lower bound. This will be a variation of the following language
over the alphabet {a, b}: {ww  w = 2
n
}. It is wellknown that this language
is hard to describe by a DFA. However, to deﬁne it succinctly by an RE(∩)
expression, we need to add some additional information to it.
Thereto, we ﬁrst deﬁne a marked number as a string over the alphabet
{0, 1, 0, 1} deﬁned by the regular expression (0 + 1)
∗
10
∗
+ 0
∗
, i.e., a binary
number in which the rightmost 1 and all following 0’s are marked. Then, for
any i ∈ [0, 2
n
−1] let enc(i) denote the nbit marked number encoding i. These
marked numbers were introduced in [35], where the following is observed: if
i, j ∈ [0, 2
n
− 1] are such that j = i + 1 (mod 2
n
), then the bits of i and j
which are diﬀerent are exactly the marked bits of j. For instance, for n = 2,
enc(1) = 01 and enc(2) = 10 and they diﬀer in both bits as both bits of enc(2)
are marked. Further, let enc
R
(i) denote the reversal of enc(i).
Now, for a string w = a
0
a
1
. . . a
2
n
−1
, deﬁne
enc(w) = enc
R
(0)a
0
enc(0)$enc
R
(1)a
1
enc(1)$ · · · enc
R
(2
n
−1)a
2
n
−1
enc(2
n
−1)
and, ﬁnally, deﬁne
G
n
= {#enc(w)#enc(w)  w ∈ L((a +b)
∗
) ∧ w = 2
n
}.
For instance, for n = 2, and w = abba, enc(w) = 00a00$10b01$01b10$11a11
and hence #00a00$10b01$01b10$11a11#00a00$10b01$01b10$11a11 ∈ G
2
.
Now, lets consider G
n
, the complement of G
n
. Using a standard argument,
similar to the in one in the proof of Proposition 28, it is straightforward to
show that any DFA accepting G
n
must contain at least 2
2
n
states. We conclude
by showing that we can construct a regular expression r
∩
n
of size O(n) deﬁning
G
n
.
Note that Σ = {0, 0, 1, 1, a, b, $, #}. We deﬁne N = {0, 0, 1, 1}, S = {a, b},
D = {$, #} and for any set U, and σ ∈ U, let U
σ
= U\{σ}. Now, we construct
a set of expressions, each capturing a possible mistake in a string. Then, r
∩
n
is simply the disjunction of these expressions. The expressions are as follows.
All strings which do not start with #:
ε + Σ
#
Σ
∗
.
All strings in which two symbols at a distance n + 1 do not match:
Σ
∗
(NΣ
n
(S +D) +SΣ
n
(S +N) +DΣ
n
(D +N))Σ
∗
.
All strings which do not end with enc(2
n
−1) = 1
n−1
1:
Σ
∗
(Σ
1
+ Σ
1
Σ
1,n−1
) .
46 Succinctness of Extended Regular Expressions
All strings in which # occurs before any number other than enc
R
(0) or where
$ occurs before enc
R
(0):
Σ
∗
($0
n
+ #(Σ
0,n−1
Σ
0
))Σ
∗
.
All strings which contain more or less than 2 #symbols
Σ
∗
#
(# +ε)Σ
∗
#
+ Σ
∗
#Σ
∗
#Σ
∗
#Σ
∗
.
All strings which contain a (nonreversed) binary number which is not correctly
marked:
Σ
∗
SN
∗
((0 + 1)(0 + $ + #) + (0 + 1)N
0
)Σ
∗
.
All strings in which the binary encodings of two numbers surrounding an a or
b are not each others reverse. Thereto, we ﬁrst deﬁne expressions r
i
, for all
i ∈ [0, n], such that L(r
i
) = {wσw
′
 σ ∈ S ∧w, w
′
∈ Σ
∗
∧w = w
′
 ∧w ≤ i},
inductively as follows: r
0
= S and, for all j ∈ [1, n], r
j
= r
0
+ Σr
j−1
Σ. Then,
the following is the desired expression:
Σ
∗
(r
n
∩
¸
σ∈N
σΣ
∗
Σ
σ
)Σ
∗
.
All strings in which the binary encodings of two numbers surrounding a $ or #
do not diﬀer by exactly one, i.e., there is a substring of the form enc(i)$enc
R
(j)
or enc(i)#enc
R
(j) such that j = i + 1 (mod 2
n
). Exactly as above, we can
inductively deﬁne r
′
n
such that L(r
′
n
) = {wσw
′
 σ ∈ D ∧ w, w
′
∈ Σ
∗
∧ w =
w
′
 ∧ w ≤ n}. Then, we obtain:
Σ
∗
(r
′
n
∩ ((0 + 0)Σ
∗
(0 + 1) + (1 + 1)Σ
∗
(1 + 0)))Σ
∗
.
All strings in which two a or b symbols which should be equal are not equal.
We now deﬁne expressions s
i
, for all i ∈ [0, n] such that L(s
i
) = {wu#vw
R

u, v, w ∈ Σ
∗
∧ w = i}. By induction, s
0
= Σ
∗
#Σ
∗
, and for all i ∈ [1, n],
s
i
= Σs
i−1
Σ ∩
¸
σ∈Σ
σΣ
∗
σ. Then, the following is the desired expression:
Σ
∗
(as
n
b +bs
n
a)Σ
∗
.
Now, a string is not in G
n
if and only if it is accepted by at least one of
the previous expressions. Hence, r
∩
n
, deﬁned as the disjunction of all these
expressions, deﬁnes exactly G
n
. Furthermore, notice that all expressions, in
cluding the inductively deﬁned ones, are of size O(n), and hence r
∩
n
is also of
size O(n). This concludes the proof.
4.4. Succinctness w.r.t. Regular Expressions 47
We can now extend the results for RE(∩) to RE(&). We do this by using
a technique of Mayer and Stockmeyer [79] which allows, in some sense, to
simulate an RE(∩) expression by an RE(&) expression. We illustrate their
technique in a simple case, for an expression r = r
1
∩ r
2
, where r
1
and r
2
are standard regular expressions, not containing intersection operators. Let c
be a symbol not occurring in r, and for i = 1, 2, deﬁne r
c
i
as the expression
obtained from r
i
by replacing any symbol a in r
i
by ac. Then, it is easily seen
that a string a
0
· · · a
n
∈ L(r
i
) if and only if a
0
c · · · a
n
c ∈ L(r
i
). Now, consider
the expression r
c
= r
c
1
&r
c
2
. Then, a string a
0
· · · a
n
∈ L(r) = L(r
1
) ∩ L(r
2
) if
and only if a
0
c · · · a
n
c ∈ L(r
c
1
) ∩ L(r
c
2
) if and only if a
2
0
c
2
· · · a
2
n
c
2
∈ L(r
c
). So,
not all strings deﬁned by r
c
are of the form a
2
0
c
2
· · · a
2
n
c
2
, but the set of strings
of the form a
2
0
c
2
· · · a
2
n
c
2
deﬁned by r
c
corresponds exactly to the set of strings
deﬁned by r. In this sense, r
c
simulates r.
The latter technique can be extended to general RE(∩) and RE(&). To
formally deﬁne this, we need some notation. Let w = a
0
· · · a
n
be a string over
an alphabet Σ, and let c be a symbol not in Σ. Then, for any i ∈ N, deﬁne
pump
i
(w) = a
i
0
c
i
a
i
1
c
i
· · · a
i
k
c
i
.
Lemma 30 ([79]). Let r be an RE(∩) containing k ∩operators. Then, there
exists an RE(&) s of size at most r
2
such that for any w ∈ Σ
∗
, w ∈ L(r) if
and only if pump
k
(w) ∈ L(s).
Using this lemma, we can now prove the following theorem.
Theorem 31. For any n ∈ N there exists an RE(&) r
&
n
of size O(n
2
) such
that any DFA accepting L(r
&
n
) contains at least 2
2
n
states.
Proof. Let n ∈ N and consider the expression r
∩
n
of size O(n) constructed in
Theorem 29 such that any DFA accepting L(r
∩
n
) contains at least 2
2
n
states.
Now, let r
&
n
be the regular expression simulating r
∩
n
obtained from Lemma 30,
such that for some k ∈ N and any w ∈ Σ
∗
, w ∈ L(r
∩
n
) if and only if pump
k
(w) ∈
L(r
&
n
). Then, r
&
n
is of size O(n
2
) and, exactly as before, it is straightforward
to show that any DFA accepting L(r
&
n
) contains at least 2
2
n
states.
4.4 Succinctness w.r.t. Regular Expressions
In this section, we study the translation of extended regular expressions to
(standard) regular expressions. First, for the class RE(#) it has already been
shown by Kilpelainen and Tuhkanen [64] that this translation can be done in
exponential time, and that an exponential size increase can not be avoided.
Furthermore, from Proposition 25 and the fact that any NFA with n states
can be translated into a regular expression in time 2
O(n)
(Theorem 2(1)) it
immediately follows that:
48 Succinctness of Extended Regular Expressions
Proposition 32. Let r be a RE(&, ∩, #). A regular expression s equivalent
to r can be constructed in time 2
2
O(r)
Furthermore, from Theorem 20 it follows that in a translation from RE(∩)
to standard regular expressions, a double exponential size increase can not be
avoided.
Hence, it only remains to show a double exponential lower bound on the
translation from RE(&) to standard regular expressions, which is exactly what
we will do in the rest of this section. Thereto, we proceed in several steps and
deﬁne several families of languages. First, we revisit the family of languages
(K
n
)
n∈N
, on which all following languages will be based, and establish its star
height. The star height of languages will be our tool for proving lower bounds
on the size of regular expressions deﬁning these languages. Then, we deﬁne
the family (L
n
)
n∈N
which is a binary encoding of (K
n
)
n∈N
, diﬀerent from the
binary encoding introduced in Section 3.1, and show that these languages can
be deﬁned as the intersection of small regular expressions.
Finally, we deﬁne the family (M
n
)
n∈N
which is obtained by simulating the
intersection of the previously obtained regular expressions by the interleaving
of related expressions, similar to the simulation of RE(∩) by RE(&) in Sec
tion 4.3. Bringing everything together, this then leads to the desired result: a
double exponential lower bound on the translation of RE(&) to RE.
4.4.1 K
n
: The Basic Language
Recall from Section 3.1 that the language K
n
, for any n ∈ N, consists of all
strings of the form a
0,i
1
a
i
1
,i
2
· · · a
i
k
,n−1
where k ∈ N∪{0}. We now determine
the star height of K
n
.
Lemma 33. For any n ∈ N, sh(K
n
) = n.
Proof. We start by observing that, for any n ∈ N, the language K
n
is bide
terministic. Indeed, the inverse of the DFA A
K
n
accepting K
n
is again deter
ministic as every transition is labeled with a diﬀerent symbol. Furthermore,
A
K
n
is the minimal trimmed DFA accepting K
n
. Hence, by Theorem 24(3),
sh(K
n
) = cr(A
K
n
). We conclude by showing that, for any n ∈ N, cr(A
K
n
) = n.
Thereto, ﬁrst observe that the graph underlying A
K
n
is the complete graph
(including selfloops) on n nodes, which we denote by K
n
. We proceed by
induction on n. For n = 1, the graph is a single node with a self loop and
hence by deﬁnition cr(K
1
) = 1. For the inductive step, suppose that cr(K
n
) =
n and consider K
n+1
with node set V
n+1
. Since V
n+1
consists of only one
strongly connected component, cr(K
n+1
) = 1 + min
v∈V
n+1
{cr(K
n+1
[V
n+1
\
{v}])}. However, for any v ∈ V
n+1
, K
n+1
[V
n+1
\{v}] is isomorphic to K
n
, and
hence by the induction hypothesis cr(K
n+1
) = n + 1.
4.4. Succinctness w.r.t. Regular Expressions 49
4.4.2 L
n
: Binary Encoding of K
n
In this section we want to construct a set of small regular expressions such
that any expression deﬁning their intersection must be large (that is, of double
exponential size). Ideally, we would like to use the languages of the family
(K
n
)
n∈N
for this as we have shown in the previous section that they have
a large star height, and thus by Theorem 22 can not be deﬁned by small
expressions. Unfortunately, this is not possible as the alphabet of (K
n
)
n∈N
grows quadratically with n.
Therefore, we will introduce in this section the family of languages (L
n
)
n∈N
which is a binary encoding of (K
n
)
n∈N
over a ﬁxed alphabet. Thereto, let n ∈ N
and recall that K
n
is deﬁned over the alphabet Σ
n
= {a
i,j
 i, j ∈ [0, n − 1]}.
Now, for a
i,j
∈ Σ
n
, deﬁne the function ρ
n
as
ρ
n
(a
i,j
) = #enc(j)$enc(i)△enc(i + 1)△· · · △enc(n −1)△,
where enc(k), for k ∈ N, denotes the ⌈log(n)⌉bit marked number encoding
k as deﬁned in the proof of Theorem 29. So, the encoding starts by the
encoding of the second index, followed by an ascending sequence of encodings
of all numbers from the ﬁrst index to n−1. We extend the deﬁnition of ρ
n
to
strings in the usual way: ρ
n
(a
0,i
1
· · · a
i
k−1
,n−1
) = ρ
n
(a
0,i
1
) · · · ρ
n
(a
i
k
,n−1
). We
are now ready to deﬁne L
n
.
Deﬁnition 34. Let Σ = {0, 1, 0, 1, $, #, △}. For n ∈ N, L
n
= {ρ
n
(w)  w ∈
K
n
}.
For instance, for n = 3, a
0,1
a
1,2
∈ K
3
and hence ρ
3
(a
0,1
a
1,2
) = #01$00△01
△10△#10$01△10△ ∈ L
3
. We now show that this encoding does not aﬀect
the star height.
Lemma 35. For any n ∈ N, sh(L
n
) = n.
Proof. Let n ∈ N. We ﬁrst show that sh(L
n
) ≤ n. By Lemma 33, sh(K
n
) =
n, and hence there exists a regular expression r
K
, with L(r
K
) = K
n
and
sh(r
K
) = n. Let r
L
be the regular expression obtained from r
K
by replacing
every symbol a
i,j
by ρ
n
(a
i,j
). Obviously, L(r
L
) = L
n
and sh(r
L
) = n, and
hence sh(L
n
) ≤ n.
We now show that sh(L
n
) ≥ n. The proof is along the same lines as the
proof of Lemma 33. We ﬁrst show that L
n
is bideterministic and can then
determine its star height by looking at the minimal DFA accepting it. In fact,
the reason we use a slightly involved encoding of K
n
, diﬀerent from the one in
Chapter 3, is precisely the bideterminism property.
To show that L
n
is bideterministic, we now construct the minimal DFA A
L
n
accepting L
n
. Here, A
L
n
will consist of n identical subautomata B
0
n
to B
n−1
n
50 Succinctness of Extended Regular Expressions
q
2
0
q
2
1
q
2
2
q
2
3
p
2
0
p
2
1
p
2
2
0 0
△
0 1
△
1 0
△ #
0
0
1
0
1
0
Figure 4.2: The automaton B
2
3
.
deﬁned as follows. For any i ∈ [0, n − 1], B
i
n
= (Q
i
, q
i
n
, δ
i
, F
i
) is the smallest
automaton for which Q
i
contains distinct states q
i
0
, . . . , q
i
n
and p
i
0
, . . . , p
i
n−1
such that for any j ∈ [0, n −1],
• (q
i
j
, enc(j)△, q
i
j+1
) ∈ δ
∗
i
, and for all w = enc(j)△, (q
i
j
, w, q
i
j+1
) / ∈ δ
∗
i
; and
• (q
i
n
, #enc(j), p
i
j
) ∈ δ
∗
i
, and for all w = #enc(j), (q
i
n
, w, p
i
j
) / ∈ δ
∗
i
.
As an example, Figure 4.2 shows B
2
3
. Now, A
L
n
= (Q, q
0
n
, δ, F) is deﬁned
as Q =
¸
i<n
Q
i
, F = {q
n−1
n
} and δ =
¸
i<n
δ
i
∪ {(p
i
j
, $, q
j
i
)  i, j ∈ [0, n − 1]}.
Figure 4.3 shows A
L
3
.
To see that A
L
n
indeed accepts L
n
, notice that after reading a substring
#enc(i), for some i, the automaton moves to subautomaton B
i
n
. Then, af
ter passing through B
i
n
which ends by reading a new substring #enc(j), the
automaton moves to state q
j
i
of subautomaton B
j
n
. This ensures that the
subsequent ascending sequence of numbers starts with enc(i). Hence, the au
tomaton checks correctly whether the numbers which should be equal, are
equal.
Furthermore, A
L
n
is bideterministic and minimal. To see that it is bideter
ministic, notice that all states except the q
i
j
only have one incoming transition.
Furthermore, the q
i
j
each have exactly two incoming transitions, one labeled
with $ and one labeled with △. Minimality follows immediately from the fact
that A
L
n
is both trimmed and bideterministic.
Now, since L
n
is bideterministic and A
L
n
is the minimal trimmed DFA
accepting L
n
, it follows from Theorem 24(3) that sh(L
n
) = cr(A
L
n
). Therefore,
it suﬃces to show that cr(A
L
n
) ≥ n. Thereto, observe that A
K
n
, the minimal
DFA accepting K
n
, is a minor of A
L
n
. Indeed, we can easily contract edges and
remove nodes from A
K
n
such that the only remaining nodes are q
0
n
to q
n−1
n
, and
such that they form a complete graph. Now, since it is shown in Lemma 33
that cr(A
K
n
) = n and by Lemma 23 and the fact that A
K
n
is a minor of A
L
n
,
4.4. Succinctness w.r.t. Regular Expressions 51
q
0
0
q
0
1
q
0
2
q
0
3
p
0
0
p
0
1
p
0
2
00△ 01△ 10△ #01
#00
#10
q
1
0
q
1
1
q
1
2
q
1
3
p
1
0
p
1
1
p
1
2
00△ 01△ 10△ #01
#00
#10
q
2
0
q
2
1
q
2
2
q
2
3
p
2
0
p
2
1
p
2
2
00△ 01△ 10△ #01
#00
#10
$ $
$
$
$ $
$ $
$
B
0
3
B
1
3
B
2
3
Figure 4.3: The DFA A
L
3
, accepting L
3
.
we know that cr(A
L
n
) ≥ cr(A
K
n
), it follows that sh(L
n
) = cr(A
L
n
) ≥ n. This
completes the proof.
Furthermore, it can be shown that L
n
can be described as the intersection
of a set of small regular expressions.
Lemma 36. For every n ∈ N, there are regular expressions r
1
, . . . , r
m
, with
m = 4n + 3, each of size O(n), such that
¸
i≤m
L(r
i
) = L
2
n.
Proof. Let n ∈ N, and recall that L
2
n is deﬁned over the alphabet Σ =
{0, 1, 0, 1, $, #, △}. Let N = {0, 1, 0, 1}, D = {$, #, △} and for any σ ∈ Σ, let
Σ
σ
= Σ \ {σ}. The expressions are as follows.
The format of the string has to be correct:
(#N
n
$N
n
△(N
n
△)
+
)
+
.
Every number should be properly marked:
(D
+
((0 + 1)
∗
10
∗
+ 0
∗
))
∗
.
Every number before a $ should be equal to the number following the next
$. We deﬁne two sets of regular expressions. First, for all i ∈ [0, n − 1], the
52 Succinctness of Extended Regular Expressions
(i + 1)th bit of the number before an even $ should be equal to the (i + 1)th
bit of the number after the next $.
(#N
i
¸
σ∈N
(σΣ
∗
$
$Σ
∗
$
$N
i
σ)Σ
∗
#
)
∗
(ε + #Σ
∗
#
) .
Second, for all i ∈ [0, n − 1], the (i + 1)th bit of the number before an odd $
should be equal to the (i + 1)th bit of the number after the next $.
#Σ
∗
#
(#N
i
¸
σ∈N
(σΣ
∗
$
$Σ
∗
$
$N
i
σ)Σ
∗
#
)
∗
(ε + #Σ
∗
#
) .
Every two (marked) numbers surrounding a △ should diﬀer by exactly one.
Again, we deﬁne two sets of regular expressions. First, for all i ∈ [0, n − 1],
the (i + 1)th bit of the number before an even △ should properly match the
(i + 1)th bit of the next number:
(#Σ
∗
$
((△+$)Σ
i
((0+0)Σ
n
#
(1+0)+(1+1)Σ
n
#
(0+1))Σ
n−i−1
)
∗
((△+$)Σ
n
+ε)△)
∗
.
Second, for all i ∈ [0, n −1], the (i + 1)th bit of the number before an odd △
should properly match the (i + 1)th bit of the next number:
(#Σ
∗
$
$Σ
n
(△Σ
i
((0 + 0)Σ
n
#
(1 + 0) + (1 + 1)Σ
n
#
(0 + 1))Σ
n−i−1
)
∗
(△Σ
n
+ε)△)
∗
.
Every ascending sequence of numbers should end with enc(2
n
−1) = 1
n−1
1:
(#Σ
∗
#
1
n−1
1△)
∗
.
4.4.3 M
n
: Succinctness of RE(&)
In this section, we will ﬁnally show that RE(&) are double exponentially more
succinct than standard regular expressions. We do this by simulating the
intersection of the regular expressions obtained in the previous section, by
the interleaving of related expressions, similar to the simulation of RE(∩) by
RE(&) in Section 4.3. This approach will partly yield the following family of
languages. For any n ∈ N, deﬁne
M
n
= {pump
4⌈log n⌉+3
(w)  w ∈ L
n
} .
As M
n
is very similar to L
n
, we can easily extend the result on the star
height of L
n
(Lemma 35) to M
n
:
Lemma 37. For any n ∈ N, sh(M
n
) = n.
4.4. Succinctness w.r.t. Regular Expressions 53
Proof. The proof is along the same lines as the proof of Lemma 35. Again,
M
n
is bideterministic as witnessed by the minimal DFA A
M
n
accepting M
n
.
In fact, A
M
n
is simply obtained from A
L
n
by replacing each transition over a
symbol a by a sequence of states reading a
m
c
m
, with m = 4⌈log n⌉ +3. Some
care has to be taken, however, for the states q
i
j
, as these have two incoming
transitions: one labeled with $ and one labeled with △. Naively creating these
two sequences of states leading to q
i
j
, we obtain a nonminimal DFA. However,
if we merge the last m states of these two sequences (the parts reading c
m
),
we obtain the minimal trimmed DFA A
M
n
accepting M
n
. Then, the proof
proceeds exactly as the proof of Lemma 35.
However, the language we will eventually deﬁne will not be exactly M
n
.
Therefore, we need an additional lemma, for which we ﬁrst introduce some
notation. For k ∈ N, and an alphabet Σ, we deﬁne Σ
(k)
to be the language de
ﬁned by the expression (
¸
σ∈Σ
σ
k
)
∗
, i.e., all strings which consist of a sequence
of blocks of identical symbols of length k. Further, for a language L, deﬁne
index(L) = max {i  i ∈ N ∧ ∃w, w
′
∈ Σ
∗
, a ∈ Σ such that wa
i
w
′
∈ L}. Notice
that index(L) can be inﬁnite. However, we will only be interested in languages
for which it is ﬁnite, as in the following lemma.
Lemma 38. Let L be a regular language, and k ∈ N, such that index(L) ≤ k.
Then, sh(L) · Σ ≥ sh(L ∩ Σ
(k)
).
Proof. Let L be a regular language, and k ∈ N, such that index(L) ≤ k. We
show that, given a nonreturning statelabeled NFA A accepting L, we can
can construct an NFA B such that L(B) = L ∩ Σ
(k)
and B is a minor of A.
We show how this implies the lemma. Let A be any nonreturning state
labeled NFA of minimal cycle rank accepting L, and let B be as constructed
above. First, since B is a minor of A it follows from Lemma 23 that cr(B) ≤
cr(A). Furthermore, by Theorem 24(1), cr(B) ≥ sh(L(B)) = sh(L ∩ Σ
(k)
),
and hence cr(A) ≥ sh(L ∩ Σ
(k)
). Now, since A is a nonreturning state
labeled NFA of minimal cycle rank, we can conclude from Theorem 24(2) that
sh(L) · Σ ≥ cr(A), and thus sh(L) · Σ ≥ sh(L ∩ Σ
(k)
).
We conclude by giving the construction of B. Thereto, let A = (Q, q
0
, δ, F)
be a nonreturning statelabeled NFA. Now, for any q ∈ Q, and a ∈ Σ, deﬁne
the function in
a
(q) as follows:
in
a
(q) = max{i  i ∈ N ∪ {0} ∧ ∃w ∈ Σ
∗
such that (q
0
, wa
i
, q) ∈ δ
∗
} .
Notice that as i can equal zero and index(L) is ﬁnite, in
a
(q) is well deﬁned
for any state which is not useless. Intuitively, in
a
(q) represents the maximal
number of a symbols which can be read when entering state q.
54 Succinctness of Extended Regular Expressions
Now, the following algorithm transforms A, accepting L, into B, accepting
L ∩ Σ
(k)
. Repeat the following two steps, until no more changes are made:
1. Apply one of the following rules, if possible:
(a) If exists q, q
′
∈ Q, a ∈ Σ, with (q, a, q
′
) ∈ δ, such that in
a
(q
′
) >
in
a
(q) + 1 ⇒ remove (q, a, q
′
) from δ.
(b) If exists q, q
′
∈ Q, a ∈ Σ, with (q, a, q
′
) ∈ δ and q = q
0
, such that
in
symbol(q)
(q) < k and symbol(q) = a ⇒ remove (q, a, q
′
) from δ.
(c) If exists q ∈ F, q = q
0
, such that in
symbol(q)
(q) < k ⇒ make q
nonﬁnal, i.e., remove q from F.
2. Remove all useless states from A, and recompute in
a
(q) for all q ∈ Q,
a ∈ Σ.
It remains to show that B, the automaton obtained when no more rules
can be applied, is the desired automaton. That is, that B is a minor of A and
that L(B) = L(A) ∩ Σ
(k)
. It is immediate that B is a minor of A since B is
obtained from A by only removing transitions or states.
To show that L(B) = L(A) ∩Σ
(k)
, we ﬁrst prove that L(A) ∩Σ
(k)
⊆ L(B).
Thereto, let A
1
, . . . , A
n
, with A
1
= A and A
n
= B, be the sequence of NFAs
produced by the algorithm, where each A
i
is obtained from A
i−1
by applying
exactly one rule and possibly removing useless states. It suﬃces to show
that for all i ∈ [1, n − 1], L(A
i
) ∩ Σ
(k)
⊆ L(A
i+1
) ∩ Σ
(k)
, as L(A) ∩ Σ
(k)
⊆
L(B) ∩ Σ
(k)
⊆ L(B) then easily follows.
Before proving L(A
i
) ∩Σ
(k)
⊆ L(A
i+1
) ∩Σ
(k)
, we introduce some notation.
For any q ∈ Q, a ∈ Σ, we deﬁne
out
a
(q) = max{i  i ∈ N ∪ {0} ∧ ∃w ∈ Σ
∗
, q
f
∈ F such that (q, a
i
w, q
f
) ∈ δ
∗
}.
Here, out
a
(q) is similar to in
a
(q) and represents the maximal number of a’s
which can be read when leaving q. Since of course L(A
i
) ⊆ L(A) holds for all
i ∈ [1, n], it also holds that index(A
i
) ≤ k. Therefore, for any A
i
, any state q
of A
i
and any a ∈ Σ, it holds that
in
a
(q) + out
a
(q) ≤ k . (4.1)
We are now ready to show that L(A
i
)∩Σ
(k)
⊆ L(A
i+1
)∩Σ
(k)
. Thereto, we
prove that for any w ∈ L(A
i
) ∩ Σ
(k)
, any accepting run of A
i
on w is still an
accepting run of A
i+1
on w. More precisely, we prove for every rule separately
that if A
i+1
is obtained from A
i
by applying this rule, then the assumption
that the removed state or transition is used in an accepting run of A
i
on w
leads to a contradiction.
4.4. Succinctness w.r.t. Regular Expressions 55
For rule (1a), suppose transition (q, a, q
′
) is removed but (q, a, q
′
) occurs
in an accepting run of A
i
on w, i.e., w = uav for u, v ∈ Σ
∗
, (q
0
, u, q) ∈ δ
∗
and
(q
′
, v, q
f
) ∈ δ
∗
, for some q
f
∈ F. Since w ∈ Σ
(k)
, there exists a j ∈ [0, n − 1]
and u
′
, v
′
∈ Σ
∗
such that u = u
′
a
j
and v = a
k−j−1
v
′
. It immediately follows
that in
a
(q) ≥ j, and out
a
(q
′
) ≥ k − j − 1. Then, since in
a
(q
′
) > in
a
(q) + 1,
we have in
a
(q
′
) + out
a
(q
′
) > j +k −j −1 + 1 = k. However, this contradicts
equation (4.1).
For rules (1b) and (1c), we describe the obtained contradictions less formal.
For rule (1b), if the transition (q, a, q
′
), with symbol(q) = a, is used in an
accepting run on w, then q is entered after reading k symbol(q) symbols,
and hence in
symbol(q)
(q) ≥ k. Contradiction. For rule (1c), again, if q is the
accepting state of some run on w, then q must be preceded by k symbol(q)
symbols, and hence in
symbol(q)
(q) ≥ k, contradiction.
We now show L(B) ⊆ L(A)∩Σ
(k)
. Thereto, let B = (Q
B
, q
B
0
, δ
B
, F
B
). We
ﬁrst make the following observation. Let (q, a, q
′
) be a transition in δ
B
. Since
B does not contain useless states, it is easy to see that in
a
(q
′
) ≥ in
a
(q) +1. As
rule (1a) can not be applied to transition (q, a, q
′
), we now obtain the following
equality:
For q, q
′
∈ Q
B
, a ∈ Σ, with (q, a, q
′
) ∈ δ
B
: in
a
(q) + 1 = in
a
(q
′
) . (4.2)
We are now ready to show that L(B) ⊆ L(A) ∩ Σ
(k)
. Since L(B) ⊆ L(A)
deﬁnitely holds, it suﬃces to show that L(B) = L(B) ∩ Σ
(k)
, i.e., B only
accepts strings in Σ
(k)
.
Towards a contradiction, suppose that B accepts a string w / ∈ Σ
(k)
. Then,
there must exist i ∈ [1, k − 1], a ∈ Σ, u ∈ L(ε + Σ
∗
(Σ \ {a})) and v ∈
L(ε + (Σ \ {a})Σ
∗
), such that w = ua
i
v. Furthermore, as w ∈ L(B), there
also exist states p
0
, . . . , p
i
, such that (q
0
, u, p
0
) ∈ δ
∗
B
, (p
i
, v, q
f
) ∈ δ
∗
B
for some
q
f
∈ F, and (p
j
, a, p
j+1
) ∈ δ
B
for all j ∈ [0, i −1].
We ﬁrst argue that in
a
(p
i
) = k. By (4.1) it suﬃces to show that in
a
(p
i
) ≥ k.
Notice that, as i > 0, symbol(p
i
) = a. We consider two cases. If v = ε, then
p
i
∈ F and hence by rule (1c) in
a
(p
i
) ≥ k. Otherwise, v = bv
′
for b ∈ Σ, b = a
and v
′
∈ Σ
∗
and hence p
i
must have an outgoing blabeled transition, with
b = a = symbol(p
i
). By rule (1b), in
a
(p
i
) ≥ k must hold.
Now, by repeatedly applying equation (4.2), we obtain in
a
(p
j
) = k−(i−j),
for all j ∈ [0, i] and, in particular, in
a
(p
0
) = k−i > 0. This gives us the desired
contradiction. To see why, we distinguish two cases. If u = ε, then p
0
= q
0
. As
A was nonreturning and we did not introduce any new transitions, B is also
nonreturning and hence in
a
(q
0
) = 0 should hold. Otherwise, if u = u
′
b for
u
′
∈ Σ
∗
and b = a, then p
0
has an incoming b transition. As B is statelabeled,
56 Succinctness of Extended Regular Expressions
p
0
does not have an incoming a transition, and hence, again, in
a
(p
0
) = 0 should
hold. This concludes the proof.
Now, we are ﬁnally ready to prove the desired theorem:
Theorem 39. For every n ∈ N, there are regular expressions s
1
, . . . , s
m
, with
m = 4n + 3, each of size O(n), such that any regular expression deﬁning
L(s
1
) &L(s
2
) &· · · &L(s
m
) is of size at least 2
1
24
(2
n
−8)
−1.
Proof. Let n ∈ N, and let r
1
, . . . , r
m
, with m = 4n +3, be the regular expres
sions obtained in Lemma 36 such that
¸
i≤m
L(r
i
) = L
2
n.
Now, similar to Lemma 30, it is shown in [79] that given r
1
, . . . , r
m
it
is possible to construct regular expressions s
1
, . . . , s
m
such that (1) for all
i ∈ [1, m], s
i
 ≤ 2r
i
, and if we deﬁne N
2
n = L(s
1
) & · · · & L(s
m
), then (2)
index(N
2
n) ≤ m, and (3) for every w ∈ Σ
∗
, w ∈
¸
i≤m
L(r
i
) if and only if
pump
m
(w) ∈ N
2
n. Furthermore, it follows immediately from the construction
in [79] that any string in N
2
n ∩ Σ
(m)
is of the form a
m
1
c
m
a
m
2
c
m
· · · a
m
l
c
m
, i.e.,
pump
m
(w) for some w ∈ Σ
∗
.
Since
¸
i≤m
L(r
i
) = L
2
n, and M
2
n = {pump
m
(w)  w ∈ L
2
n}, it hence
follows that M
2
n = N
2
n ∩Σ
(m)
. As furthermore, by Lemma 37, sh(M
2
n) = 2
n
and index(N
2
n) ≤ m, it follows from Lemma 38 that sh(N
2
n) ≥
sh(M
2
n)
Σ
=
2
n
8
.
So, N
2
n can be described by the interleaving of the expressions s
1
to s
m
, each
of size O(n), but any regular expression deﬁning N
2
n must, by Theorem 22,
be of size at least 2
1
24
(2
n
−8)
−1. This completes the proof.
Corollary 40. For any n ∈ N, there exists an RE(&) r
n
of size O(n
2
) such
that any regular expression deﬁning L(r
n
) must be of size at least 2
1
24
(2
n
−8)
−1.
This completes the chapter. As a ﬁnal remark, we note that all lower
bounds in this chapter make use of a constant size alphabet and can fur
thermore easily be extended to a binary alphabet. For any language over an
alphabet Σ = {a
1
, . . . , a
k
}, we obtain a new language over the alphabet {b, c}
by replacing, for any i ∈ [1, k], every symbol a
i
by b
i
c
k−i+1
. Obviously, the size
of a regular expression for this new language is at most k +1 times the size of
the original expression, and the lower bounds on the number of states of DFAs
trivially carry over. Furthermore, it is shown in [81] that this transformation
does not aﬀect the star height, and hence the lower bounds on the sizes of the
regular expression also carry over.
5
Complexity of Extended
Regular Expressions
In this chapter, we study the complexity of the equivalence, inclusion, and in
tersection nonemptiness problem for regular expressions extended with count
ing and interleaving operators. The main reason for this study is pinpoint
ing the complexity of these decision problems for XML schema languages, as
done in Chapter 8. Therefore, we also investigate CHAin Regular Expressions
(CHAREs) extended with a counting operator. CHAREs are a subclass of the
regular expressions introduced by Martens et. al. [75] relevant to XML schema
languages. To study these problems, we introduce NFA(#, &)s, an extension
of NFAs with counter and split/merge states for dealing with counting and
interleaving operators. An overview of the results is given in Tables 5.1 and
5.2.
Outline. In Section 5.1, we deﬁne NFA(#, &). In Section 5.2 we establish
the complexity of the basic decision problems for diﬀerent classes of regular
expressions.
5.1 Automata for RE(#, &)
We introduce the automaton model NFA(#, &). In brief, an NFA(#, &) is an
NFA with two additional features: (i) split and merge transitions to handle
interleaving; and, (ii) counting states and transitions to deal with numerical
58 Complexity of Extended Regular Expressions
inclusion equivalence intersection
RE pspace ([102]) pspace ([102]) pspace ([70])
RE(&) expspace ([79]) expspace ([79]) PSPACE
RE(#) expspace ([83]) expspace ([83]) PSPACE
RE(#, &) EXPSPACE EXPSPACE PSPACE
NFA(#), NFA(&),
and NFA(#, &)
EXPSPACE EXPSPACE PSPACE
Table 5.1: Overview of some new and known complexity results. All results
are completeness results. The new results are printed in bold.
occurrence constraints. The idea of split and merge transitions stems from
J¸ edrzejowicz and Szepietowski [60]. Their automata are more general than
NFA(#, &) as they can express shuﬄeclosure which is not regular. Counting
states are also used in the counter automata of Kilpel¨ ainen and Tuhkanen [65],
and Reuter [93] although these counter automata operate quite diﬀerently
from NFA(#)s. Zilio and Lugiez [26] also proposed an automaton model that
incorporates counting and interleaving by means of Presburger formulas. None
of the cited papers consider the complexity of the basic decision problems of
their model. We will use NFA(#, &)s to obtain complexity upper bounds in
Section 5.2 and Chapter 8.
For readability, we denote Σ ∪ {ε} by Σ
ε
. We then deﬁne an NFA(#, &)
as follows.
Deﬁnition 41. An NFA(#, &) is a tuple A = (Q, s, f, δ) where
• Q is a ﬁnite set of states. To every q ∈ Q, we associate a lower bound
min(q) ∈ N and an upper bound max(q) ∈ N∪{∞}, such that min(q) ≤
max(q);
• s, f ∈ Q are the start and ﬁnal states, respectively; and,
• δ is the transition relation and is a subset of the union of the following
sets:
(1) Q×Σ
ε
×Q ordinary transition (resets the counter)
(2) Q×{store} ×Q transition not resetting the counter
(3) Q×{split} ×Q×Q split transition
(4) Q×Q×{merge} ×Q merge transition
Let max(A) = max{{max(q)  q ∈ Q ∧ max(q) ∈ N} ∪ {min(q)  q ∈
Q ∧ max(q) = ∞}}. A conﬁguration γ is a pair (P, α) where P ⊆ Q is a set
of states and α : Q → {0, . . . , max(A)} is the value function mapping states
5.1. Automata for RE(#, &) 59
to the value of their counter. For a state q ∈ Q, we denote by α
q
the value
function mapping q to 1 and every other state to 0. The initial conﬁguration γ
s
is ({s}, α
s
). The ﬁnal conﬁguration γ
f
is ({f}, α
f
). When α is a value function
then α[q = 0] denotes the function obtained from α by setting the value of q
to 0 while leaving other values unchanged. Similarly, α[q
++
] increments the
value of q by 1, unless max(q) = ∞ and α(q) = min(q) in which case also the
value of q remains unchanged. The rationale behind the latter is that for a
state which is allowed an unlimited number of iterations, as max(q) = ∞, it
is suﬃcient to know that α(q) is at least min(q).
We now deﬁne the transition relation between conﬁgurations. Intuitively,
the value of the state at which the automaton arrives is always incremented by
one. When exiting a state, the state’s counter is always reset to zero, except
when we exit through a counting transition, in which case the counter remains
the same. In addition, exiting a state through a noncounting transition is
only allowed when the value of the counter lies between the allowed minimum
and maximum. The latter, hence, ensures that the occurrence constraints are
satisﬁed. Split and merge transitions start and close a parallel composition.
A conﬁguration γ
′
= (P
′
, α
′
) immediately follows a conﬁguration γ =
(P, α) by reading σ ∈ Σ
ε
, denoted γ →
A,σ
γ
′
, when one of the following
conditions hold:
1. (ordinary transition) there is a q ∈ P and (q, σ, q
′
) ∈ δ such that
min(q) ≤ α(q) ≤ max(q), P
′
= (P −{q}) ∪{q
′
}, and α
′
= α[q = 0][q
′++
].
That is, A is in state q and moves to state q
′
by reading σ (note that
σ can be ε). The latter is only allowed when the counter value of q is
between the lower and upper bound. The state q is replaced in P by q
′
.
The counter of q is reset to zero and the counter of q
′
is incremented by
one.
2. (counting transition) there is a q ∈ P and (q, store, q
′
) ∈ δ such that
α(q) < max(q), P
′
= (P − {q}) ∪ {q
′
}, and α
′
= α[q
′++
]. That is, A is
in state q and moves to state q
′
by reading ε when the counter of q has
not reached its maximal value yet. The state q is replaced in P by q
′
.
The counter of q is not reset but remains the same. The counter of q
′
is
incremented by one.
3. (split transition) there is a q ∈ P and (q, split, q
1
, q
2
) ∈ δ such that
min(q) ≤ α(q) ≤ max(q), P
′
= (P − {q}) ∪ {q
1
, q
2
}, and α
′
= α[q =
0][q
1
++
][q
2
++
]. That is, A is in state q and splits into states q
1
and q
2
by reading ε when the counter value of q is between the lower and upper
bound. The state q in P is replaced by (split into) q
1
and q
2
. The counter
of q is reset to zero, and the counters of q
1
and q
2
are incremented by
60 Complexity of Extended Regular Expressions
one.
4. (merge transition) there are q
1
, q
2
∈ P and (q
1
, q
2
, merge, q) ∈ δ such
that, for each j = 1, 2, min(q
j
) ≤ α(q
j
) ≤ max(q
j
), P
′
= (P −{q
1
, q
2
}) ∪
{q}, and α
′
= α[q
1
= 0][q
2
= 0][q
++
]. That is, A is in states q
1
and q
2
and moves to state q by reading ε when the respective counter values of
q
1
and q
2
are between the lower and upper bounds. The states q
1
and
q
2
in P are replaced by (merged into) q, the counters of q
1
and q
2
are
reset to zero, and the counter of q is incremented by one.
For a string w and two conﬁgurations γ, γ
′
, we denote by γ ⇒
A,w
γ
′
when
there is a sequence of conﬁgurations γ →
A,σ
1
· · · →
A,σ
n
γ
′
such that w =
σ
1
· · · σ
n
. The latter sequence is called a run when γ is the initial conﬁguration
γ
s
. A string w is accepted by A if and only if γ
s
⇒
A,w
γ
f
with γ
f
the ﬁnal
conﬁguration. We usually denote ⇒
A,w
simply by ⇒
w
when A is clear from
the context. We denote by L(A) the set of strings accepted by A. The size
of A, denoted by A, is Q +δ + Σ
q∈Q
log(min(q)) + Σ
q∈Q
log(max(q)). So,
each min(q) and max(q) is represented in binary.
10,12
dvd
store
store
cd
10,12
Figure 5.1: An NFA(#, &) for the language dvd
10,12
&cd
10,12
. For readability,
we only displayed the alphabet symbol on nonepsilon transitions and counters
for states q where min(q) and max(q) are diﬀerent from one. The arrows
from the initial state and to the ﬁnal state are split and merge transitions,
respectively. The arrows labeled store represent counting transitions.
An example of an NFA(#, &) deﬁning dvd
10,12
& cd
10,12
is shown in Fig
ure 5.1. An NFA(#) is an NFA(#, &) without split and merge transitions.
An NFA(&) is an NFA(#, &) without counting transitions.
Clearly NFA(#, &) accept all regular languages. The next theorem shows
the complexity of translating between RE(#, &) and NFA(#, &), and between
NFA(#, &) and NFA.
In its proof we will make use of the corridor tiling problem. Thereto,
a tiling instance is a tuple T = (X, H, V, b, t, n) where X is a ﬁnite set of tiles,
H, V ⊆ X × X are the horizontal and vertical constraints, n is an integer in
5.1. Automata for RE(#, &) 61
unary notation, and b, t are ntuples of tiles (b and t stand for bottom row and
top row, respectively).
A correct corridor tiling for T is a mapping λ : {1, . . . , m}×{1, . . . , n} → X
for some m ∈ N such that the following constraints are satisﬁed:
• the bottom row is b: b = (λ(1, 1), . . . , λ(1, n));
• the top row is t: t = (λ(m, 1), . . . , λ(m, n));
• all vertical constraints are satisﬁed: ∀i < m, ∀j ≤ n, (λ(i, j), λ(i+1, j)) ∈
V ; and,
• all horizontal constraints are satisﬁed: ∀i ≤ m, ∀j < n, (λ(i, j), λ(i, j +
1)) ∈ H.
The corridor tiling problem asks, given a tiling instance, whether there
exists a correct corridor tiling. The latter problem is pspacecomplete [104].
We are now ready to prove Theorem 42.
Theorem 42. (1) Given an RE(#, &) expression r, an equivalent NFA(#, &)
can be constructed in time linear in the size of r.
(2) Given an NFA(#, &) A, an equivalent NFA can be constructed in time
exponential in the size of A.
Proof. (1) We prove the theorem by induction on the structure of RE(#, &)
expressions. For every r we deﬁne a corresponding NFA(#, &) A(r) = (Q
r
, s
r
,
f
r
, δ
r
) such that L(r) = L(A(r)).
For r of the form ε, a, r
1
· r
2
, r
1
+r
2
and r
∗
1
these are the usual RE to NFA
constructions with εtransition as displayed in text books such as [55].
We perform the following steps for the numerical occurrence and interleav
ing operator which are graphically illustrated in Figure 5.2. The construction
for the interleaving operator comes from [60].
(i) If r = (r
1
)
k,ℓ
and A(r
1
) = (Q
1
, s
1
, f
1
, δ
1
), then
• Q
r
:= Q
r
1
⊎ {s
r
, f
r
, q
r
};
• min(s
r
) = max(s
r
) = min(f
r
) = max(f
r
) = 1, min(q
r
) = k, and
max(q
r
) = ℓ;
• if k = 0 then δ
r
:= δ
r
1
⊎{(s
r
, ε, s
r
1
), (f
r
1
, ε, q
r
), (q
r
, store, s
r
1
), (q
r
, ε, f
r
)};
and,
• if k = 0 then δ
r
:= δ
r
1
⊎{(s
r
, ε, s
r
1
), (f
r
1
, ε, q
r
), (q
r
, store, s
r
1
), (q
r
, ε, f
r
),
(s
r
, ε, f
r
)}.
62 Complexity of Extended Regular Expressions
(ii) If r = r
1
& r
2
, A(r
1
) = (Q
r
1
, s
r
1
, f
r
1
, δ
r
1
) and A(r
2
) = (Q
r
2
, s
r
2
, f
r
2
, δ
r
2
),
then
• Q
r
:= Q
r
1
⊎ Q
r
2
⊎ {s
r
, f
r
};
• min(s
r
) = max(s
r
) = min(f
r
) = max(f
r
) = 1;
• δ
r
:= δ
r
1
⊎ δ
r
2
⊎ {(s
r
, split, s
r
1
, s
r
2
), (f
r
1
, f
r
2
, merge, f
r
)}.
Notice that in each step of the construction, a constant number of states
are added to the automaton. Moreover, the constructed counters are linear
in the size of r. It follows that the size of A(r) is linear in the size of r.
The correctness of the construction can easily be proved by induction on the
structure of r.
(2) Let A = (Q
A
, s
A
, f
A
, δ
A
) be an NFA(#, &). We deﬁne an NFA B =
(Q
B
, q
0
B
, F
B
, δ
B
) such that L(A) = L(B). Formally,
• Q
B
= 2
Q
A
×({1, . . . , max(A)}
Q
A
);
• s
B
= ({s
A
}, α
s
A
);
• F
B
= {({f
A
}, α
f
A
)} if ε / ∈ L(A), and F
B
= {({f
A
}, α
f
A
), ({s
A
}, α
s
A
)},
otherwise;
• δ
B
= {
(P
1
, α
1
), σ, (P
2
, α
2
)
 σ ∈ Σ and (P
1
, α
1
) ⇒
A,σ
(P
2
, α
2
) for
conﬁgurations (P
1
, α
1
) and (P
2
, α
2
) of A}.
Obviously, B can be constructed from A in exponential time. Notice that
the size of Q
B
is smaller than 2
Q
A

· 2
A·Q
A

. Furthermore, as B mimics A
by storing the conﬁgurations of B in its states, it is immediate that L(A) =
L(B).
We next turn to the complexity of the basic decision problems for NFA(#, &).
Theorem 43. (1) equivalence and inclusion for NFA(#, &) are expspace
complete;
(2) intersection for NFA(#, &) is pspacecomplete; and,
(3) membership for NFA(#) is nphard, membership for NFA(&) and NFA(#,
&) are pspacecomplete.
Proof. (1) expspacehardness follows from Theorem 42(1) and the expspace
hardness of equivalence for RE(&) [79]. Membership in expspace follows
from Theorem 42(2) and the fact that inclusion for NFAs is in pspace [102].
5.1. Automata for RE(#, &) 63
store
s
r
s
r1
f
r1
f
r
q
r
ε
ε ε ε
k, ℓ
if k = 0
f
r2
s
r
f
r
s
r1
f
r1
s
r2
Figure 5.2: From RE(#, &) to NFA(#, &).
(2) The lower bound follows from [70]. We show that the problem is in pspace.
For j ∈ {1, . . . , n}, let A
j
= (Q
j
, s
j
, f
j
, δ
j
) be an NFA(#, &). The algorithm
proceeds by guessing a Σstring w such that w ∈
¸
n
j=1
L(A
j
). Instead of
guessing w at once, we guess it symbol by symbol and keep for each A
j
one current conﬁguration γ
j
on the tape. More precisely, at each time in
stant, the tape contains for each A
j
a conﬁguration γ
j
= (P
j
, α
j
) such that
γ
s
j
⇒
A
j
,w
i
(P
j
, α
j
), where w
i
= a
1
· · · a
i
is the preﬁx of w guessed up to now.
The algorithm accepts when each γ
j
is a ﬁnal conﬁguration. Formally, the
algorithm operates as follows.
1. Set γ
j
= ({s
j
}, α
s
j
) for j ∈ {1, . . . , n};
2. While not every γ
j
is a ﬁnal conﬁguration
(i) Guess an a ∈ Σ.
(ii) Nondeterministically replace each γ
j
by a (P
′
j
, α
′
j
) such that
(P
j
, α
j
) ⇒
A
j
,a
(P
′
j
, α
′
j
).
As the algorithm only uses space polynomial in the size of the NFA(#, &)
and step (2,ii) can be done using only polynomial space, the overall algorithm
operates in pspace.
(3) The membership problem for NFA(#, &)s is easily seen to be in pspace
by an ontheﬂy implementation of the construction in Theorem 42(2). Indeed,
as a conﬁguration of an NFA(#, &) A = (Q, s, f, δ) has size at most Q +Q ·
log(max(A)), we can store a conﬁguration using only polynomial space.
We show that the membership problem for NFA(#)s is nphard by a
reduction from a modiﬁcation of integer knapsack. We deﬁne this problem
as follows. Given a set of natural numbers W = {w
1
, . . . , w
k
} and two integers
64 Complexity of Extended Regular Expressions
.
.
.
w
k
, w
k
w
1
, w
1
q
w
k
q f s
ε
q
w
1
ε
ε
store
store
ε
store
m, n
ε
Figure 5.3: nphardness of membership for NFA(#).
m and n, all in binary notation, the problem asks whether there exists a
mapping τ : W →N such that m ≤
¸
w∈W
τ(w) ×w ≤ n. The latter mapping
is called a solution. This problem is known to be npcomplete [36].
We construct an NFA(#) A = (Q, s, f, δ) such that L(A) = {ε} if W, m, n
has a solution, and L(A) = ∅ otherwise.
The set of states Q consists of the start and ﬁnal states s and f, a state
q
w
i
for each weight w
i
, and a state q. Intuitively, a successful computation of
A loops at least m and at most n times through state q. In each iteration,
A also visits one of the states q
w
i
. Using numerical occurrence constraints,
we can ensure that a computation accepts if and only if it passes at least m
and at most n times through q and a multiple of w
i
times through each q
w
i
.
Hence, an accepting computation exists if and only if there is a mapping τ
such that m ≤
¸
w∈W
τ(w) ×w ≤ n.
Formally, the transitions of A are the following:
• (s, ε, q
w
i
) for each i ∈ {1, . . . , k};
• (q
w
i
, store, q) for each i ∈ {1, . . . , k};
• (q
w
i
, ε, q) for each i ∈ {1, . . . , k};
• (q, store, s); and,
• (q, ε, f).
We set min(s) = max(s) = min(f) = max(f) = 1, min(q) = m, max(q) = n
and min(q
w
i
) = max(q
w
i
) = w
i
for each q
w
i
. The automaton is graphically
illustrated in Figure 5.3.
Finally, we show that membership for NFA(&)s is pspacehard. Before
giving the proof, we describe some nary merge and split transitions which
can be rewritten in function of the regular binary split and merge transitions.
1. (q
1
, q
2
, mergesplit, q
′
1
, q
′
2
): States q
1
and q
2
are read, and immediately
split into states q
′
1
and q
′
2
.
5.1. Automata for RE(#, &) 65
2. (q
1
, q
2
, q
3
, mergesplit, q
′
1
, q
′
2
, q
′
3
): States q
1
, q
2
and q
3
are read, and im
mediately split into states q
′
1
, q
′
2
and q
′
3
.
3. (q
1
, split, q
′
1
, . . . , q
′
n
): State q
1
is read, and is immediately split into states
q
′
1
, . . . , q
′
n
.
4. (q
1
, . . . , q
n
, merge, q
′
1
): States q
1
, . . . , q
n
are read, and are merged into
state q
′
1
.
Transitions of type 1 (resp. 2) can be rewritten using 2 (resp. 4) regular
transitions, and 1 (resp. 3) new auxiliary states. Transitions of type 3 and 4
can be rewritten using (n − 1) regular transitions and (n − 1) new auxiliary
states. For example, the transition (q
1
, q
2
, mergesplit, q
′
1
, q
′
2
) is equal to the
transitions (q
1
, q
2
, merge, q
h
), and (q
h
, split, q
′
1
, q
′
2
), where q
h
is a new auxiliary
state.
To show that membership for NFA(&)s is pspacehard, we reduce from
corridor tiling. Given a tiling instance T = (X, H, V, b, t, n), we construct
an NFA(&) A over the empty alphabet (Σ = ∅) which accepts ε if and only if
there exists a correct corridor tiling for T.
The automaton constructs the tiling row by row. Therefore, A must at
any time reﬂect the current row in its state set. (recall that an NFA(&) can
be in more than one state at once) To do this, A contains for every tile x a set
of states x
1
, . . . , x
n
, where n is the length of each row. If A is in state x
i
, this
means that the ith tile of the current row is x. For example, if b = x
1
x
3
x
1
,
and t = x
2
x
2
x
1
, then the initial state set is {x
1
1
, x
2
3
, x
3
1
}, and A can accept
when the state set is {x
1
2
, x
2
2
, x
3
1
}.
It remains to describe how A can transform the current row (“state set”),
into a state set which describes a valid row on top of the current row. This
transformation proceeds on a tile by tile basis and begins with the ﬁrst tile,
say x
i
, in the current row which is represented by x
1
i
in the state set. Now, for
every tile x
j
, for which (x
i
, x
j
) ∈ V , we allow x
1
i
to be replaced by x
1
j
, since
x
j
can be the ﬁrst tile of the row on top of the current row. For the second
tile of the next row, we have to replace the second tile of the current row, say
x
k
, by a new tile, say x
ℓ
, such that the vertical constraints between x
k
and x
ℓ
are satisﬁed and such that the horizontal constraints between x
ℓ
and the tile
we just placed at the ﬁrst position of the ﬁrst row, x
j
, are satisﬁed as well.
The automaton proceeds in this manner for the remainder of the row. For
this, the automaton needs to know at any time at which position a tile must
be updated. Therefore, an extra set of states p
1
, . . . , p
n
is created, where the
state p
i
says that the tile at position i has to be updated. So, the state set
always consists of one state p
i
, and a number of states which represent the
current and next row. Here, the states up to position i already represent the
66 Complexity of Extended Regular Expressions
tiles of the next row, the states from position i still represent the current row,
and i is the next position to be updated.
We can now formally construct an NFA(&) A = (Q, s, f, δ) which accepts
ε if and only if there exists a correct corridor tiling for a tiling instance T =
(X, H, V, b, t, n) as follows:
• Q = {x
j
 x ∈ X, 1 ≤ j ≤ n} ∪ {p
i
 1 ≤ i ≤ n} ∪ {s, f}
• Σ = ∅
• δ is the union of the following transitions
– (s, split, p
1
, b
1
1
, . . . , b
n
n
): From the initial state the automaton imme
diately goes to the states which represent the bottom row.
– (p
1
, t
1
1
, . . . , t
n
n
, merge, f): When the state set represents a full row
(the automaton is in state p
1
), and the current row is the accepting
row, all states are merged to the accepting state.
– ∀x
i
, x
j
∈ X, (x
j
, x
i
) ∈ V : (p
1
, x
1
j
, mergesplit, p
2
, x
1
i
): When the
ﬁrst tile has to be updated, the automaton only has to check the
vertical constraints with the ﬁrst tile of the previous row.
– ∀x
i
, x
j
, x
k
∈ X, m ∈ N, 2 ≤ m ≤ n, (x
k
, x
i
) ∈ V, (x
j
, x
i
) ∈ H:
(p
m
, x
m
k
, x
m−1
j
, mergesplit, p
(m mod n)+1
, x
m
i
, x
m−1
j
): When a tile at
the mth (m = 1) position has to be updated, the automaton has
to check the vertical constraint with the mth tile at the previous
row, and the horizontal constraint with the (m − 1)th tile of the
new row.
Clearly, if there exists a correct corridor tiling for T, there exists a run
of A accepting ε. Conversely, the construction of our automaton, in which
the updates are always determined by the position p
i
, and the horizontal and
vertical constraints, assures that when there is an accepting run of A on ε,
this run simulates a correct corridor tiling for T.
5.2 Complexity of Regular Expressions
We now investigate the complexity of regular expressions, extended with count
ing and interleaving, and CHAREs.
Mayer and Stockmeyer [79] and Meyer and Stockmeyer [83] already estab
lished the expspacecompleteness of inclusion and equivalence for RE(&)
and RE(#), respectively. From Theorem 42(1) and Theorem 43(1) it then di
rectly follows that allowing both operators does not increase the complexity. It
5.2. Complexity of Regular Expressions 67
inclusion equivalence intersection
CHARE pspace [75] in pspace [102] pspace [75]
CHARE(#) EXPSPACE in EXPSPACE PSPACE
CHARE(a, a?) conp [75] in ptime [75] np [75]
CHARE(a, a
∗
) conp [75] in ptime [75] np [75]
CHARE(a, a?, a#) coNP in PTIME NP
CHARE(a, a#
>0
) in PTIME in PTIME in PTIME
Table 5.2: Overview of new and known complexity results concerning Chain
Regular Expressions. All results are completeness results, unless mentioned
otherwise. The new results are printed in bold.
further follows from Theorem 42(1) and Theorem 43(2) that intersection for
RE(#, &) is in pspace. We stress that the latter results could also have been
obtained without making use of NFA(#, &)s but by translating RE(#, &)s
directly to NFAs. However, in the case of intersection such a construction
should be done in an ontheﬂy fashion to not go beyond pspace. Although
such an approach certainly is possible, we prefer the shorter and more elegant
construction using NFA(#, &)s.
Theorem 44. 1. equivalence and inclusion for RE(#, &) are in ex
pspace; and
2. intersection for RE(#, &) is pspacecomplete.
Proof. (1) Follows directly from Theorem 42(1) and Theorem 43(1).
(2) The upper bound follows directly from Theorem 42(1) and Theorem 43(2).
The lower bound is already known for ordinary regular expressions.
Bex et al. [10] established that the far majority of regular expressions oc
curring in practical DTDs and XSDs are of a very restricted form as deﬁned
next. The class of chain regular expressions (CHAREs) are those REs con
sisting of a sequence of factors f
1
· · · f
n
where every factor is an expression of
the form (a
1
+· · · +a
n
), (a
1
+· · · +a
n
)?, (a
1
+· · · +a
n
)
+
, or, (a
1
+· · · +a
n
)
∗
,
where n ≥ 1 and every a
i
is an alphabet symbol. For instance, the expression
a(b +c)
∗
d
+
(e +f)? is a CHARE, while (ab +c)
∗
and (a
∗
+b?)
∗
are not.
1
We introduce some additional notation to deﬁne subclasses and extensions
of CHAREs. By CHARE(#) we denote the class of CHAREs where also
factors of the form (a
1
+ · · · + a
n
)
k,ℓ
, with k, ℓ ∈ N ∪ {0} are allowed. For
1
We disregard here the additional restriction used in [9] that every symbol can occur only
once.
68 Complexity of Extended Regular Expressions
the following fragments, we list the admissible types of factors. Here, a, a?,
a
∗
denote the factors (a
1
+ · · · + a
n
), (a
1
+ · · · + a
n
)?, and (a
1
+ · · · + a
n
)
∗
,
respectively, with n = 1, while a# denotes a
k,ℓ
, and a#
>0
denotes a
k,ℓ
with
k > 0.
Table 5.2 lists the new and the relevant known results. We ﬁrst show that
adding numerical occurrence constraints to CHAREs increases the complexity
of inclusion by one exponential. We reduce from expcorridor tiling.
Theorem 45. inclusion for CHARE(#) is expspacecomplete.
Proof. The expspace upper bound already follows from Theorem 44(1).
The proof for the expspace lower bound is similar to the proof for pspace
hardness of inclusion for CHAREs in [75]. The main diﬀerence is that the
numerical occurrence operator allows to compare tiles over a distance expo
nential in the size of the tiling instance.
The proof is a reduction from expcorridor tiling. A tiling instance is
a tuple T = (X, H, V, x
⊥
, x
⊤
, n) where X is a ﬁnite set of tiles, H, V ⊆ X×X
are the horizontal and vertical constraints, x
⊥
, x
⊤
∈ X, and n is a natural
number in unary notation. A correct exponential corridor tiling for T is a
mapping λ : {1, . . . , m} × {1, . . . , 2
n
} → X for some m ∈ N such that the
following constraints are satisﬁed:
• the ﬁrst tile of the ﬁrst row is x
⊥
: λ(1, 1) = x
⊥
;
• the ﬁrst tile of the mth row is x
⊤
: λ(m, 1) = x
⊤
;
• all vertical constraints are satisﬁed: ∀i < m, ∀j ≤ 2
n
, (λ(i, j), λ(i +
1, j)) ∈ V ; and,
• all horizontal constraints are satisﬁed: ∀i ≤ m, ∀j < 2
n
, (λ(i, j), λ(i, j +
1)) ∈ H.
The expcorridor tiling problem asks, given a tiling instance, whether
there exists a correct exponential corridor tiling. The latter problem is easily
shown to be expspacecomplete [104].
We proceed with the reduction from expcorridor tiling. Thereto, let
T = (X, H, V, x
⊥
, x
⊤
, n) be a tiling instance. Without loss of generality, we
assume that n ≥ 2. We construct two CHARE(#) expressions r
1
and r
2
such
that
L(r
1
) ⊆ L(r
2
) if and only if
there exists no correct exponential corridor tiling for T.
5.2. Complexity of Regular Expressions 69
As expspace is closed under complement, the expspacehardness of inclu
sion for CHARE(#) follows.
Set Σ = X ⊎ {$, △}. For ease of exposition, we denote X ∪ {△} by X
△
and X∪{△, $} by X
△,$
. We encode candidates for a correct tiling by a string
in which the rows are separated by the symbol △, that is, by strings of the
form
△R
0
△R
1
△· · · △R
m
△, (†)
in which each R
i
represents a row, that is, belongs to X
2
n
. Moreover, R
0
is
the bottom row and R
m
is the top row. The following regular expressions
detect strings of this form which do not encode a correct tiling for T:
• X
∗
△
△X
0,2
n
−1
△X
∗
△
. This expression detects rows that are too short,
that is, contain less than 2
n
symbols.
• X
∗
△
△X
2
n
+1,2
n
+1
X
∗
△X
∗
△
. This expression detects rows that are too
long, that is, contain more than 2
n
symbols.
• X
∗
△
x
1
x
2
X
∗
△
, for every x
1
, x
2
∈ X, (x
1
, x
2
) ∈ H. These expressions
detect all violations of horizontal constraints.
• X
∗
△
x
1
X
2
n
,2
n
△
x
2
X
∗
△
, for every x
1
, x
2
∈ X, (x
1
, x
2
) ∈ V . These expressions
detect all violations of vertical constraints.
Let e
1
, . . . , e
k
be an enumeration of the above expressions. Notice that k =
O(X
2
). It is straightforward that a string w in (†) does not match
¸
k
i=1
e
i
if
and only if w encodes a correct tiling.
Let e = e
1
· · · e
k
. Because of leading and trailing X
∗
△
expressions, L(e) ⊆
L(e
i
), for every i ∈ {1, . . . , k}. We are now ready to deﬁne r
1
and r
2
:
r
1
=
k times e
. .. .
$e$e$ · · · $e$ △x
⊥
X
2
n
−1,2
n
−1
△X
∗
△
△x
⊤
X
2
n
−1,2
n
−1
△
k times e
. .. .
$e$e$ · · · $e$
r
2
= $X
∗
△,$
$e
1
$e
2
$ · · · $e
k
$X
∗
△,$
$
Notice that both r
1
and r
2
are in CHARE(#) and can be constructed in
polynomial time. It remains to show that L(r
1
) ⊆ L(r
2
) if and only if there is
no correct tiling for T.
We ﬁrst show the implication from left to right. Thereto, let L(r
1
) ⊆ L(r
2
).
Let uwu
′
be an arbitrary string in L(r
1
) such that u, u
′
∈ L($e$e$ · · · $e$) and
w ∈ △x
⊥
X
2
n
−1,2
n
−1
△X
∗
△
△x
⊤
X
2
n
−1,2
n
−1
△. By assumption, uwu
′
∈ L(r
2
).
Notice that uwu
′
contains 2k + 2 times the symbol “$”. Moreover, the
ﬁrst and the last “$” of uwu
′
is always matched onto the ﬁrst and last “$”
of r
2
. This means that k + 1 consecutive $symbols of the remaining 2k $
symbols in uwu
′
must be matched onto the $symbols in $e
1
$e
2
$ · · · $e
k
$.
70 Complexity of Extended Regular Expressions
Hence, w is matched onto some e
i
. So, w does not encode a correct tiling.
As the subexpression △x
⊥
X
2
n
−1,2
n
−1
△X
∗
△
△x
⊤
X
2
n
−1,2
n
−1
△ of r
1
deﬁnes all
candidate tilings, the system T has no solution.
To show the implication from right to left, assume that there is a string
uwu
′
∈ L(r
1
) that is not in r
2
, where u, u
′
∈ L($e$e$ · · · $e$). Then w ∈
¸
k
i=1
L(e
i
) and, hence, w encodes a correct tiling.
Adding numerical occurrence constraints to the fragment CHARE(a, a?)
keeps equivalence in ptime, intersection in np and inclusion in conp.
Theorem 46. (1) equivalence for CHARE(a, a?, a#) is in ptime.
(2) inclusion for CHARE(a, a?, a#) is conpcomplete.
2
(3) intersection for CHARE(a, a?, a#) is npcomplete.
Proof. (1) It is shown in [75] that two CHARE(a, a?) expressions are equivalent
if and only if they have the same sequence normal form (which is deﬁned be
low). As a
k,ℓ
is equivalent to a
k
(a?)
ℓ−k
, it follows that two CHARE(a, a?, a#)
expressions are equivalent if and only if they have the same sequence normal
form. It remains to argue that the sequence normal form of CHARE(a, a?, a#)
expressions can be computed in polynomial time. To this end, let r = f
1
· · · f
n
be a CHARE(a, a?, a#) expression with factors f
1
, . . . , f
n
. The sequence nor
mal form is then obtained in the following way. First, we replace every factor
of the form
• a by a[1, 1];
• a? by a[0, 1]; and,
• a
k,ℓ
by a[k, ℓ],
where a is an alphabet symbol. We call a the base symbol of the factor a[i, j].
Then, we replace successive subexpressions a[i
1
, j
1
] and a[i
2
, j
2
] carrying the
same base symbol a by a[i
1
+i
2
, j
1
+j
2
] until no further replacements can be
made anymore. For instance, the sequence normal form of aa?a
2,5
a?bb?b?b
1,7
is a[3, 8]b[2, 10]. Obviously, the above algorithm to compute the sequence nor
mal form of CHARE(a, a?, a#) expressions can be implemented in polynomial
time. It can then be tested in linear time whether two sequence normal forms
are the same.
(2) conphardness is immediate since inclusion is already conpcomplete for
CHARE(a, a?) expressions [75].
2
In the extended abstract presented at ICDT’07, in which these results ﬁrst appeared,
the complexity was wrongly attributed to lie between pspace and expspace.
5.2. Complexity of Regular Expressions 71
We show that the problem remains in conp. To this end, we represent
strings w by their sequence normal form as discussed above, where we take each
string w as the regular expression deﬁning w. We call such strings compressed.
Let r
1
and r
2
be two CHARE(a, a?, a#)s. We can assume that they are in
sequence normal form.
To show that L(r
1
) ⊆ L(r
2
), we guess a compressed string w of polynomial
size for which w ∈ L(r
1
), but w / ∈ L(r
2
). We guess w ∈ L(r
1
) in the following
manner. We iterate from left to right over the factors of r
1
. For each factor
a[k, ℓ] we guess an h such that k ≤ h ≤ ℓ, and add a
h
to the compressed
string w. This algorithm gives a compressed string of polynomial size which
is deﬁned by r
1
. Furthermore, this algorithm is capable of guessing every
possible string deﬁned by r
1
. It is however possible that in the compressed
string there are two consecutive elements a
i
, a
j
with the same base symbol
a. If this is the case we merge these elements to a
i+j
which gives a proper
compressed string.
The following lemma shows that testing w / ∈ L(r
2
) can be done in polyno
mial time.
Lemma 47. Given a compressed string w and an expression r in sequence
normal form, deciding whether w ∈ L(r) is in ptime.
Proof. Let w = a
p
1
1
· · · a
p
n
n
, and r = b
1
[k
1
, ℓ
1
] · · · b
m
[k
m
, ℓ
m
]. Denote b
i
[k
i
, ℓ
i
]
by f
i
. For every position i of w (0 < i ≤ n), we deﬁne C
i
as a set of factors
b[k, ℓ] of r. Formally, f
j
∈ C
i
when a
p
1
1
· · · a
p
i−1
i−1
∈ L(f
1
· · · f
j−1
) and a
i
= b
j
.
We compute the C
i
as follows.
• C
1
is the set of all b
j
[k
j
, ℓ
j
] such that a
1
= b
j
, and ∀h < j, k
h
= 0. These
are all the factors of r which can match the ﬁrst symbol of w.
• Then, for all i ∈ {2, . . . , n}, we compute C
i
from C
i−1
. In particular,
f
h
= b
h
[k
h
, ℓ
h
] ∈ C
i
when there is a f
j
= b
j
[k
j
, ℓ
j
] ∈ C
i−1
such that
a
p
i−1
i−1
∈ f
j
· · · f
h−1
and a
i
= b
h
. That is, the following conditions should
hold:
– j < h: f
h
occurs after f
j
in r.
– b
h
= a
i
: f
h
can match the ﬁrst symbol of a
p
i
i
.
– ∀e ∈ {j, . . . , h −1}, if b
e
= a
i−1
then k
e
= 0: in between factors f
j
and f
h
it is possible to match only symbols a
i−1
.
– Let min =
¸
e∈{j,...,h−1},b
e
=a
i−1
k
e
and max =
¸
e∈{j,...,h−1},b
e
=a
i−1
ℓ
e
.
Then min ≤ p
i−1
≤ max. That is, p
i−1
symbols a
i−1
should be
matched from f
j
to f
h−1
.
72 Complexity of Extended Regular Expressions
Then, w ∈ L(r) if and only if there is an f
j
∈ C
n
such that a
p
n
n
∈
L(f
j
· · · f
n
). As the latter test and the computation of C
1
, . . . , C
n
can be
done in polynomial time, the lemma follows.
(3) nphardness is immediate since intersection is already npcomplete for
CHARE(a, a?) expressions [75].
We show that the problem remains in np. As in the proof of Theo
rem 46(2) we represent a string w as a compressed string. Let r
1
, . . . , r
n
be CHARE(a, a?, a#) expressions.
Lemma 48. If
¸
n
i=1
L(r
i
) = ∅, then there exists a string w = a
p
1
1
· · · a
p
m
m
∈
¸
n
i=1
L(r
i
) such that m ≤ min{r
i
  i ∈ {1, . . . , n}} and, for each i ∈
{1, . . . , n}, j
i
is not larger than the largest integer occurring in r
1
, . . . , r
n
.
Proof. Suppose that there exists a string w = a
p
1
1
· · · a
p
m
m
∈
¸
n
i=1
L(r
i
), with
a
i
= a
i+1
for every i ∈ {1, . . . , m−1}. Since w is matched by every expression
r
1
, . . . , r
n
, and since a factor of a CHARE(a, a?, a#) expression can never
match a strict superstring of a
p
i
i
for i ∈ {1, . . . , n}, we have that m ≤ min{r
i
 
i ∈ {1, . . . , n}}.
Furthermore, since w is matched by every expression r
1
, . . . , r
n
, no j
i
can
be larger than the largest integer occurring in r
1
, . . . , r
n
.
The np algorithm then consists of guessing a compressed string w of poly
nomial size and verifying whether w ∈
¸
n
i=1
L(r
i
). If we represent r
1
, . . . , r
n
by
their sequence normal form, this veriﬁcation step can be done in polynomial
time by Lemma 47.
Finally, we exhibit a tractable subclass with numerical occurrence con
straints:
Theorem 49. inclusion, equivalence, and intersection for CHARE(a,
a#
>0
) are in ptime.
Proof. The upper bound for equivalence is immediate from Theorem 46(2).
For inclusion, let r
1
and r
2
be two CHARE(a, a#
>0
)s in sequence normal
form. (as deﬁned in the proof of Theorem 46) Let r
1
= a
1
[k
1
, ℓ
1
] · · · a
n
[k
n
, ℓ
n
]
and r
2
= a
′
1
[k
′
1
, ℓ
′
1
] · · · a
′
n
′
[k
′
n
′
, ℓ
′
n
′
]. Notice that every number k
i
and k
′
j
is
greater than zero. We claim that L(r
1
) ⊆ L(r
2
) if and only if n = n
′
and for
every i ∈ {1, . . . , n}, a
i
= a
′
i
, k
i
≥ k
′
i
, and ℓ
i
≤ ℓ
′
i
.
Indeed, if n = n
′
, or if there exists an i such that a
i
= a
′
i
or k
i
< k
′
i
,
then a
k
1
1
· · · a
k
n
n
∈ L(r
1
) \ L(r
2
). If there exists an i such that ℓ
i
> ℓ
′
i
, then
a
ℓ
1
1
· · · a
ℓ
n
n
∈ L(r
1
) \ L(r
2
). Conversely, it is immediate that every string in
5.2. Complexity of Regular Expressions 73
L(r
1
) is also in L(r
2
). It is straightforward to test these conditions in linear
time.
For intersection, let, for every i ∈ {1, . . . , n}, r
i
= a
i,1
[k
i,1
, ℓ
i,1
] · · ·
a
i,m
i
[k
i,m
i
, ℓ
i,m
i
] be a CHARE(a, a#
>0
) in sequence normal form. Notice that
every number k
i,j
is greater than zero. We claim that
¸
n
i=1
L(r
i
) = ∅ if and
only if
(i) m
1
= m
2
= · · · = m
n
;
(ii) for every i, j ∈ {1, . . . , n} and x ∈ {1, . . . , m
1
}, a
i,x
= a
j,x
; and,
(iii) for every x ∈ {1, . . . , m
1
}, max{k
i,x
 1 ≤ i ≤ n} ≤ min{ℓ
i,x
 1 ≤ i ≤ n}.
Indeed, if the above conditions hold, we have that a
K
1
1,1
· · · a
K
m
1
1,m
1
is in
¸
n
i=1
L(r
i
),
where K
x
= max{k
i,x
 1 ≤ i ≤ n} for every x ∈ {1, . . . , m
1
}. If m
i
= m
j
for some i, j ∈ {1, . . . , n}, then the intersection between r
i
and r
j
is empty.
So assume that condition (i) holds. If a
i,x
= a
j,x
for some i, j ∈ {1, . . . , n}
and x ∈ {1, . . . , m
1
}, then we also have that the intersection between r
i
and
r
j
is empty. Finally, if condition (iii) does not hold, take i, j, and x such that
k
i,x
= max{k
i,x
 1 ≤ i ≤ n} and ℓ
j,x
= min{ℓ
i,x
 1 ≤ i ≤ n}. Then the
intersection between r
i
and r
j
is empty.
Finally, testing conditions (i)–(iii) can be done in linear time.
6
Deterministic Regular
Expressions with Counting
As mentioned before, XML Schema expands the vocabulary of regular expres
sion by a counting operator, but restricts them by requiring the expressions
to be deterministic. Although there is a notion of determinism which has be
come the standard in the context of standard regular expressions, there are in
fact two slightly diﬀerent notions of determinism. The most common notion,
weak determinism (also called oneunambiguity [17]), intuitively requires that,
when matching a string from left to right against an expression, it is always
clear against which position in the expression the next symbol in the string
must be matched. For example, the expression (a +b)
∗
a is not weakly deter
ministic, because it is not clear in advance to which a in the expression the
ﬁrst a of the string aaa must be matched. The reason is that, without look
ing ahead, we do not know whether the current a we read is the last symbol
of the string or not. On the other hand, the equivalent expression b
∗
a(b
∗
a)
∗
is weakly deterministic: the ﬁrst a we encounter must be matched against
the leftmost a in the expression, and all other a’s against the rightmost one
(similarly for the b’s). Strong determinism restricts regular expressions even
further. Intuitively, it requires additionally that it is also clear how to go from
one position to the next. For example, (a
∗
)
∗
is weakly deterministic, but not
strongly deterministic since it is not clear over which star one should iterate
when going from one a to the next.
Although the latter example illustrates the diﬀerence between the notions
76 Deterministic Regular Expressions with Counting
of weak and strong determinism, they in fact almost coincide for standard
regular expressions. Indeed, Br¨ uggemannKlein [15] has shown that any weak
deterministic expression can be translated into an equivalent strongly deter
ministic one in linear time.
1
However, this situation changes completely when
counting is involved. First, the algorithm for deciding whether an expression
is weakly deterministic is nontrivial [66]. For instance, (a
2,3
+b)
2,2
b is weakly
deterministic, but the very similar (a
2,3
+ b)
3,3
b is not. So, the amount of
nondeterminism introduced depends on the concrete values of the counters.
Second, as we will show, weakly deterministic expressions with counting are
strictly more expressive than strongly deterministic ones. Therefore, the aim
of this chapter is an indepth study of the notions of weak and strong determin
ism in the presence of counting with respect to expressiveness, succinctness,
and complexity.
We ﬁrst give a complete overview of the expressive power of the diﬀer
ent classes of deterministic expressions with counting. We show that strongly
deterministic expressions with counting are equally expressive as standard de
terministic expressions. Weakly deterministic expressions with counting, on
the other hand, are more expressive than strongly deterministic ones, except
for unary languages, on which they coincide. However, not all regular lan
guages are deﬁnable by weakly deterministic expressions with counting.
Then, we investigate the diﬀerence in succinctness between strongly and
weakly deterministic expressions with counting, and show that weakly deter
ministic expressions can be exponentially more succinct than strongly deter
ministic ones. This result prohibits an eﬃcient algorithm translating a weakly
deterministic expression into an equivalent strongly deterministic one, if such
an expression exists. This contrasts with the situation of standard expressions
where such a linear time algorithm exists [15].
We also present an automaton model extended with counters, counter
NFAs (CNFAs), and investigate the complexity of some related problems.
For instance, it is shown that boolean operations can be applied eﬃciently
to CDFAs, the deterministic counterpart of CNFAs. BruggemannKlein [15]
has shown that the Glushkov construction, translating regular expressions into
NFAs, yields a DFA if and only if the original expression is deterministic. We
investigate the natural extension of the Glushkov construction to expressions
with counters, converting expressions to CNFAs. We show that the resulting
automaton is deterministic if and only if the original expression is strongly de
terministic. Combining the results on CDFAs with the latter result then also
allows to infer better upper bounds on the inclusion and equivalence problem
1
In fact, Br¨ uggemannKlein did not study strong determinism explicitly. However, she
gives a procedure to transform expressions into star normal form which rewrites weakly
determinisistic expressions into equivalent strongly deterministic ones in linear time.
77
of strongly deterministic expressions with counting. Further, we show that
testing whether an expression with counting is strongly deterministic can be
done in cubic time, as is the case for weak determinism [66].
It should be said that XML Schema uses weakly deterministic expressions
with counting. However, it is also noted by SperbergMcQueen [101], one of
its developers, that
“Given the complications which arise from [weakly deterministic
expressions], it might be desirable to also require that they be
strongly deterministic as well [in XML Schema].”
The design decision for weak determinism is probably inspired by the fact
that it is the natural extension of the notion of determinism for standard
expressions, and a lack of a detailed analysis of their diﬀerences when counting
is allowed. A detailed examination of strong and weak determinism of regular
expressions with counting intends to ﬁll this gap.
Related work. Apart from the work already mentioned, there are several
automata based models for diﬀerent classes of expressions with counting with
as main application XML Schema validation, by Kilpelainen and Tuhkanen
[65], Zilio and Lugiez [26], and SperbergMcQueen [101]. Here, Sperberg
McQueen introduces the extension of the Glushkov construction which we
study in Section 6.5. We introduce a new automata model in Section 6.4 as
none of these models allow to derive all results in Sections 6.4 and 6.5. Further,
SperbergMcQueen [101] and Koch and Scherzinger [68] introduce a (slightly
diﬀerent) notion of strongly deterministic expression with and without count
ing, respectively. We follow the semantical meaning of SperbergMcQueen’s
deﬁnition, while using the technical approach of Koch and Scherzinger. Fi
nally, Kilpelainen [63] shows that inclusion for weakly deterministic expres
sions with counting is coNPhard; and Colazzo, Ghelli, and Sartiani [24] have
investigated the inclusion problem involving subclasses of deterministic ex
pressions with counting.
Concerning deterministic languages without counting, the seminal paper
is by BruggemannKlein and Wood [17] where, in particular, it is shown to be
decidable whether a language is deﬁnable by a deterministic regular expression.
Conversely, general regular expressions with counting have also received quite
some attention [40, 37, 64, 83].
Outline. In Section 6.1, we provide some additional deﬁnitions. In Sec
tion 6.2 we study the expressive power of weak and strong deterministic
expressions with counting, and in Section 6.3 their relative succinctness. In
Section 6.4, we deﬁne CNFA, and in Section 6.5 we study the Glushkov
78 Deterministic Regular Expressions with Counting
construction translating RE(#) to CNFA. In Sections 6.6 and 6.7 we con
sider the complexity of testing whether an expression with counting is strongly
deterministic and decision problems for deterministic RE(#). We conclude in
Section 6.8.
6.1 Preliminaries
We introduce some additional notation concerning (deterministic) RE(#) ex
pressions. An RE(#) expression r is nullable if ε ∈ L(r). We say that an
RE(#) r is in normal form if for every nullable subexpression s
k,l
of r we have
k = 0. Any RE(#) can easily be normalized in linear time. Therefore, we
assume that all expressions used in this chapter are in normal form. Some
times we will use the following observation, which follows directly from the
deﬁnitions:
Remark 50. A subexpression r
k,ℓ
is nullable if and only if k = 0.
For a regular expression r, we say that a subexpression of r of the form s
k,ℓ
is an iterator of r. Let r and s be expressions such that s is a subexpression
of r. Then we say that s is a factor of r, whenever ﬁrst(s) ⊆ ﬁrst(r) and
last(s) ⊆ last(r). When r and s are iterators, this intuitively means that
a number of iterations of s are suﬃcient to satisfy r. For instance, in the
expression s = (a
0,3
1
b
1,2
1
)
3,4
, b
1,2
1
is a factor of s, but a
0,3
1
is not.
For two symbols x, y of a marked expression r, we denote by lca(x, y) the
smallest subexpression of r containing both x and y. Further, we write s r
when s is a subexpression of r and s ≺ r when s r and s = r. For an
iterator s
k,ℓ
, let base(s
k,ℓ
) := s, lower(s
k,ℓ
) := k, and upper(s
k,ℓ
) := ℓ. We say
that s
k,ℓ
is bounded when ℓ ∈ N, otherwise it is unbounded.
Weak determinism. For completeness, we deﬁne here the notion of weak
determinism. This notion, however, is exactly the notion of determinism as
deﬁned in Section 2.2. Recall that for an expression r, r denotes a marking of
r, and Char(r) denotes the symbols occurring in r.
Deﬁnition 51. An RE(#) expression r is weakly deterministic if, for all
strings u, v, w ∈ Char(r)
∗
and all symbols a, b ∈ Char(r), the conditions
uav, ubw ∈ L(r) and a = b imply that a = b.
A regular language is weakly deterministic with counting if it is deﬁned by
some weakly deterministic RE(#) expression. The classes of all weakly deter
ministic languages with counting, respectively, without counting, are denoted
by DRE
#
W
, respectively, DRE
W
.
6.1. Preliminaries 79
·
·
0, 3
0, 1
a
+
0, ∞
b
0, ∞
c
d
1
2 3 4
Figure 6.1: Parse tree of (a
0,1
)
0,3
(b
0,∞
+c
0,∞
)d. Counter nodes are numbered
from 1 to 4.
Intuitively, an expression is weakly deterministic if, when matching a string
against the expression from left to right, we always know against which symbol
in the expression we must match the next symbol, without looking ahead in the
string. For instance, (a +b)
∗
a and (a
2,3
+b)
3,3
b are not weakly deterministic,
while b
∗
a(b
∗
a)
∗
and (a
2,3
+b)
2,2
b are.
Strong determinism. Intuitively, an expression is weakly deterministic if,
when matching a string from left to right, we always know where we are in
the expression. For a strongly deterministic expression, we will additionally
require that we always know how to go from one position to the next. Thereto,
we distinguish between going forward in an expression and backward by iter
ating over a counter. For instance, in the expression (ab)
1,2
going from a to b
implies going forward, whereas going from b to a iterates backward over the
counter.
Therefore, an expression such as ((a + ε)(b + ε))
1,2
will not be strongly
deterministic, although it is weakly deterministic. Indeed, when matching ab,
we can go from a to b by either going forward or by iterating over the counter.
By the same token, also (a
1,2
)
3,4
is not strongly deterministic, as we have a
choice of counters over which to iterate when reading multiple a’s. Conversely,
(a
2,2
)
3,4
is strongly deterministic as it is always clear over which counter we
must iterate.
For the deﬁnition of strong determinism, we follow the semantic meaning of
the deﬁnition by SperbergMcQueen [101], while using the formal approach of
Koch and Scherzinger [68] (who called the notion strong oneunambiguity)
2
.
We denote the parse tree of an RE(#) expression r by pt(r). Figure 6.1
contains the parse tree of the expression (a
0,1
)
0,3
(b
0,∞
+c
0,∞
)d.
A bracketing of a regular expression r is a labeling of the counter nodes of
2
The diﬀerence with Koch and Scherzinger is that we allow diﬀerent derivations of ε while
they forbid this. For instance, a
∗
+ b
∗
is strongly deterministic in our deﬁnition, but not in
theirs, as ε can be matched by both a
∗
and b
∗
.
80 Deterministic Regular Expressions with Counting
pt(r) by distinct indices. Concretely, we simply number the nodes according to
the depthﬁrst lefttoright ordering. The bracketing ¯ r of r is then obtained by
replacing each subexpression s
k,ℓ
of r with index i with ([
i
s]
i
)
k,ℓ
. Therefore, a
bracketed regular expression is a regular expression over alphabet Σ⊎Γ, where
Γ := {[
i
, ]
i
 i ∈ N}. For example, ([
1
([
2
a]
2
)
0,1
]
1
)
0,3
(([
3
b]
3
)
0,∞
+ ([
4
c]
4
)
0,∞
)d is
a bracketing of (a
0,1
)
0,3
(b
0,∞
+ c
0,∞
)d, for which the parse tree is shown in
Figure 6.1. We say that a string w in Σ ⊎ Γ is correctly bracketed if w has no
substring of the form [
i
]
i
. That is, we do not allow a derivation of ε in the
derivation tree.
Deﬁnition 52. A regular expression r is strongly deterministic with counting
if r is weakly deterministic and there do not exist strings u, v, w over Σ ∪ Γ,
strings α = β over Γ, and a symbol a ∈ Σ such that uαav and uβaw are both
correctly bracketed and in L(¯ r).
A standard regular expression (without counting) is strongly deterministic
if the expression obtained by replacing each subexpression of the form r
∗
with
r
0,∞
is strongly deterministic with counting. The class DRE
#
S
, respectively,
DRE
S
, denotes all languages deﬁnable by a strongly deterministic expressions
with, respectively, without, counting.
6.2 Expressive Power
Br¨ uggemannKlein and Wood [17] proved that, for any alphabet Σ, DRE
W
forms a strict subclass of the regular languages, denoted REG(Σ). The com
plete picture of the relative expressive power of the diﬀerent classes depends
on the size of Σ, as shown in Figure 6.2.
Figure 6.2: An overview of the expressive power of diﬀerent classes of deter
ministic regular languages, depending on the alphabet size. Numbers refer to
the theorems proving the (in)equalities.
DRE
S
[15]
= DRE
W
54
= DRE
#
S
55
= DRE
#
W
60
REG(Σ) (if Σ = 1)
DRE
S
[15]
= DRE
W
54
= DRE
#
S
59
DRE
#
W
60
REG(Σ) (if Σ ≥ 2)
The equality DRE
S
= DRE
W
is already implicit in the work of Br¨ uggemann
Klein [15].
3
By this result and by deﬁnition, all inclusions from left to right
3
Strong determinism was not explicitly considered in [15]. However, the paper gives a
transformation of weakly deterministic regular expressions into equivalent expressions in star
normal form; and every expression in star normal form is strongly deterministic.
6.2. Expressive Power 81
already hold. The rest of this section is devoted to proving the other equal
ities and inequilaties in Theorems 54, 55, 59, and 60, while the intermediate
lemmas are only used in proving the former theorems.
We need a small lemma to prepare for the equality between DRE
S
and
DRE
#
S
.
Lemma 53. Let r
1
r
2
be a factor of r
k,ℓ
with r
1
, r
2
nullable, L(r
1
) = {ε},
L(r
2
) = {ε}, and ℓ ≥ 2. Then r
k,ℓ
is not strongly deterministic.
Proof. If r
k,ℓ
is not weakly deterministic, the lemma already holds. Therefore,
assume that r
k,ℓ
is weakly deterministic. Let r
k,ℓ
be a marked version of r
k,ℓ
.
Let
¯
s := [
1
¯
r]
k,ℓ
1
denote the bracketed version of r
k,ℓ
. Denote by
¯
r
1
and
¯
r
2
the
subexpressions of
¯
s that correspond to r
1
and r
2
.
Let
¯
u (respectively,
¯
v) be a nonempty string in L(
¯
r
1
) (respectively, L(
¯
r
2
)).
These strings exist as L(r
1
) and L(r
2
) are not equal to {ε}. Let
¯
w = [
1
¯
u
¯
v]
1
.
Deﬁne x := max{0, k − 2}. Then, both [
1
¯
u]
1
[
1
¯
v]
1
¯
w
x
and [
1
¯
u
¯
v]
1
¯
w
x+1
are in
L(
¯
s) and thereby show that r
k,ℓ
is not strongly deterministic.
Theorem 54. Let Σ be an arbitrary alphabet. Then,
DRE
S
= DRE
#
S
.
Proof. Let r be an arbitrary strongly deterministic regular expression with
counting. We can assume without loss of generality that no subexpressions of
the form s
1,1
or ε occur. Indeed, iteratively replacing any subexpression of the
form s
1,1
, sε or εs by s; s+ε or ε+s by s
0,1
and ε
k,ℓ
by ε, yields an expression
which is still strongly deterministic and which is either equal to ε or does not
contain s
1,1
or ε. As in the former case we are done, we can hence assume the
latter.
We recursively transform r into a strongly deterministic expression EC(r)
without counting such that L(r) = L(EC(r)), using the rules below. In these
rules, EC stands for “eliminate counter” and ECE for “eliminate counter and
epsilon”:
(a) for all a ∈ Σ, EC(a) := a
(b) EC(r
1
+r
2
) := EC(r
1
) + EC(r
2
)
(c) EC(r
1
r
2
) := EC(r
1
)EC(r
2
)
(d) if r nullable then EC(r
0,1
) := EC(r)
(e) if r not nullable then EC(r
0,1
) := ECE(r) +ε
(f) EC(r
k,∞
) := ECE(r) · · · ECE(r) · ECE(r)
∗
82 Deterministic Regular Expressions with Counting
(g) if ℓ ∈ N\{1} then EC(r
k,ℓ
) := ECE(r) · · · ECE(r)·
ECE(r)(ECE(r)(· · · )+
ε) +ε
In the last two rules, ECE(r) · · · ECE(r) denotes a kfold concatenation of
ECE(r), and the recursion in
ECE(r)(ECE(r)(· · · ) + ε) + ε
contains ℓ − k
occurrences of r.
(a’) for all a ∈ Σ, ECE(a) := a
(b’) ECE(r
1
+r
2
) := ECE(r
1
) + ECE(r
2
)
(c’) ECE(r
1
r
2
) := EC(r
1
)EC(r
2
)
(d’) if k = 0, then ECE(r
k,ℓ
) := EC(r
k,ℓ
)
(e’) ECE(r
0,1
) := ECE(r)
(f’) ECE(r
0,∞
) := ECE(r)ECE(r)
∗
(g’) if k = 0 and ℓ ∈ N \ {1}, then ECE(r
k,ℓ
) := ECE(r)
ECE(r)(· · · ) +ε
Similarly as above, the recursion
ECE(r)(· · · )+ε
contains ℓ−1 occurrences of
r. We prove that L(r) = L(EC(r)) and that EC(r) is a strongly deterministic
regular expression.
To show L(r) = L(EC(r)), we prove by simultaneous induction on the
application of the rules (a)–(g) and (a’)–(g’) that L(EC(r)) = L(r), for any
strongly deterministic expression r, and that L(ECE(r)) = L(r) \ {ε}, for all
expressions r to which ECE(r) can be applied. The latter is important as
ECE(r) does not equal L(r) \ {ε} for all r. For instance, ECE(a
0,1
b
0,1
) =
(a + ε)(b + ε), and L(a
0,1
b
0,1
) = L((a + ε)(b + ε)). However, as EC is always
applied to a strongly deterministic expression, we will see that ECE will never
be applied to such a subexpression. This is why we need Lemma 53.
The base cases of the induction are (a) EC(a) := a and (a’) ECE(a) :=
a. These cases are clear. For the induction, the cases (b)–(e) are trivial.
Correctness for cases (f) and (g) is immediate from Remark 50. Case (b’) is
also trivial.
The ﬁrst nontrivial case is (c’). If ε / ∈ L(r
1
r
2
) then (c’) is clearly correct.
Towards a contradiction, suppose that ε ∈ L(r
1
r
2
). This implies that r
1
and r
2
are nullable. Notice that we only apply rule (c’) when rewriting r if either (i)
r
1
r
2
is a factor of some s
0,1
with s not nullable (case (e)), or (ii) r
1
r
2
is a factor
of a subexpression of the form s
k,ℓ
with ℓ ≥ 2 (cases (f,g)). Here, r
1
r
2
must
in both cases be a factor since, whenever ECE is applied to a subexpression,
this subexpression is a factor; and the factor relation is clearly transitive. The
reason that there must always be such a superexpression to which case (e), (f),
6.2. Expressive Power 83
or (g) is applied is that the recursive process starts by applying EC. Therefore,
we must have applied (e), (f), or (g) to some superexpression of which r
1
r
2
is
a factor before applying ECE to r
1
r
2
.
In case (i), r
1
and r
2
nullable implies that r in rule (e) must be nullable
(because r
1
r
2
is a factor of r), which contradicts the precondition of rule
(e). In case (ii), Lemma 53 claims that s
k,ℓ
and therefore r is not strongly
deterministic, which is also a contradiction.
By Remark 50, (d’) is also correct. Cases (e’)–(g’) are also trivial. This
concludes the proof that L(EC(r)) = L(r).
We next prove that EC(r) and ECE(r) are strongly deterministic regular
expression, whenever r is strongly deterministic. We prove this by induction
on the reversed order of application of the rewrite rules. The induction base
rules are (a) and (a’), which are immediate. Furtermore, rules (b)–(e) and
(b’)–(e’) are also immediate. For instance, for rule (b) we know by induction
that EC(r
1
) and EC(r
2
) are strongly deterministic. As L(r
1
) = L(EC(r
1
)),
L(r
2
) = L(EC(r
2
)), and r
1
+ r
2
is strongly deterministic, it follows that also
EC(r
1
) + EC(r
2
) is strongly deterministic.
Cases (f), (g), (f’), and (g’) are more involved. We only investigate case (f)
because these four cases are all very similar. Let EC(r
k,∞
) denote a marked
version of EC(r
k,∞
) and let EC(r
k,∞
) = ECE(r)
1
· · · ECE(r)
k
ECE(r)
∗
k+1
. Fur
ther, let r
k,∞
be a marking of r
k,∞
and let f : Char(EC(r
k,∞
)) → Char(r
k,∞
)
be the natural mapping associating each symbol in EC(r
k,∞
) to its correspond
ing symbol in Char(r
k,∞
). For instance, when r = (aba)
1,∞
, then EC(r) =
(aba)(aba)
∗
, r = (a
1
b
1
a
2
)
1,∞
, EC(r) = (a
1
b
1
a
2
)(a
3
b
2
a
4
)
∗
, and f(a
1
) = a
1
,
f(b
1
) = b
1
, f(a
2
) = a
2
, f(a
3
) = a
1
, f(b
2
) = b
1
, and f(a
4
) = a
2
. By abuse of
notation, we also let f denote its natural extension mapping strings to strings.
Now, assume, towards a contradiction, that EC(r
k,∞
) is not strongly de
terministic, and assume ﬁrst that this is due to the fact that EC(r
k,∞
) is not
weakly deterministic. Hence, there exist marked strings u, v, w and marked
symbols x and y such that uxv and uyw in L(EC(r
k,∞
)) with x = y and
dm(x) = dm(y). We distinguish two cases. First, assume f(x) = f(y). Then,
as both f(u)f(x)f(v) and f(u)f(x)f(w) are in L(r
k,∞
) we immediately obtain a
contradiction with the weak determinism, and thus also the strong determin
ism, of r
k,∞
. Second, assume f(x) = f(y). Then, x and y cannot occur in the
same ECE(r)
i
, for any i ∈ [1, k + 1]. Indeed, otherwise ECE(r) would not be
weakly deterministic, contradicting the induction hypothesis. Hence we can
assume x ∈ Char(ECE(r)
i
) and y ∈ Char(ECE(r)
j
), with 1 ≤ i < j ≤ k + 1.
But then, consider the strings w
1
= dm(uxv) and w
2
= dm(uyw) and notice
that dm(ux) = dm(uy). Let [
m
be the opening bracket corresponding to the
iterator r
k,∞
in the bracketed version of r
k,∞
. Then, we can construct two
84 Deterministic Regular Expressions with Counting
correctly bracketed words deﬁned by the bracketing of r
k,∞
by adding brack
ets to w
1
and w
2
such that (i) in the preﬁx dm(ux) of w
1
, i [
m
brackets occur
(indicating i iterations of the outermost iterator) and (ii) in the preﬁx dm(uy)
of w
2
, j [
m
brackets occur. But, as dm(ux) = dm(uy) this implies that r
k,∞
is not strongly deterministic, a contradiction.
Hence, if EC(r
k,∞
) is not strongly deterministic, this is due to the failure
of the second reason in Deﬁnition 52. It is easily seen that, as this reason
concerns the iterators, and all subexpressions are strongly deterministic due
to the induction hypothesis, it must be the case that ECE(r)
∗
is not strongly
deterministic. Let [
m
1
be the opening bracket corresponding to the iterator
ECE(r)
∗
. Since ECE(r) is strongly deterministic but ECE(r)
∗
is not, we can
take strings
¯
u
1
α
1
¯
av
1
and
¯
u
1
β
1
¯
aw
1
, so that
•
¯
u
1
α
1
¯
av
1
and
¯
u
1
β
1
¯
aw
1
are correctly bracketed and accepted by the brack
eted version of ECE(r)
∗
,
• α
1
, β
1
∈ Γ
∗
, and
• ]
m
1
[
m
1
is a substring of α
1
but not of β
1
,
Consider the expression r
0,∞
(which has a lower bound 0 instead of k) and let
[
m
2
be the opening bracket corresponding to its outermost iterator. According
to the above, there also exist correctly bracketed strings
¯
u
2
α
2
¯
av
2
and
¯
u
2
β
2
¯
aw
2
accepted by the bracketed version of r
0,∞
such that ]
m
2
[
m
2
is a substring of
α
2
but not of β
2
. This proves that r
0,∞
and therefore r
k,∞
is not strongly
deterministic. This hence leads to the desired contradiction, and shows that
EC(r) is indeed strongly deterministic.
Theorem 55. Let Σ be an alphabet with Σ = 1. Then,
DRE
#
S
= DRE
#
W
Proof. By deﬁnition, every strongly deterministic expression is also weakly
deterministic. Hence, it suﬃces to show that, over a unary alphabet, every
weakly deterministic language with counting can also be deﬁned by a strongly
deterministic expression with counting. We will do so by characterizing the
weakly deterministic languages with counting over a unary alphabet in terms of
their corresponding minimal DFAs. Thereto, we ﬁrst introduce some notation.
The following notions come from, e.g., Shallit [91], but we repeat them here
for completeness. (Shallit used tail to refer to what we call a chain.)
Deﬁnition 56. We say that a DFA over Σ = {a} is a chain if its start state is
q
0
and its transition function is of the form δ(q
0
, a) = q
1
, . . . , δ(q
n−1
, a) = q
n
,
where q
i
= q
j
when i = j. A DFA is a chain followed by a cycle if its transition
6.2. Expressive Power 85
function is of the form δ(q
0
, a) = q
1
, . . . , δ(q
n−1
, a) = q
n
, δ(q
n
, a) = q
n+1
, . . . ,
δ(q
n+m−1
, a) = q
n+m
, δ(q
n+m
, a) = q
n
, where q
i
= q
j
when i = j. The cycle
states of the latter DFA are q
n
, . . . , q
n+m
.
We say that a unary regular language L is ultimately periodic if L is inﬁnite
and its minimal DFA is a chain followed by a cycle, for which at most one of
the cycle states is ﬁnal.
The crux of the proof then lies in Lemma 57. It is well known (see, e.g.,
[91]) and easy to see that the minimal DFA for a regular language over a unary
alphabet is deﬁned either by a simple chain of states, or a chain followed by
a cycle. To this fact, the following lemma adds that for languages deﬁned by
weakly deterministic regular expressions only one node in this cycle can be
ﬁnal.
Lemma 57. Let Σ = {a}, and L ∈ REG(Σ), then L ∈ DRE
#
W
if and only if
L is either ﬁnite or ultimately periodic.
Before we prove this lemma, note that it implies Theorem 55. Indeed, any
ﬁnite language can clearly be deﬁned by a strongly deterministic expression,
while an inﬁnite but ultimately periodic language can be deﬁned by a strongly
deterministic expression of the form a
n
1
(a
n
2
(· · · a
n
k−1
(a
n
k
)
∗
· · · +ε)+ε), where
a
n
i
denotes the n
i
fold concatenation of a. Hence, it only remains to prove
Lemma 57. Thereto, we ﬁrst need a more reﬁned notion of ultimate periodicity
and an additional lemma, after which we conclude with the proof of Lemma 57.
We say that L over alphabet {a} is (n
0
, x)periodic if
(i) L ⊆ L((a
x
)
∗
), and
(ii) for every n ∈ N such that nx ≥ n
0
, L contains the string a
nx
, i.e., the
string of a’s of length nx.
We say that L is ultimately xperiodic if L is (n
0
, x)periodic for some n
0
∈ N.
Notice that these notions imply that L is inﬁnite. Clearly, any ultimately
xperiodic language is also ultimately periodic. However, the opposite does
not always hold. Indeed, in an ultimately xperiodic language all strings have
lengths which are multiples of x, i.e., they have length 0 (modulo x). In an
ultimately periodic language only all suﬃciently long string must have the
same length y (modulo x), for a ﬁxed y, which, moreover, can be diﬀerent
from 0.
Lemma 58. Let L be a language, x ∈ N, k ∈ N∪ {0}, and ℓ ∈ N∪ {∞} with
k ≤ ℓ. If L is ultimately xperiodic, then L
k,ℓ
is also ultimately xperiodic.
86 Deterministic Regular Expressions with Counting
Proof. Every string in L
k,ℓ
is a concatenation of (possibly empty) strings in
L. Since the length of every string in L is a multiple of x, it follows that the
length of every string in L
k,ℓ
is a multiple of x. Therefore, L
k,ℓ
⊆ L((a
x
)
∗
).
Furthermore, take n
0
∈ N such that L is (n
0
, x)periodic. Let k
0
:=
max{k, 1}. We will show that L
k,ℓ
is (k
0
(n
0
+ x), x)periodic, which proves
the lemma. Take n ∈ N such that nx ≥ k
0
(n
0
+ x). Hence, (
n
k
0
− 1)x ≥ n
0
since k
0
≥ 1 and therefore ⌊
n
k
0
⌋x ≥ n
0
. Hence, a
nx
= a
⌊n/k
0
⌋x
· · · a
⌊n/k
0
⌋x
a
r
0
x
,
where the dots abbreviate a k
0
fold concatenation of a
⌊n/k
0
⌋x
and r
0
:= n
mod ⌊n/k
0
⌋. Since L is (n
0
, x)periodic, a
⌊n/k
0
⌋x
∈ L and a
(⌊n/k
0
⌋+r
0
)x
∈ L.
Hence, a
nx
∈ L
k
0
, which implies that a
nx
∈ L
k,ℓ
.
We are now ﬁnally ready to prove Lemma 57 and thus conclude the proof
of Theorem 55
Proof. [of Lemma 57] Clearly, any ﬁnite language is in DRE
#
W
and any ulti
mately periodic language can already be deﬁned by a strongly deterministic
expression, and hence is also in DRE
#
W
. Hence, we only need to show that,
if an inﬁnite language is in DRE
#
W
, then it is ultimately periodic, i.e., can
be deﬁned with a DFA of the correct form. It is wellknown that, for every
inﬁnite unary language, the minimal DFA is a chain followed by a cycle. We
therefore only need to argue that, in this cycle, at most one of the nodes is
ﬁnal.
Let Σ = {a} and let r be a weakly deterministic regular expression with
counting over {a}, such that L(r) is inﬁnite. We can assume without loss of
generality that r does not contain concatenations with ε. Since r is over a
unary alphabet, we can make the following observations:
If r has a disjunction r
1
+r
2
then either L(r
1
) = {ε} or L(r
2
) = {ε}. (6.1)
There are no subexpressions of the form r
1
r
2
in r in which L(r
1
) > 1. (6.2)
Indeed, if these statements do not hold, then r is not weakly deterministic.
Our ﬁrst goal is to bring r into a normal form. More speciﬁcally, we want to
write r as
r = (r
1
(r
2
(· · · (r
n
)
p
n
,1
· · · )
p
3
,1
)
p
2
,1
)
p
1
,1
,
where
(a) for each i = 1, . . . , n, p
i
∈ {0, 1};
(b) r has no occurrences of ε;
(c) r
n
is a tower of counters, that is, it only has a single occurrence of a; and
6.2. Expressive Power 87
(d) L(r
1
), . . . , L(r
n−1
) are singletons, and L(r
n
) is inﬁnite.
Notice that, if r has the normal form and one of L(r
i
), with 1 ≤ i ≤ n −1, is
not a singleton, then this would immediately violate (6.2). In order to achieve
this normal form, we iteratively replace
(i) all subexpressions of the form (s
k,k
)
ℓ,ℓ
with s
kℓ,kℓ
;
(ii) all subexpressions of the form s
k
1
,k
1
s
k
2
,ℓ
2
with s
k
1
+k
2
,k
1
+ℓ
2
; and
(iii) all subexpressions of the form (s +ε) and (ε +s) with s
0,1
(iv) all subexpressions of the form a
k,k
s, where s = a
k
1
,ℓ
1
with a
k,k
s
1,1
until no such subexpressions occur anymore. These replacements preserve
the language deﬁned by r and preserve weak determinism. Due to (6.1) and
(6.2), these replacements turn r in the normal form above adhering to the
syntactic constraints (a)–(c). Furthermore, we know that condition (d) must
hold because r is weakly deterministic.
Hence, we can assume that r = (r
1
(r
2
(· · · (r
n
)
p
n
,1
· · · )
p
3
,1
)
p
2
,1
)
p
1
,1
in which
only L(r
n
) is inﬁnite. Notice that we can translate the regular expression
r
′
= (r
1
(r
2
(· · · (r
n−1
(X)
p
n
,1
)
p
n−1
,1
· · · )
p
3
,1
)
p
2
,1
)
p
1
,1
over alphabet {a, X} into a DFA which is a chain and which reads the symbol
X precisely once, at the end of the chain. Therefore, it suﬃces to prove now
that we can translate r
n
into a DFA which is a chain followed by a cycle, in
which at most one of the cycle nodes is ﬁnal. Indeed, if A
1
and A
2
are the
DFAs for r
′
and r
n
respectively, then we can intuitively obtain the DFA A for
r by concatenating these two DFAs. More formally, if q
1
is the unique state
in A
1
which has the transition δ(q
1
, X) = q
2
(and q
2
is ﬁnal in A
1
), and q
3
is the initial state of A
2
, then we can obtain A by taking the union of the
states and transitions of A
1
and A
2
, removing state q
2
, and merging states q
1
and q
3
into a new state, while preserving incoming and outgoing transitions.
The initial state of A is the initial state of A
1
and its ﬁnal state set is the
union of the ﬁnal state sets in A
1
and A
2
. Since A
1
is a chain, L(A) = L(r)
is ultimately periodic if and only if L(A
2
) = L(r
n
) is ultimately periodic. It
thus only remains to show that L(r
n
) is ultimately periodic.
Let s
k,∞
be the smallest subexpression of r
n
in which the upper bound
is unbounded. (Such an expression must exist, since L(r
n
) is inﬁnite.) We
will ﬁrst prove that L(s
k,∞
) is ultimately xperiodic for some x ∈ N. Due to
Lemma 58 and the structure of r
n
, this also proves that L(r
n
) is ultimately
xperiodic, and thus ultimately periodic, and concludes our proof.
88 Deterministic Regular Expressions with Counting
It therefore only remains to show that L(s
k,∞
) is ultimately xperiodic
for some x ∈ N. To this end, there are two possibilities: either s contains
a subexpression of the form (s
′
)
y
1
,y
2
with y
1
< y
2
, or it does not. If not,
then L(s
k,∞
) is of the form (a
x,x
)
k,∞
due to the replacement (i) above in our
normalization. In this case, L(s
k,∞
) is clearly (xk, x)periodic.
If s
k,∞
contains a subexpression of the form (s
′
)
y
1
,y
2
with y
1
< y
2
, then it is
of the form (((a
x,x
)
y
1
,y
2
)
h
1
1
,h
1
2
) · · · )
h
n
1
,h
n
2
)
k,∞
, where x can equal 1, and the h
i
1
,
h
i
2
denote a nesting of iterators. It is immediate that L(s
k,∞
) ⊆ L((a
x,x
)
∗
),
i.e., the length of every string in L(s
k,∞
) is a multiple of x. To show that
there also exists an n
0
such that s
k,∞
deﬁnes all strings of length mx with
m ∈ N and mx ≥ n
0
, let z = h
1
1
· h
2
1
· · · h
n
1
, or z = 1 when n = 0. Clearly,
(((a
x,x
)
y,y+1
)
z,z
)
k,∞
deﬁnes a subset of s
k,∞
and, hence, it suﬃces to show
that (((a
x,x
)
y,y+1
)
z,z
)
k,∞
deﬁnes all such strings of length mx ≥ n
0
.
Let n
0
= y
2
xkz. We show that for any m such that mx ≥ y
2
xkz, a
mx
is
deﬁned by (((a
x,x
)
y,y+1
)
z,z
)
k,∞
. Take any m ≥ y
2
kz, and note that it suﬃces to
show that a
m
is deﬁned by ((a
y,y+1
)
z,z
)
k,∞
. The latter is the case if there exists
an ℓ ≥ k and positive natural numbers y
0
, y
1
such that m = y
0
y + y
1
(y + 1)
and y
0
+ y
1
= ℓz. That is, ℓ denotes the number of iterations the topmost
iterator does, and hence must be at least as big as k, while y
0
(respectively,
y
1
) denotes the number of times the inner iterator reads y (respectively, y +1)
a’s. We now show that such ℓ, y
0
, and y
1
indeed exist.
Thereto, let ℓ be the biggest natural number such that yzℓ ≤ m, and
observe that, as m ≥ y
2
zk, it must hold that ℓ ≥ y and ℓ ≥ k. Then, let
y
1
= m− yzℓ and y
0
= ℓz − y
1
. It remains to verify that ℓ, y
0
and y
1
satisfy
the desired conditions. We already observed that ℓ ≥ k, and, by deﬁnition of
y
0
, also y
0
+ y
1
= ℓz. Further, to show that m = y
0
y + y
1
(y + 1), note that
m = yzℓ +y
1
and thus m = y(y
0
+y
1
) +y
1
= y
0
y +y
1
(y +1). Finally, we must
show that y
0
and y
1
are positive numbers. For y
1
this is clear as y
1
= m−yzℓ,
and ℓ is chosen such that yzℓ ≤ m. For y
0
, recall that ℓ ≥ y, and, hence,
y
0
≥ yz − y
1
. Assuming y
0
to be negative implies that yz ≥ y
1
= m− yzℓ.
However, the latter implies that ℓ is not chosen to be maximal, as then also
yz(ℓ + 1) ≤ m, which is a contradiction.
This concludes the proof of Lemma 57.
Theorem 59. Let Σ be an alphabet with Σ ≥ 2. Then,
DRE
#
S
DRE
#
W
Proof. By deﬁnition, DRE
#
S
⊂ DRE
#
W
. Hence, it suﬃces to show that the
inclusion is strict for a binary alphabet Σ = {a, b}. A witness for this strict
inclusion is r = (a
2,3
(b + ε))
∗
. As r is weakly deterministic, L(r) ∈ DRE
#
W
.
By applying the algorithm of Br¨ uggemannKlein and Wood on r for testing
6.3. Succinctness 89
whether a regular language L(r) is in DRE
W
[17], it can be seen that L(r) / ∈
DRE
W
, and thus also not in DRE
S
. By Theorem 54, we therefore have that
L(r) / ∈ DRE
#
S
.
Theorem 60. Let Σ be an arbitrary alphabet. Then,
DRE
#
W
REG(Σ)
Proof. Clearly, every language deﬁned by a regular expression with counting
is regular. Hence, it suﬃces to show that the inclusion is strict. We prove
that the inclusion is strict already for a unary alphabet Σ = {a}. Thereto,
consider the expression r = (aaa)
∗
(a + aa). We can easily see that L(r) / ∈
DRE
W
by applying the algorithm of Br¨ uggemannKlein and Wood [17] to
L(r). As r is over a unary alphabet, it follows from Theorems 54 and 55 that
L((aaa)
∗
(a +aa)) / ∈ DRE
#
W
, and hence DRE
#
W
REG(Σ).
6.3 Succinctness
In Section 6.2 we learned that DRE
#
W
strictly contains DRE
#
S
, prohibiting a
translation from weakly to strongly deterministic expressions with counting.
However, one could still hope for an eﬃcient algorithm which, given a weakly
deterministic expression known to be equivalent to a strongly deterministic
one, constructs this expression. However, this is not the case:
Theorem 61. For every n ∈ N, there exists an RE(#) expression r over
alphabet {a} which is weakly deterministic and of size O(n) such that every
strongly deterministic expression s, with L(r) = L(s), is of size at least 2
n
.
Before proving this theorem, we ﬁrst give a lemma used in its proof.
Lemma 62. Let r be a strongly deterministic regular expression over alphabet
{a} with only one occurrence of a. Then L(r) can be deﬁned by one of the
following expressions:
(1) (a
k,k
)
x,y
(2) (a
k,k
)
x,y
+ε
where k ∈ N, x ∈ N ∪ {0}, and y ∈ N ∪ {∞}.
Proof. First, suppose that L(r) does not contain ε. Since r has one occurrence
of a, r is a nesting of iterators. However, since r is strongly deterministic, r can
not have subexpressions of the form s
x,y
with L(s) > 1 and (x, y) = (1, 1). It
90 Deterministic Regular Expressions with Counting
follows immediately that r is of the form ((· · · (a
k
1
,k
1
)
k
2
,k
2
· · · )
k
n
,k
n
)
x,y
. Setting
k := k
1
×· · · ×k
n
proves this case.
Second, suppose that L(r) contains ε. If r is of the form s
1
+s
2
, then one
of s
1
and s
2
, say s
2
, does not contain an a, and hence L(s
2
) = {ε}. Then,
if L(s) does not contain ε, we are done due to the above argument. So, the
only remaining case is that r is a nesting of iterators, possibly deﬁning ε. We
can assume without loss of generality that r does not have subexpressions of
the form s
1,1
. Again, since r is strongly deterministic, it cannot have subex
pressions of the form s
x,y
with (x, y) = (0, 1) and L(s) −{ε} > 1. Hence, by
replacing subexpressions of the form (s
k,k
)
ℓ,ℓ
by s
kℓ,kℓ
and (s
0,1
)
0,1
by s
0,1
, r
can either be rewritten to (a
k,k
)
x,y
or ((a
k,k
)
x,y
)
0,1
. In the ﬁrst case, the lemma
is immediate and, in the second case, we can rewrite r as ((a
k,k
)
x,y
) +ε.
We are now ready to prove Theorem 61.
Proof. [of Theorem 61] Let r be the weakly deterministic expression (a
N+1,2N
)
1,2
for N = 2
n
. Clearly, the size of r is O(n). Note that r deﬁnes all strings of
length at least N+1 and at most 4N, except a
2N+1
. These expressions, in fact,
where introduced by Kilpel¨ ainen [63] when studying the inclusion problem for
weakly deterministic expressions with counting.
We prove that every strongly deterministic expression for L(r) is of size
at least 2
n
. Thereto, let s be a strongly deterministic regular expression for
L(r) with a minimal number of occurrences of the symbol a and, among those
minimal expressions, one of minimal size.
We ﬁrst argue that s is in a similar but more restricted normal form as
the one we have in the proof of Lemma 57. If s is minimal and strongly
deterministic, then
(a) if s has a disjunction of the form s
1
+ s
2
then either L(s
1
) = {ε} or
L(s
2
) = {ε};
(b) there are no subexpressions of the form s
1
s
2
in s in which L(s
1
) > 1 or
ε ∈ L(s
1
);
(c) there are no subexpressions of the forms
1,1
1
, (s
0,1
1
)
0,1
, aa
k,ℓ
, a
k,ℓ
a, a
k
1
,ℓ
1
a
k
2
,ℓ
2
,
or (a
k,k
)
ℓ,ℓ
;
(d) there are no subexpressions of the form (s
1
)
k,ℓ
with (k, ℓ) = (0, 1) and
L(s
1
) −{ε} > 1;
The reasons are as follows:
• (a),(b): otherwise, s is not weakly deterministic;
6.3. Succinctness 91
• (c): otherwise, s is not minimal; and
• (d): otherwise, by (c), ℓ ≥ 2, which implies that s is not strongly deter
ministic.
Due to (a), we can assume without loss of generality that s does not have any
disjunctions. Indeed, when L(s
2
) = {ε}, we can replace every subexpression
s
1
+ s
2
or s
2
+ s
1
with (s
1
)
0,1
since the latter expression has the same or a
smaller size as the former ones. From (a)–(d), and the fact that ε / ∈ L(s), it
then follows that s is of the form
s = s
1
(s
2
(· · · (s
m
)
0,1
· · · )
0,1
)
0,1
where each s
i
(i = 1, . . . , m− 1) is of the form a
k,k
. Notice that s
m
can still
be nontrivial according to (a)–(d), such as (a
7,7
)
8,9
or a
7,7
((a
2,2
)
0,1
)
2,2
. We
now argue that s
m
is either of the form s
′
m
or s
′′
m
s
′
m
, where s
′
m
and s
′′
m
have
only one occurrence of the symbol a. We already observed above that s
m
does
not have any disjunctions. So, the only remaining operators are counting and
conjunction.
Suppose, towards a contradiction, that s
m
has at least two conjunctions
(and, therefore, at least three occurrences of a). Because of (b) and (c),
s
m
cannot be of the form p
1
p
2
p
3
, since then L(p
1
p
2
) = 1 and p
1
p
2
can be
rewritten. Hence, s
m
is of the form p
1
p
2
, where either p
1
or p
2
is an iterator
which contains a conjunction. If p
1
contains a conjunction we know due to (b)
that L(p
1
) = 1 and then p
1
can be rewritten. If p
2
is an iterator that contains
a conjunction then p
2
is of the form p
0,1
3
due to (d) and since L(s
m
) = ∞. Due
to (c), p
3
cannot be of the form p
0,1
4
. Due to (d), p
3
cannot be of the form p
k,ℓ
4
with (k, ℓ) = (0, 1). Hence, p
3
= p
4
p
5
. But this means that s
m
= p
1
(p
4
p
5
)
0,1
which violates the deﬁnition of s
m
, being the innermost expression in the
normal form for s above (i.e., s
m
should have been p
4
p
5
). This shows that s
m
has at most one conjunction.
It now follows analogously as in the reasoning for p
3
above that s
m
is either
of the form s
′
m
or s
′′
m
s
′
m
, where s
′
m
and s
′′
m
have only one occurrence of the
symbol a.
We will now argue that, for each string of the form a
N+1
, . . . , a
2N
, its last
position is matched onto a diﬀerent symbol in s. Formally, let u
N+1
, . . . , u
2N
be the unique strings in L(s) such that u
i
 = i, for every i ∈ [N + 1, 2N].
We claim that the last symbol in each u
i
is diﬀerent. This implies that s
contains at least N occurrences of a, making the size of s at least N = 2
n
, as
desired. Thereto, let w
1
and w
2
be two diﬀerent strings in {u
N+1
, . . . , u
2N
}.
Without loss of generality, assume w
1
to be the shorter string. Towards a
contradiction, assume that w
1
and w
2
end with the same symbol x. Let
92 Deterministic Regular Expressions with Counting
s = s
1
(s
2
(· · · (s
m
)
0,1
· · · )
0,1
)
0,1
. Due to the structure of s and since both w
1
and w
2
are in L(s), this implies that x cannot occur in s
1
, . . . , s
m−1
and that
x must be the rightmost symbol in s. As s
m
is either of the form s
′
m
or s
′′
m
s
′
m
this implies x occurs in s
′
m
. Again due to the structure of s and s
m
, this means
that s
′
m
must always deﬁne a language of the form {w  vw ∈ L(r)}, where
v is a preﬁx of dm(w
1
). Considering the language deﬁned by s, L(s
′
m
) hence
contains the strings a
i
, . . . , a
i+k
, a
i+k+2
, . . . , a
i+k+2N
for some i ≥ 1 and k ≥ 0.
However, as s
′
m
only contains a single occurrence of the symbol a, Lemma 62
says that it cannot deﬁne such a language. This is a contradiction.
It follows that the size of s is at least 2
n
.
6.4 Counter Automata
Let C be a set of counter variables and α : C → N be a function assigning a
value to each counter variable. We inductively deﬁne guards over C, denoted
Guard(C), as follows: for every cv ∈ C and k ∈ N, we have that true, false,
cv = k, and cv < k are in Guard(C). Moreover, when φ
1
, φ
2
∈ Guard(C),
then so are φ
1
∧φ
2
, φ
1
∨φ
2
, and ¬φ
1
. For φ ∈ Guard(C), we denote by α = φ
that α models φ, i.e., that applying the value assignment α to the counter
variables results in satisfaction of φ.
An update is a set of statements of the form cv++ and reset(cv) in which
every cv ∈ C occurs at most once. By Update(C) we denote the set of all
updates.
Deﬁnition 63. A nondeterministic counter automaton (CNFA) is a 6tuple
A = (Q, q
0
, C, δ, F, τ) where Q is the ﬁnite set of states; q
0
∈ Q is the ini
tial state; C is the ﬁnite set of counter variables; δ : Q × Σ × Guard(C) ×
Update(C) × Q is the transition relation; F : Q → Guard(C) is the accep
tance function; and τ : C → N assigns a maximum value to every counter
variable.
Intuitively, A can make a transition (q, a, φ, π, q
′
) whenever it is in state q,
reads a, and guard φ is true under the current values of the counter variables.
It then updates the counter variables according to the update π, in a way
we explain next, and moves into state q
′
. To explain the update mechanism
formally, we introduce the notion of conﬁguration. Thereto, let max(A) =
max{τ(c)  c ∈ C}. A conﬁguration is a pair (q, α) where q ∈ Q is the
current state and α : C → {1, . . . , max(A)} is the function mapping counter
variables to their current value. Finally, an update π transforms α into π(α)
by setting cv := 1, when reset(cv) ∈ π, and cv := cv + 1 when cv++ ∈ π and
α(cv) < τ(cv). Otherwise, the value of cv remains unaltered.
6.4. Counter Automata 93
Let α
0
be the function mapping every counter variable to 1. The initial
conﬁguration γ
0
is (q
0
, α
0
). A conﬁguration (q, α) is ﬁnal if α = F(q). A
conﬁguration γ
′
= (q
′
, α
′
) immediately follows a conﬁguration γ = (q, α) by
reading a ∈ Σ, denoted γ →
a
γ
′
, if there exists (q, a, φ, π, q
′
) ∈ δ with α = φ
and α
′
= π(α).
For a string w = a
1
· · · a
n
and two conﬁgurations γ and γ
′
, we denote by
γ ⇒
w
γ
′
that γ →
a
1
· · · →
a
n
γ
′
. A conﬁguration γ is reachable if there exists
a string w such that γ
0
⇒
w
γ. A string w is accepted by A if γ
0
⇒
w
γ
f
where
γ
f
is a ﬁnal conﬁguration. We denote by L(A) the set of strings accepted by
A.
A CNFA A is deterministic (or a CDFA) if, for every reachable conﬁgu
ration γ = (q, α) and for every symbol a ∈ Σ, there is at most one transition
(q, a, φ, π, q
′
) ∈ δ such that α = φ. Note that, as this deﬁnition only concerns
reachable conﬁgurations, it is not straightforward to test whether an CNFA
is deterministic (as will also become clear in Theorem 64(6)). However, as
nonreachable conﬁgurations have no inﬂuence on the determinism of runs on
the automaton, we choose not to take them into account.
The size of a transition θ or acceptance condition F(q) is the number of
symbols which occur in it plus the size of the binary representation of each
integer occcurring in it. By the same token, the size of A, denoted by A, is
Q +
¸
q∈Q
log τ(q) +F(q) +
¸
θ∈δ
θ.
Theorem 64. 1. Given CNFAs A
1
and A
2
, a CNFA A accepting the union
or intersection of A
1
and A
2
can be constructed in polynomial time.
Moreover, when A
1
and A
2
are deterministic, then so is A.
2. Given a CDFA A, a CDFA which accepts the complement of A can be
constructed in polynomial time.
3. membership for word w and CDFA A is in time O(wA).
4. membership for nondeterministic CNFA is npcomplete.
5. emptiness for CDFAs and CNFAs is pspacecomplete.
6. Deciding whether a CNFA A is deterministic is pspacecomplete.
Proof. We ﬁrst note that a CNFA A = (Q, q
0
, C, δ, F, τ) can easily be com
pleted. For q ∈ Q, a ∈ Σ deﬁne Formulas(q, a) = {φ  (q, a, φ, π, q
′
) ∈ δ}.
We say that A is complete if, for any value function α, it holds that α =
φ∈Formulas(q,a)
φ. That is, for any conﬁguration (q, α) and symbol a, there is
always a transition which can be followed by reading a.
Deﬁne the completion A
c
of A as A
c
= (Q
c
, q
0
, C, δ
c
, F
c
, τ
c
) with Q
c
=
Q∪{q
s
}, δ
c
is δ extended with (q
s
, a, true, ∅, q
s
) and, for all q ∈ Q, a ∈ Σ, with
94 Deterministic Regular Expressions with Counting
the tuples (q, a, φ
c
a,q
, φ, q
s
), where φ
c
a,q
= ¬
φ∈Formulas(q,a)
φ. Finally, F
c
is F
extended with F
c
(q
s
) = false. Note also that A
c
is deterministic if and only
if A is deterministic. Hence, from now on we can assume that all CNFAs and
CDFAs under consideration are complete. We now prove the six statements.
(1) Given two complete CNFAs A
1
, A
2
, where A
i
= (Q
i
, q
i
, C
i
, δ
i
, F
i
, τ
i
), their
union can be deﬁned as A = (Q
1
×Q
2
, (q
1
, q
2
), C
1
⊎ C
2
, δ, F, τ
1
∪ τ
2
). Here,
• δ = {((s
1
, s
2
), a, φ
1
∧ φ
2
, π
1
∪ π
2
, (s
′
1
, s
′
2
))  (s
i
, a, φ
i
, π
i
, s
′
i
) ∈ δ
i
for i =
1, 2}; and
• F(s
1
, s
2
) = F
1
(s
1
) ∨ F
2
(s
2
).
For the intersection of A
1
and A
2
, the deﬁnition is completely analogue, only
the acceptance condition diﬀers: F(s
1
, s
2
) = F(s
1
)∧F(s
2
). It is easily veriﬁed
that if A
1
and A
2
are both deterministic, then A is also deterministic.
(2) Given a complete CDFA A = (Q, q, C, δ, F, τ), the CDFA A
′
= (Q, q, C, δ,
F
′
, τ), where F
′
(q) = ¬F(q), for all q ∈ Q, deﬁnes the complement of A.
(3) Since A is a CDFA, from every reachable conﬁguration only one transition
can be followed when reading a symbol. Hence, we can match the string w
from left to right, maintaining at any time the current conﬁguration of A.
Hence, we need to make w + 1 transitions, while every transition can be
made in time O(A). Therefore, testing membership can be done in time
O(wA).
(4) Obviously, we can decide in nondeterministic polynomial time whether
a string w is accepted by a CNFA A. It suﬃces to guess a sequence of w
transitions, and check whether this sequence forms a valid run of A on w.
To show that it is nphard, we do a reduction from bin packing [36]. This
is the problem, given a set of weights W = {n
1
, . . . , n
k
}, a packet size p, and
the maximum number of packets m; decide whether there is a partitioning of
W = S
1
⊎S
2
⊎· · · ⊎S
m
such that for each i ∈ [1, m], we have that
¸
n∈S
i
n ≤ p.
The latter problem is NPcomplete, even when all integers are given in unary.
We will construct a string w and a CNFA A such that w ∈ L(A) if and
only if there exists a proper partitioning for W = {n
1
, . . . , n
k
}, p, and m. Let
Σ = {#, 1}, and w = #1
n
1
##1
n
2
#· · · #1
n
k
#. Further, A will have an initial
state q
0
and for every j ∈ [1, m] a state q
j
and countervariable cv
j
. When
reading the string w the automaton will nondeterministically guess for each
n
i
to which of the m sets n
i
is assigned (by going to its corresponding state)
and maintaining the running sum of the diﬀerent sets in the countervariables.
(As the initial value of the countervariables is 1, we actually store the running
sum plus 1) In the end it can then easily be veriﬁed whether the chosen
partitioning satisﬁes the desired properties.
6.4. Counter Automata 95
Formally, A = (Q, q
0
, {cv
1
, . . . , cv
m
}, δ, F, τ) can be deﬁned as follows:
• Q = {q
0
, . . . , q
m
};
• For all i ∈ [1, m], (q
0
, #, true, ∅, q
i
), (q
i
, 1, true, {cv
i
++}, q
i
), (q
i
, #, true,
∅, q
0
) ∈ δ;
• F(q
0
) =
i∈[1,m]
cv
i
≤ p + 1, and for q = q
0
, F(q) = false; and
• for all i ∈ [1, m], τ(cv
i
) = p + 2.
(5) We show that emptiness is in pspace for CNFAs, and pspacehard for
CDFAs. The theorem then follows.
The algorithm for the upperbound guesses a string w which is accepted
by A. Instead of guessing w at once, we guess it symbol by symbol and
store one conﬁguration γ, such that A can be in γ after reading the already
guessed string w. When γ is an accepting conﬁguration, we have guessed a
string which is accepted by A, and have thus shown that L(A) is nonempty.
Since any conﬁguration can be stored in space polynomial in A (due to the
maximum values τ on the countervariables), we hence have a nondeterministic
polynomial space algorithm for the complement of the emptiness problem.
As npspace = pspace, and pspace is closed under complement, it follows
that emptiness for CNFAs is in pspace.
We show that the emptiness problem for CDFAs is pspacehard by a
reduction from reachability of 1safe Petri nets.
A net is a triple N = (S, T, E), where S and T are ﬁnite sets of places and
transitions, and E ⊆ (S × T) ∪ (T × S) → {0, 1}. The preset of a transition
t is denoted by •t, and deﬁned by •t = {s  E(s, t) = 1}. A marking is a
mapping M : S → N. A Petri net is a pair N = (N, M
0
), where N is a net
and M
0
is the initial marking. A transition t is enabled at a marking M if
M(s) > 0 for every s ∈ •t. If t is enabled at M, then it can ﬁre, and its
ﬁring leads to the successor marking M
′
which is deﬁned for every place s by
M
′
(s) = M(s) + E(t, s) − E(s, t). The expression M →
t
M
′
denotes that M
enables transition t, and that the marking reached by the ﬁring of t is M
′
.
Given a (ﬁring) sequence σ = t
1
· · · t
n
, M ⇒
σ
M
′
denotes that there exist
markings M
1
, M
2
, . . . , M
n−1
such that M →
t
1
M
1
· · · M
n−1
→
t
n
M
′
. A Petri
net is 1safe if M(s) ≤ 1 for every place s and every reachable marking M.
reachability for 1safe Petri nets is the problem, given a 1safe Petri net
(N, M
0
) and a marking M, is M reachable from M
0
. That is, does there exist
a sequence σ such that M
0
⇒
σ
M. The latter problem is pspacecomplete
[32].
We construct a CDFA A such that L(A) = ∅ if and only if M is not
reachable from M
0
. That is, A will accept all strings at
1
· · · t
k
such that
96 Deterministic Regular Expressions with Counting
M
0
⇒
t
1
···t
k
M. To achieve this, A will simulate the working of the Petri
net on the ﬁring sequence given by the string. Therefore, we maintain a
countervariable for every place s, which will have value 1 when M
′
(s) = 0 and
2 when M
′
(s) = 1, where M
′
is the current marking. With every transition of
the Petri net, we associate a transition in A. Here, the guard φ is used to check
whether the preconditions for the ﬁring of this transition are satisﬁed, and the
updates are used to update the values of the countervariables according to the
transition. The string is accepted if after executing this ﬁring sequence, the
countervariables correspond to the given marking M.
Formally, let N = (S, T, E). Then, the CDFA A = ({q
0
, q
1
}, q
0
, S, δ, F, τ)
is deﬁned as follows:
• (q
0
, a, true, π, q
1
) ∈ δ, where π = {s++  M
0
(s) = 1};
• For every t ∈ T: (q
1
, t, φ, π, q
1
) ∈ δ, where φ =
s∈•t
s = 2 and π =
{reset(s)  E(s, t) > E(t, s)} ∪ {s++  E(t, s) > E(s, t)};
• F(q
0
) = false, and F(q
1
) =
M(s)=0
s = 1 ∧
M(s)=1
s = 2; and
• For every s ∈ S, τ(s) = 2.
(6) To show that it is in pspace, we guess a string w character by character and
only store the current conﬁguration. If at any time it is possible to follow more
than one transition at once, A is nondeterministic. Again, since npspace =
pspace and pspace is closed under complement, the result follows.
To show that the problem is pspacehard, we do a reduction fromone run
of 1safe Petri nets. This is the problem, given a 1safe Petri net N = (N, M
0
),
is there exactly one run on N. The latter problem is pspacecomplete [32].
We construct a CNFA A which is deterministic if and only if one run is
true for N. To do this, we set the alphabet of A to {a, t} and A will accept
strings of the form at
∗
, and simulates the working of N. The set of states
Q = {q
0
, q
c
, t
1
, . . . , t
n
}, when T = {t
1
, . . . , t
n
}. Furthermore, there is one
countervariable for every place: C = S.
The automaton now works as follows. From its initial state q
0
it goes to
its central state q
c
and sets the values of the countervariables to the initial
marking M
0
. From q
c
, there is a transition to every state t
i
. This transition
can be followed by reading a t if and only if the values of the countervariables
satisfy the necessary conditions to ﬁre the transition t
i
. Then, from the states
t
i
, there is a transition back to q
c
, which can be followed by reading a t and
is used to update the countervariables according to the ﬁring of t
i
. As in
the previous proof, the countervariable for place s will have value 1 when
M
′
(s) = 0 and 2 when M
′
(s) = 1, where M
′
is the current marking.
More formally, A = ({q
0
, q
c
} ∪ T, q
0
, S, δ, F, τ) is constructed as follows:
6.5. From RE(#) to CNFA 97
• (q
0
, a, true, π, q
c
) ∈ δ, where π = {s++  M
0
(s) = 1};
• For all t
i
∈ T, (q
c
, t, φ, ∅, t
i
) ∈ δ, where φ =
s∈•t
i
s = 2;
• For all t
i
∈ T, (t
i
, t, true, π, q
c
) ∈ δ, where π = {reset(s)  E(s, t
i
) >
E(t
i
, s)} ∪ {s++  E(t
i
, s) > E(s, t
i
)};
• For all s ∈ S, τ(s) = 2; and
• F(q) = true, for all q ∈ Q.
6.5 From RE(#) to CNFA
In this section, we show how an RE(#) expression r can be translated in
polynomial time into an equivalent CNFA G
r
by applying a natural extension
of the wellknown Glushkov construction [15]. We emphasize at this point that
such an extended Glushkov construction has already been given by Sperberg
McQueen [101]. Therefore, the contribution of this section lies mostly in the
characterization given below: G
r
is deterministic if and only if r is strongly
deterministic. Moreover, as seen in the previous section, CDFAs have desirable
properties which by this translation also apply to strongly deterministic RE(#)
expressions. We refer to G
r
as the Glushkov counting automaton of r.
We ﬁrst provide some notation and terminology needed in the construction
below. For a marked expression r, a marked symbol x and an iterator c of r we
denote by iterators
r
(x, c) the list of all iterators of c which contain x, except c
itself. For marked symbols x, y, we denote by iterators
r
(x, y) all iterators of r
which contain x but not y. Finally, let iterators
r
(x) be the list of all iterators
of r which contain x. Note that all such lists [c
1
, . . . , c
n
] contain a sequence
of nested subexpressions. Therefore, we will always assume that they are
ordered such that c
1
≺ c
2
≺ · · · ≺ c
n
. That is, they form a sequence of nested
subexpressions. Moreover, whenever r is clear from the context, we omit it
as a subscript. For example, if r = ((a
1,2
1
b
1
)
3,4
)
5,6
, then iterators(a
1
, r) =
[a
1,2
1
, (a
1,2
1
b
1
)
3,4
], iterators(a
1
, b
1
) = [a
1,2
1
], and iterators(a
1
) = [a
1,2
1
, (a
1,2
1
b
1
)
3,4
,
((a
1,2
1
b
1
)
3,4
)
5,6
].
We now deﬁne the set follow(r) for a marked regular expression r. As in
the standard Glushkov construction, this set lies at the basis of the transition
relation of G
r
. The set follow(r) contains triples (x, y, c), where x and y are
marked symbols and c is either an iterator or null. Intuitively, the states of
G
r
will be a designated start state plus a state for each symbol in Char(r).
A triple (x, y, c) then contains the information we need for G
r
to make a
98 Deterministic Regular Expressions with Counting
transition from state x to y. If c = null, this transition iterates over c and all
iterators in iterators(x, c) are reset by going to y. Otherwise, if c equals null,
the iterators in iterators(x, y) are reset. Formally, the set follow(r) contains
for each subexpression s of r,
• all tuples (x, y, null) for x in last(s
1
), y in ﬁrst(s
2
), and s = s
1
s
2
; and
• all tuples (x, y, s) for x in last(s
1
), y in ﬁrst(s
1
), and s = s
k,ℓ
1
.
We introduce a counter variable cv(c) for every iterator c in r whose value
will always denote which iteration of c we are doing in the current run on the
string. We deﬁne a number of tests and update commands on these counter
variables:
• valuetest([c
1
, . . . , c
n
]) :=
c
i
(lower(c
i
) ≤ cv(c
i
)) ∧ (cv(c
i
) ≤ upper(c
i
)).
When we leave the iterators c
1
, . . . , c
n
we have to check that we have
done an admissible number of iterations for each iterator.
• upperboundtest(c) := cv(c) < upper(c) when c is a bounded itera
tor and, otherwise, upperboundtest(c) := true. When iterating over a
bounded iterator, we have to check that we can still do an extra iteration.
• reset([c
1
, . . . , c
n
]) := {reset(cv(c
1
)), . . . , reset(cv(c
n
))}. When leaving
some iterators, their values must be reset. The counter variable is reset
to 1, because at the time we reenter this iterator, its ﬁrst iteration is
started.
• update(c) := {cv(c)++}. When iterating over an iterator, we start a
new iteration and increment its number of transitions.
We now deﬁne the Glushkov counting automaton G
r
= (Q, q
0
, C, δ, F, τ).
The set of states Q is the set of symbols in r plus an initial state, i.e., Q :=
{q
0
}⊎
¸
x∈Char(r)
q
x
. Let C = {cv(c)  c is an iterator of r}. We next deﬁne the
transition function. For all y ∈ ﬁrst(r), (q
0
, dm(y), true, ∅, q
y
) ∈ δ.
4
For every
element (x, y, c) ∈ follow(r), we deﬁne a transition (q
x
, dm(y), φ, π, q
y
) ∈ δ. If
c = null, then φ := valuetest(iterators(x, y)) and π := reset(iterators(x, y)).
If c = null, then φ := valuetest(iterators(x, c)) ∧ upperboundtest(c) and
π := reset(iterators(x, c)) ∪ update(c). The acceptance criteria of G
r
depend
on the set last(r). For any symbol x / ∈ last(r), F(q
x
) := false. For every
element x ∈ last(r), F(q
x
) := valuetest(iterators(x)). Here, we test whether
we have done an admissible number of iterations of all iterators in which x is
located. Finally, F(q
0
) := true if ε ∈ L(r). Lastly, for all bounded iterators
4
Recall that dm(y) denotes the demarking of y.
6.5. From RE(#) to CNFA 99
c, τ(cv(c)) = upper(c) since c never becomes larger than upper(c), and for all
unbounded iterators c, τ(cv(c)) = lower(c) as there are no upper bound tests
for cv(c).
Theorem 65. For every RE(#) expression r, L(G
r
) = L(r). Moreover, G
r
is
deterministic if and only if r is strongly deterministic.
This theorem will largely follow from Lemma 66, for which we ﬁrst intro
duce some notation. The goal will be to associate transition sequences of G
r
with correctly bracketed words in
¯
r, the bracketing of r. A transition sequence
σ = t
1
, . . . , t
n
of G
r
is simply a sequence of transitions of G
r
. It is accepting
if there exists a sequence of conﬁgurations γ
0
γ
1
, . . . , γ
n
such that γ
0
is the
inital conﬁguration, γ
n
is a ﬁnal conﬁguration, and, for every i = 1, . . . , n−1,
γ
i−1
→
a
i
γ
i
by using transition t
i
= (q
i−1
, a
i
, φ
i
, π
i
, q
i
). Here, for each i,
γ
i
= (q
i
, α
i
).
Recall that the bracketing
¯
r of an expression r is obtained by associating
indices to iterators in r and by replacing each iterator c := s
k,ℓ
of r with index
i with ([
i
s]
i
)
k,ℓ
. In the following, we use ind(c) to denote the index i associated
to iterator c. As every triple (x, y, c) in follow(r) corresponds to exactly one
transition t in G
r
, we also say that t is generated by (x, y, c).
We now want to translate transition sequences into correctly bracketed
words. Thereto, we ﬁrst associate a bracketed word word
G
r
(t) to each transi
tion t of G
r
. We distinguish a few cases:
(i) If t = (q
0
, a, φ, π, q
y
), with iterators(y) = [c
1
, . . . , c
n
], then word
G
r
(t) =
[
ind(c
n
)
· · · [
ind(c
1
)
y.
(ii) If t = (q
x
, a, φ, π, q
y
) and is generated by (x, y, c) ∈ follow(r), with c =
null. Let iterators(x, c) = [c
1
, . . . , c
n
] and iterators(y, c) = [d
1
, . . . , d
m
].
Then, word
G
r
(t) =]
ind(c
1
)
· · · ]
ind(c
n
)
[
ind(d
m
)
· · · [
ind(d
1
)
y.
(iii) If t = (q
x
, a, φ, π, q
y
) and is generated by (x, y, null) ∈ follow(r). Let
iterators(x, y) = [c
1
, . . . , c
n
] and let iterators(y, x) = [d
1
, . . . , d
m
]. Then,
word
G
r
(t) =]
ind(c
1
)
· · · ]
ind(c
n
)
[
ind(d
m
)
· · · [
ind(d
1
)
y.
Finally, to a marked symbol x with iterators(x) = [c
1
, . . . , c
n
], we associate
word
G
r
(x) =]
ind(c
1
)
· · · ]
ind(c
n
)
. Now, for a transition sequence σ = t
1
, . . . , t
n
,
where t
n
= (q
x
, a, φ, π, q
y
), we set brackseq
G
r
(σ) = word
G
r
(t
1
) · · · word
G
r
(t
n
)
word
G
r
(y). Notice that brackseq
G
r
(σ) is a bracketed word over a marked
alphabet. We sometimes also say that σ encodes brackseq
G
r
(σ). We usually
omit the subscript G
r
from word and brackseq when it is clear from the
context.
100 Deterministic Regular Expressions with Counting
For a bracketed word ¯ w, let strip(w) denote the word obtained from ¯ w
by removing all brackets. Then, for a transition sequence σ, deﬁne run(σ) =
strip(brackseq(σ)).
Lemma 66. Let r be a marked regular expression with counting and G
r
the
corresponding counting Glushkov automaton for r.
(1) For every string
¯
w, we have that
¯
w is a correctly bracketed word in L(
¯
r)
if and only if there exists an accepting transition sequence σ of G
r
such
that
¯
w = brackseq(σ).
(2) L(r) = {run(σ)  σ is an accepting transition sequence of G
r
}.
Proof. We ﬁrst prove (1) by induction on the structure of r. First, notice
that brackseq(σ) is a correctly bracketed word for every accepting transition
sequence σ on G
r
. Hence, we can restrict attention to correctly bracketed
words below.
For the induction below, we ﬁrst ﬁx a bracketing
¯
r of r and we assume
that all expressions
¯
s,
¯
s
1
,
¯
s
2
are correctly bracketed subexpressions of
¯
r.
• s = x. Then, also
¯
s = x, and L(
¯
s) = x. It is easily seen that the only
accepting transition sequence σ of G
s
consists of one transition t, with
brackseq(σ) = x.
• s = s
1
+ s
2
. Clearly, the set of all correctly bracketed words in L(
¯
s) is
the union of all correctly bracketed words in L(
¯
s
1
) and L(
¯
s
2
). Further,
observe that G
s
is constructed from G
s
1
and G
s
2
by identifying their
initial states and taking the disjoint union otherwise. As the initial states
only have outgoing, and no incoming, transitions, the set of accepting
transition sequences of G
s
is exactly the union of the accepting transition
sequences of G
s
1
and G
s
2
. Hence, the lemma follows from the induction
hypothesis.
• s = s
1
· s
2
. Let
¯
w be a correctly bracketed word. We distinguish a few
cases. First, assume
¯
w ∈ Char(
¯
s
1
)
∗
, i.e.,
¯
w contains only symbols of
¯
s
1
.
Then, if ε / ∈ L(s
2
),
¯
w / ∈ L(
¯
s), and, by construction of G
s
, there is no
accepting transition sequence σ on G
s
such that brackseq(σ) =
¯
w. This
is due to the fact that, with ε / ∈ L(s
2
), for all x ∈ Char(s
2
), we have
that x / ∈ last(s) and hence F(q
x
) = false. Hence, assume ε ∈ L(s
2
).
Then,
¯
w ∈ L(
¯
s) if and only if
¯
w ∈ L(
¯
s
1
) if and only if, by the induction
hypothesis, there is an accepting transition sequence σ on G
s
1
, with
brackseq(σ) =
¯
w. By construction of G
s
, and the fact that ε ∈ L(s
2
),
brackseq(σ) =
¯
w if and only if σ is also an accepting transition sequence
6.5. From RE(#) to CNFA 101
on G
s
. This settles the case
¯
w ∈ Char(
¯
s
1
)
∗
. The case
¯
w ∈ Char(
¯
s
2
)
∗
can be handled analogously.
Finally, consider the case that
¯
w contains symbols from both
¯
s
1
and
¯
s
2
. If
¯
w is not of the form
¯
w =
¯
w
1
¯
w
2
, with w
1
∈ Char(
¯
s
1
)
∗
and
w
2
∈ Char(
¯
s
2
)
∗
, then we immediately have that
¯
w / ∈ L(
¯
s) nor can there
be a transition sequence σ on G
s
encoding
¯
w. Hence, assume
¯
w is of
this form. Then,
¯
w ∈ L(
¯
s) if and only if
¯
w
i
∈ L(
¯
s
i
), for i = 1, 2 if
and only if, by the induction hypothesis, there exist accepting transition
sequences σ
1
(resp. σ
2
) on G
s
1
(resp. G
s
2
) encoding
¯
w
1
(resp.
¯
w
2
). Let
σ
1
= t
1
, . . . , t
n
and σ
2
= t
n+1
, . . . , t
m
, with q
x
the target state of t
n
,
and q
y
the target state of t
n+1
. Further, let t = (q
x
, a, φ, π, q
y
) be the
unique transition of G
s
generated by the tuple (x, y, null) ∈ follow(s).
We claim that σ = t
1
, . . . , t
n
, t, t
n+2
, . . . , t
m
is an accepting transition
sequence on G
s
with brackseq(σ) = brackseq(σ
1
)brackseq(σ
2
), and
hence brackseq(σ) =
¯
w. To see that σ is an accepting transition se
quence on G
s
note that the guard F(q
x
) (in G
s
1
) is equal to φ, the
guard of t. Hence, the fact that σ
1
is an accepting transition sequence
on G
s
1
ensures that t can be followed in G
s
. Further, note that after
following transition t, all counters are reset, as they were after following
t
n+1
in G
s
2
. This ensures that σ
1
and σ
2
can indeed be composed to
the accepting transition sequence σ in this manner. Further, to see that
brackseq
G
s
(σ) = brackseq
G
s
1
(σ
1
)brackseq
G
s
2
(σ
2
) it suﬃces to observe
that word
G
s
1
(q
x
)word
G
s
2
(t
n+1
) = word
G
s
(t), by deﬁnition. Hence, σ is
an accepting transition sequence on G
s
, with brackseq
G
s
(σ) =
¯
w, as
desired. Conversely, we need to show that any such transition sequence
σ can be decomposed in accepting transition sequences σ
1
and σ
2
satis
fying the same conditions. This can be done using the same reasoning
as above.
• s = s
[k,ℓ]
1
. Let
¯
s = ([
j
¯
s
1
]
j
)
k,ℓ
and
¯
w be a correctly bracketed word.
First, assume
¯
w ∈ L(
¯
s), we show that there is an accepting transition
sequence σ on G
s
encoding
¯
w. As
¯
w is correctly bracketed, we can write
¯
w = [
j
¯
w
1
]
j
[
j
¯
w
2
]
j
· · · [
j
¯
w
n
]
j
, with n ∈ [k, ℓ], and
¯
w
i
∈ L(
¯
s
1
),
¯
w
i
= ε,
for all i ∈ [1, n]. Then, by the induction hypothesis, there exist valid
transition sequences σ
1
, . . . , σ
n
on G
s
1
encoding w
i
, for each i ∈ [1, n].
For all i ∈ [1, n], let σ
i
= t
i
1
, . . . , t
i
m
i
, for some m
i
, the target state of
t
i
1
be q
y
i
and the target state of t
i
m
i
be q
x
i
. For all i ∈ [1, n − 1], let
t
i
be the unique transition generated by (x
i
, y
i+1
, s) ∈ follow(s), and
deﬁne σ = t
1
1
, . . . , t
1
m
1
, t
1
, t
2
2
, . . . , t
2
m
2
, t
2
, t
3
2
, . . . , t
n−1
, t
n
2
, . . . , t
n
m
n
. It now
suﬃces to show that σ is an accepting transition sequence on G
s
and
102 Deterministic Regular Expressions with Counting
brackseq(σ) =
¯
w. The reasons are analogous to the ones for the con
structed transition sequence σ in the previous case (s = s
1
· s
2
). To see
that σ is an accepting transition sequence, note that we simply execute
the diﬀerent transition sequences σ
1
to σ
n
one after another, separated
by iterations over the topmost iterator s, by means of the transitions t
1
to t
n
. As these transitions reset all counter variables (except cv(s)) the
counter variables on each of these separate runs always have the same val
ues as they had in the runs on G
s
1
. Therefore, it suﬃces to argue that the
transitions t
i
can safely be followed in the run σ, and that we ﬁnally ar
rive in an accepting conﬁguration. This is both due to the fact that we do
n such iterations, with n ∈ [k, ℓ], and each iteration increments cv(s) by
exactly 1. To see that brackseq
G
s
(σ) =
¯
w, note that, by induction,
¯
w =
[
j
brackseq
G
s
1
(σ
1
)]
j
[
j
brackseq
G
s
1
(σ
2
)]
j
· · · [
j
brackseq
G
s
1
(σ
n
)]
j
. There
fore, it suﬃces to observe that word
G
s
(t
1
1
) = [
j
word
G
s
1
(t
1
1
), word
G
s
(t
n
m
n
)
= word
G
s
1
(t
n
m
n
)]
j
, for all i ∈ [1, n − 1], word
G
s
(t
i
) = word
G
s
1
(x
i
)]
j
[
j
word
G
s
1
(t
i+1
1
), and word
G
s
(t) = word
G
s
1
(t), for all other transitions t
occurring in σ.
Conversely, assume that there exists an accepting transition sequence σ
on G
s
encoding
¯
w. We must show
¯
w ∈ L(
¯
s). This can be done using
arguments analogous to the ones above. It suﬃces to note that σ contains
n transitions t
1
to t
n
, for some n ∈ [k, ℓ], generated by a tuple of the
form (x, y, s) ∈ follow(s). This allows to decompose σ into n accepting
transition sequences σ
1
to σ
n
and apply the induction hypothesis to
obtain the desired result.
To prove the second point, i.e. L(s) = {run(σ)  σ is an accepting transition
sequence on G
s
}, it suﬃces to observe that
L(s) = {strip(
¯
w) 
¯
w ∈ L(
¯
s) and
¯
w correctly bracketed}
= {strip(brackseq(σ))  σ is an accepting transition sequence on G
s
}
= {run(σ)  σ is an accepting transition sequence on G
s
}
Here, the ﬁrst equality follows immediately from Lemma 67 below, the second
equality from the ﬁrst point of this lemma, and the third is immediate from
the deﬁnitions.
The following lemma, which we state without proof, is immediate from
the deﬁnitions. However, notice that it is important in this lemma that, for
subexpressions s
k,ℓ
with s nullable, we have k = 0, as required by the normal
form for RE(#).
6.5. From RE(#) to CNFA 103
Lemma 67. Let r ∈ RE(#), and ¯ r be the bracketing of r. Then,
L(r) = {strip( ¯ w)  ¯ w ∈ L(¯ r) and ¯ w is correctly bracketed}.
We are now ready to prove Theorem 65.
Proof. [of Theorem 65] Let r ∈ RE(#), r a marking of r, and G
r
its corre
sponding counting Glushkov automaton.
We ﬁrst show that L(r) = L(G
r
). Thereto, observe that (1) L(r) =
{dm(w)  w ∈ L(r)}, by deﬁnition of r, and (2) L(G
r
) = {dm(run(σ))  σ
is an accepting transition sequence on G
r
} by deﬁnition of G
r
and the run
predicate. As, by Lemma 66, L(r) = {run(σ)  σ is an accepting transition
sequence on G
r
}, we hence obtain L(r) = L(G
r
).
We next turn to the second statement: r is strongly deterministic if and
only if G
r
is deterministic. For the right to left direction, suppose r is not
strongly deterministic, we show that then G
r
is not deterministic. Here, r
can be not strongly deterministic for two reasons: either it is not weakly
deterministic, or it is but violates the second criterion in Deﬁnition 52.
First, suppose r is not weakly deterministic, and hence there exists words
u, v, w and symbols x, y, with x = y but dm(x) = dm(y) such that uxv ∈
L(r), and uyw ∈ L(r). Then, by Lemma 66, there exist accepting transition
sequences σ
1
= t
1
, . . . , t
n
and σ
2
= t
′
1
, . . . , t
′
m
such that run(σ
1
) = uxv and
run(σ
2
) = uyw. Let u = i. Then, t
i+1
= t
′
i+1
, as t
i+1
is a transition to q
x
,
and t
′
i+1
to q
y
, with q
x
= q
y
. Let j ∈ [1, i+1] be the smallest number such that
t
j
= t
′
j
, which must exist as t
i+1
= t
′
i+1
. Let u = dm(ux). Then, after reading
the preﬁx of u of length j −1, we have that G
r
can be in some conﬁguration
γ and can both follow transition t
j
and t
′
j
while reading the jth symbol of u.
Hence, G
r
is not deterministic.
Second, suppose r is weakly deterministic, but there exist words ¯ u, ¯ v, ¯ w
over Σ∪Γ, words α = β over Γ, and symbol a ∈ Σ such that ¯ uαa¯ v and ¯ uβa ¯ w
are correctly bracketed and in L(¯ r). Consequently, there exist words
¯
u,
¯
v,
and
¯
w and x such that
¯
uαx
¯
v and
¯
uβx
¯
w are correctly bracketed and in L(
¯
r).
Here, both words must use the same marked symbol x for a as r is weakly
deterministic. Then, by Lemma 66, there exist transition sequences σ
1
=
t
1
, . . . , t
n
, σ
2
= t
1
, . . . , t
′
n
such that brackseq(σ
1
) =
¯
uαx
¯
v and brackseq(σ
2
) =
¯
uβx
¯
w. Without loss of generality we can assume that α and β are chosen to
be maximal (i.e.
¯
u is either empty or ends with a marked symbol). But then,
let i = strip(
¯
u) and observe that word(t
i+1
) = αx and word(t
′
i+1
) = βx, and
thus as word(t
i+1
) = word(t
′
i+1
) also t
i+1
= t
′
i+1
. By reasoning exactly as in
the previous case, it then follows that G
r
is not deterministic.
Before proving the converse direction, we ﬁrst make two observations.
First, for any two transitions t and t
′
, with t = t
′
who share the same source
104 Deterministic Regular Expressions with Counting
state q
x
, we have that word(t) = word(t
′
). Indeed, observe that if the tar
get state of t is q
y
, and the target of t
′
is q
y
′ , that then word(t) = αy and
word(t
′
) = βy
′
, with α, β words over Γ. Hence, if y = y
′
, we immediately have
word(t) = word(t
′
). Otherwise, when y = y
′
, we know that t is computed
based on either (1) (x, y, null) ∈ follow(r) and t
′
on (x, y, c) ∈ follow(r) or (2)
(x, y, c) ∈ follow(r) and t
′
on (x, y, c
′
) ∈ follow(r), with c = c
′
. In both cases
it is easily seen that α = β. The second observation we make is that G
r
is
reduced, i.e., for every reachable conﬁguration γ, there is some string which
brings γ to an accepting conﬁguration. The latter is due to the upper bound
tests present in G
r
.
Now, assume that G
r
is not deterministic. We show that r is not strongly
deterministic. As G
r
is reduced and not deterministic there exist words u, v,
w and symbol a such that uav, uaw ∈ L(G
r
) witnessed by accepting transition
sequences σ
1
= t
1
, . . . , t
n
and σ
2
= t
′
1
, . . . , t
′
m
(for uav and uaw, respectively),
such that t
i
= t
′
i
, for all i ∈ [1, u], but t
u+1
= t
′
u+1
. Let q
x
be the target of
transition t
u+1
and q
y
the target of transition t
′
u+1
, and x and y their asso
ciated marked symbols. Note that dm(x) = dm(y) = a. We now distinguish
two cases.
First, assume x = y. By Lemma 66, both run(σ
1
) ∈ L(r) and run(σ
2
) ∈
L(r). Writing run(σ
1
) = uxv and run(σ
2
) = uyw (such that u = u, v = v
and w = w) we then obtain the strings u, v, w and symbols x, y suﬃcient
to show that r is not weakly deterministic, and hence also not strongly deter
ministic.
Second, assume x = y. We then consider brackseq(σ
1
) and brackseq(σ
2
)
which, by Lemma 66, are both correctly bracketed words in L(
¯
r). Further,
note that t
u+1
and t
′
u+1
share the same source state, and hence, by the ﬁrst
observation above, word(t
u+1
) = αx = word(t
′
u+1
) = βy. In particular, as
x = y, it holds that α = β. But then, writing brackseq(σ
1
) =
¯
uαx
¯
v, and
brackseq(σ
2
) =
¯
uαy
¯
w, we can see that dm(
¯
u), dm(
¯
v), dm(
¯
w), α, β, and a,
violate the condition in Deﬁnition 52 and hence show that r is not strongly
deterministic.
6.6 Testing Strong Determinism
Deﬁnition 52, deﬁning strong determinism, is of a semantic nature. Therefore,
we provide Algorithm 1 for testing whether a given expression is strongly deter
ministic, which runs in cubic time. To decide weak determinism, Kilpel¨ainen
and Tuhkanen [66] give a cubic algorithm for RE(#), while Br¨ uggemann
Klein [15] gives a quadratic algorithm for RE by computing its Glushkov
6.6. Testing Strong Determinism 105
Algorithm 1 isStrongDeterministic. Returns true if r is strong deter
ministic, false otherwise.
r ← marked version of r
2: Initialize Follow ← ∅
Compute ﬁrst(s), last(s), for all subexpressions s of r
4: if ∃x, y ∈ ﬁrst(r) with x = y and dm(x) = dm(y) then return false
for each subexpression s of r, in bottomup fashion do
6: if s = s
1
s
2
then
if last(s
1
) = ∅ and ∃x, y ∈ ﬁrst(s
1
) with x = y and dm(x) = dm(y)
then return false
8: F ← {(x, dm(y))  x ∈ last(s
1
), y ∈ ﬁrst(s
2
)}
else if s = s
[k,ℓ]
1
, with ℓ ≥ 2 then
10: if ∃x, y ∈ ﬁrst(s
1
) with x = y and dm(x) = dm(y) then return
false
F ← {(x, dm(y))  x ∈ last(s
1
), y ∈ ﬁrst(s
1
)}
12: if F ∩ Follow = ∅ then return false
if s = s
1
s
2
or s = s
k,ℓ
1
, with ℓ ≥ 2 and k < ℓ then
14: Follow ← Follow ⊎ F
return true
automaton and testing whether it is deterministic
5
.
We next show that Algorithm 1 is correct. Recall that we write s r
when s is a subexpression of r and s ≺ r when s r and s = r.
Theorem 68. For any r ∈ RE(#), isStrongDeterministic (r) returns true
if and only if r is strongly deterministic. Moreover, it runs in time O(r
3
).
Proof. Let r ∈ RE(#), r a marking of r, and G
r
the corresponding Glushkov
counting automaton. By Theorem 65 it suﬃces to show that the algorithm
isStrongDeterministic (r) returns true if and only if G
r
is deterministic.
Thereto, we ﬁrst extract from Algorithm 1 the reasons for isStrongDeter
ministic (r) to return false. This is the case if and only if there exist marked
symbols x, y, y
′
, with dm(y) = dm(y
′
), such that either
1. y, y
′
∈ ﬁrst(s) and y = y
′
(Line 4)
2. (x, y, c) ∈ follow(s), (x, y
′
, c) ∈ follow(s), y = y
′
and upper(c) ≥ 2
(Line 10)
5
There sometimes is some confusion about this result: Computing the Glushkov automa
ton is quadratic in the expression, while linear in the output automaton (consider, e.g.,
(a
1
+· · · +a
n
)(a
1
+· · · +a
n
)). Only when the alphabet is ﬁxed, is the Glushkov automaton
of a deterministic expression of size linear in the expression.
106 Deterministic Regular Expressions with Counting
3. (x, y, c) ∈ follow(s), (x, y
′
, c
′
) ∈ follow(s), c ≺ c
′
, and upper(c) ≥ 2,
upper(c
′
) ≥ 2, and upper(c) > lower(c) (Line 12)
4. (x, y, null) ∈ follow(s), (x, y
′
, c
′
) ∈ follow(s), lca(x, y) ≺ c
′
and upper(c
′
) ≥
2 (Line 12)
5. (x, y, c) ∈ follow(s), (x, y
′
, null) ∈ follow(s), c ≺ lca(x, y
′
), upper(c) ≥ 2,
and upper(c) > lower(c) (Line 12)
6. (x, y, null) ∈ follow(s), (x, y
′
, null) ∈ follow(s), y = y
′
and lca(x, y) =
lca(x, y
′
) (Line 7)
7. (x, y, null) ∈ follow(s), (x, y
′
, null) ∈ follow(s), and lca(x, y) ≺ lca(x, y
′
)
(Line 12)
We now show that G
s
is not deterministic if and only if one of the above
seven conditions holds. We ﬁrst verify the right to left direction by investigat
ing the diﬀerent cases.
Suppose case 1 holds, i.e., y, y
′
∈ ﬁrst(s) and y = y
′
. Then, there is a
transition from q
0
to q
y
and one from q
0
to q
y
′ , with q
y
= q
y
′ . These transitions
can both be followed when the ﬁrst symbol in a string is dm(y). Hence, G
s
is
not deterministic.
In each of the six remaining there are always two distinct tuples in follow(s)
which generate distinct transitions with as source state q
x
and target states
q
y
and q
y
′ , and dm(y) = dm(y
′
). Therefore, it suﬃces, in each of the cases,
to construct a reachable conﬁguration γ = (q
x
, α) from which both transitions
can be followed by reading dm(y). The reachability of this conﬁguration γ will
always follow from Lemma 69, but its precise form depends on the particular
case we are in. In each of the following cases, let iterators(x) = [c
1
, . . . , c
n
]
and, when applicable, set c = c
i
and c
′
= c
j
. When both c and c
′
occur, we
always have c ≺ c
′
, and hence i < j.
Case 2: Set γ = (q
x
, α) with α(cv(c
m
)) = lower(c
m
), for all m ∈ [1, i −1],
and α(cv(c
m
)) = 1, for all m ∈ [i, m]. We need to show that the transitions
generated by (x, y, c) and (x, y
′
, c) can both be followed from γ. This is due to
the fact that α = valuetest
[c
1
,...,c
i−1
]
and α = upperboundtest
[c
i
]
. The latter
because upper(c
i
) ≥ 2 > α(cv(c
i
)).
Case 3: Set γ = (q
x
, α) with α(cv(c
m
)) = lower(c
m
), for all m ∈ [1, j −1],
and α(cv(c
m
)) = 1, for all m ∈ [j, n]. To see that the transition gener
ated by (x, y, c) can be followed, note that again α = valuetest
[c
1
,...,c
i−1
]
and α = upperboundtest
[c
i
]
. The latter due to the fact that upper(c) >
lower(c) = α(cv(c
i
)). On the other hand, also α = valuetest
[c
1
,...,c
j−1
]
and
α = upperboundtest
[c
j
]
as upper(c
′
) ≥ 2 > α(cv(c
j
)). Hence, G
s
can also
follow the transition generated by (x, y
′
, c
′
).
6.6. Testing Strong Determinism 107
Case 4: Set γ = (q
x
, α) with α(cv(c
m
)) = lower(c
m
), for all m ∈ [1, j −1],
and α(cv(c
m
)) = 1, for all m ∈ [j, n]. Arguing as above, and using the facts
that (1) upper(c
′
) ≥ 2 and (2) iterators(x, y) = [c
1
, . . . , c
i
′ ], for some i
′
< j (as
lca(x, y) ≺ c
′
); we can deduce that G
s
can again follow both transitions.
Cases 5, 6, and 7: In each of these cases we can set γ = (q
x
, α) with
α(cv(c
m
)) = lower(c
m
), for all m ∈ [1, n]. Arguing as before it can be seen
that in each case both transitions can be followed from this conﬁguration.
We next turn to the left to right direction. Assume G
s
is not deterministic
and let γ be a reachable conﬁguration such that two distinct transitions t
1
, t
2
can be followed by reading a symbol a. We argue that in this case the condi
tions for one of the seven cases above must be fulﬁlled. First, if γ = (q
0
, α)
then t
1
and t
2
must be the initial transitions of a run, as there are no transi
tions returning to q
0
. As, furthermore, q
0
has exactly one transition to each
state q
x
, when x ∈ ﬁrst(s), we can see that the conditions in case 1 hold.
Therefore, assume γ = (q
x
, α), with x ∈ Char(r). We will investigate the
tuples in follow(s) which generated t
1
and t
2
and show that they fall into
one of the six remaining cases mentioned above, hence forcing isStrongDe
terministic (s) to return false. There are indeed six possibilities, ignoring
symmetries, which we immediately classify to the case they will belong to:
2. t
1
is generated by (x, y, c) and t
2
by (x, y
′
, c),
3. t
1
is generated by (x, y, c) and t
2
by (x, y
′
, c
′
) with c ≺ c
′
.
4. t
1
is generated by (x, y, null) and t
2
by (x, y
′
, c
′
) with lca(x, y) ≺ c
′
.
5. t
1
is generated by (x, y, c) and t
2
by (x, y
′
, null) and c ≺ lca(x, y).
6. t
1
is generated by (x, y, null) and t
2
by (x, y
′
, null) with lca(x, y) =
lca(x, y
′
).
7. t
1
is generated by (x, y, null) and t
2
by (x, y
′
, null) with lca(x, y) ≺
lca(x, y
′
).
Note that all subexpressions under consideration (i.e. lca(x, y), lca(x, y
′
), c,
and c
′
) contain x, and that lca(x, y) and lca(x, y
′
) are always subexpressions
whose topmost operator is a concatenation. These are the reasons why all
these expressions are in a subexpression relation, and why we never have, for
instance, lca(x, y) = c.
However, these six cases only give us the diﬀerent possibilities of how the
transitions t
1
and t
2
can be generated by tuples in follow(s). We still need to
argue that the additional conditions imposed by the cases mentioned above
apply. Thereto, we ﬁrst note that for all iterators c and c
′
under consideration,
108 Deterministic Regular Expressions with Counting
upper(c) ≥ 2 and upper(c
′
) ≥ 2 must surely hold. Indeed, suppose for instance
upper(c) = 1. Then, for any transition t generated by (x, y, c) the guard φ
contains the condition upperboundtest
c
:= cv(c) < 1, which can never be
true. Hence, such a transition t can never be followed and is not relevant.
This already shows that possibilities 4 and 7 above, indeed imply cases 4 and
7, respectively. For the additional cases, we argue on a case by case basis.
Cases 2 and 6: We additionally need to show y = y
′
, which is immediate
from the fact that t
1
= t
2
. Indeed, assuming y = y
′
, implies that t
1
and t
2
are
generated by the same tuple in follow(s) and are hence equal.
Cases 3 and 5: We need to show upper(c) > lower(c). In both cases, the
guard of transition t
1
contains the condition cv(c) < upper(c) as upperbound
test, whereas the guard of transition t
2
contains the condition cv(c) ≥ lower(c).
These can only simultaneously be true when upper(c) > lower(c).
This settles the correctness of the algorithm. We conclude by arguing that
the algorithm runs in time O(r
3
). Computing the ﬁrst and last sets for each
subexpression s of r can easily be done in time O(r
3
) as can the test on Line 4.
Further, the for loop iterates over a linear number of nodes in the parse tree of
r. To do each iteration of the loop in quadratic time, one needs to implement
the set Follow as a twodimensional boolean table. In each iteration we then
need to do at most a quadratic number of (constant time) lookups and writes
to the table. Altogether this yields a quadratic algorithm.
It only remains to prove the following lemma:
Lemma 69. Let x ∈ Char(r), and iterators(x) = [c
1
, . . . , c
n
]. Let γ = (q
x
, α),
be a conﬁguration such that α(cv(c
i
)) ∈ [1, upper(c
i
)], for i ∈ [1, n], and
α(cv(c)) = 1, for all other countervariables c. Then, γ is reachable in G
r
.
Proof. This lemma can easily be proved by induction on the structure of r.
However, as this is a bit tedious, we only provide some intuition by construct
ing, given such a conﬁguration γ, a string w which brings G
s
from its initial
conﬁguration to γ.
We construct w by concatenating several substrings. Thereto, for every
i ∈ [1, n], let v
i
be a nonempty string in L(base(c
i
)). Concatenating such a
string v
i
with itself allows to iterate c
i
. We further deﬁne, for every i ∈ [1, n], a
marked string w
i
which, intuitively, connects the diﬀerent iterators. Thereto,
let w
n
be a minimal (w.r.t. length) string such that w
n
∈ (Char(s)\Char(c
n
))
∗
and such that there exist u ∈ Char(c
n
)
∗
and v ∈ Char(s)
∗
such that w
n
uxv ∈
L(s). Similarly, for any i ∈ [1, n−1], let w
i
be a minimal (w.r.t. length) string
such that w
i
∈ (Char(c
i+1
) \ Char(c
i
))
∗
and there exist u ∈ Char(c
i
)
∗
and
v ∈ Char(c
i+1
)
∗
such that w
n
uxv ∈ L(c
i+1
). Finally, let w
0
be a string such
that there exists a u such that w
0
xu ∈ L(base(c
1
)). We require these strings
6.7. Decision Problems for RE(#) 109
w
1
to w
n
to be minimal to be sure that they do not allow to iterate over their
corresponding counter.
Then, the desired string w is
dm(w
n
)(v
n
)
α(cv(c
n
))
dm(w
n−1
)(v
n−1
)
α(cv(c
n−1
))
· · · dm(w
0
)dm(x) .
6.7 Decision Problems for RE(#)
We recall the following complexity results:
Theorem 70. 1. inclusion and equivalence for RE(#) are expspace
complete [83], intersection for RE(#) is pspacecomplete (Chapter 5)
2. inclusion and equivalence for DRE
W
are in ptime, intersection
for DRE
W
is pspacecomplete [75].
3. inclusion for DRE
#
W
is conphard [64].
By combining (1) and (2) of Theorem 70 we get the complexity of inter
section for DRE
#
W
and DRE
#
S
. This is not the case for the inclusion and
equivalence problem, unfortunately. By using the results of the previous
sections we can, for DRE
#
S
, give a pspace upperbound for both problems,
however.
Theorem 71. (1) equivalence and inclusion for DRE
#
S
are in pspace.
(2) intersection for DRE
#
W
and DRE
#
S
is pspacecomplete.
Proof. (1) We show that inclusion for DRE
#
S
is in pspace. Given two
strongly deterministic RE(#) expressions r
1
, r
2
, we construct two CDFAs
A
1
, A
2
for r
1
and r
2
using the construction of section 6.5, which by Theo
rem 65 are indeed deterministic. Then, we construct the CDFAs A
′
2
, A such
that A
′
2
accepts the complement of A
2
, and A is the intersection of A
1
and A
′
2
.
This can all be done in polynomial time and preserves determinism according
to Theorem 64. Finally, L(r
1
) ⊆ L(r
2
) if and only if L(A) = ∅, which can be
decided using only polynomial space by Theorem 64(1).
(2) The result for DRE
#
W
is immediate from Theorem 70(1) and (2). This
result also carries over to DRE
#
S
. For the upper bound this is trivial, whereas
the lower bound follows from the fact that standard weakly deterministic reg
ular expressions can be transformed in linear time into equivalent strongly
deterministic expressions [15].
110 Deterministic Regular Expressions with Counting
6.8 Discussion
We investigated and compared the notions of strong and weak determinism
for regular expressions in the presence of counting. Weakly deterministic ex
pressions have the advantage of being more expressive and more succinct than
strongly deterministic ones. However, strongly deterministic expressions are
expressivily equivalent to standard deterministic expressions, a class of lan
guages much better understood than the weakly deterministic languages with
counting. Moreover, strongly deterministic expressions are conceptually sim
pler (as strong determinism does not depend on intricate interplays of the
counter values) and correspond naturally to deterministic Glushkov automata.
The latter also makes strongly deterministic expressions easier to handle as
witnessed by the pspace upperbound for inclusion and equivalence, whereas
for weakly deterministic expressions only a trivial expspace upperbound is
known. For these reasons, one might wonder if the weak determinism de
manded in the current standards for XML Schema should not be replaced by
strong determinism. The answer to some of the following open questions can
shed more light on this issue:
• Is it decidable if a language is deﬁnable by a weakly deterministic expres
sion with counting? For standard (weakly) deterministic expressions, a
decision algorithm is given by BruggemannKlein and Wood [17] which,
by our results, also carries over to strongly deterministic expressions with
counting.
• Can the Glushkov construction given in Section 6.5 be extended such
that it translates any weakly deterministic expression with counting into
a CDFA? Kilpelainen and Tuhkanen [65] have given a Glushkovlike con
struction translating weakly deterministic expressions into a determinis
tic automata model. However, this automaton model is mostly designed
for doing membership testing and complexity bounds can not directly
be derived from it, due to the lack of upperbounds on the counters.
• What are the exact complexity bounds for inclusion and equivalence of
strongly and weakly deterministic expressions with counting? For strong
determinism, this can still be anything between ptime and pspace,
whereas for weak determinism, conp and pspace are the most likely,
although expspace also remains possible.
Part II
Applications to XML Schema
Languages
111
7
Succinctness of PatternBased
Schema Languages
In formal language theoretic terms, an XML schema deﬁnes a tree language.
The for historical reasons still widespread Document Type Deﬁnitions (DTDs)
can then be seen as contextfree grammars with regular expressions at right
hand sides which deﬁne the local tree languages [16]. XML Schema [100]
extends the expressiveness of DTDs by a typing mechanism allowing content
models to depend on the type rather than only on the label of the parent.
Unrestricted application of such typing leads to the robust class of unranked
regular tree languages [16] as embodied in the XML schema language Relax
NG [22]. The latter language is commonly abstracted in the literature by
extended DTDs (EDTDs) [88]. The Element Declarations Consistent con
straint in the XML Schema speciﬁcation, however, restricts this typing: it
forbids the occurrence of diﬀerent types of the same element in the same
content model. Murata et al. [85] therefore abstracted XSDs by singletype
EDTDs. Martens et al. [77] subsequently characterized the expressiveness of
singletype EDTDs in several syntactic and semantic ways. Among them,
they deﬁned an extension of DTDs equivalent in expressiveness to singletype
EDTDs: ancestorguarded DTDs. An advantage of this language is that it
makes the expressiveness of XSDs more apparent: the content model of an
element can only depend on regular string properties of the string formed by
the ancestors of that element. Ancestorbased DTDs can therefore be used
as a typefree frontend for XML Schema. As they can be interpreted both
114 Succinctness of PatternBased Schema Languages
other semantics EDTD EDTD
st
DTD
P
∃
(Reg) 2exp (83(1)) exp (83(2)) exp (83(3)) exp* (83(5))
P
∀
(Reg) 2exp (83(6)) 2exp (83(7)) 2exp (83(8)) 2exp (83(10))
P
∃
(Lin) \ (85(1)) exp (85(2)) exp (85(3)) exp* (85(5))
P
∀
(Lin) \ (85(6)) 2exp (85(7)) 2exp (85(8)) 2exp (85(10))
P
∃
(SLin) poly (89(1)) poly (89(2)) poly (89(3)) poly (89(6))
P
∀
(SLin) poly (89(7)) poly (89(8)) poly (89(9)) poly (89(12))
P
∃
(DetSLin) poly (89(1)) poly (89(2)) poly (89(3)) poly (89(6))
P
∀
(DetSLin) poly (89(7)) poly (89(8)) poly (89(9)) poly (89(12))
Table 7.1: Overview of complexity results for translating patternbased
schemas into other schema formalisms. For all nonpolynomial complexities,
except the ones marked with a star, there exist examples matching this upper
bound. Theorem numbers are given between brackets.
in an existential and universal way, we study in this chapter the complexity
of translating between the two semantics and into the formalisms of DTDs,
EDTDs, and singletype EDTDs.
In the remainder of the chapter, we use the name patternbased schema,
rather than ancestorbased DTD, as it emphasizes the dependence on a par
ticular pattern language. A patternbased schema is a set of rules of the form
(r, s), where r and s are regular expressions. An XML tree is then existentially
valid with respect to a rule set if for each node there is a rule such that the
path from the root to that node matches r and the child sequence matches s.
Furthermore, it is universally valid if each node vertically matching r, horizon
tally matches s. The existential semantics is exhaustive, fully specifying every
allowed combination, and more DTDlike, whereas the universal semantics is
more liberal, enforcing constraints only where necessary.
Kasneci and Schwentick [62] studied the complexity of the satisﬁability and
inclusion problem for patternbased schemas under the existential (∃) and uni
versal (∀) semantics. They considered regular (Reg), linear (Lin), and strongly
linear (SLin) patterns. These correspond to the regular expressions, XPath
expressions with only child (/) and descendant (//), and XPath expressions
of the form //w or /w, respectively. Deterministic strongly linear (DetSLin)
patterns are strongly linear patterns in which additionally all horizontal ex
pressions s are required to be deterministic [17]. A snapshot of their results is
given in the third and fourth column of Table 7.2. These results indicate that
there is no diﬀerence between the existential and universal semantics.
We, however, show that with respect to succinctness there is a huge diﬀer
ence. Our results are summarized in Table 7.1. Both for the pattern languages
7.1. Preliminaries and Basic Results 115
simplification satisfiability inclusion
P
∃
(Reg) exptime (83(4)) exptime [62] exptime [62]
P
∀
(Reg) exptime (83(9)) exptime [62] exptime [62]
P
∃
(Lin) pspace (85(4)) pspace [62] pspace [62]
P
∀
(Lin) pspace (85(9)) pspace [62] pspace [62]
P
∃
(SLin) pspace (89(4)) pspace [62] pspace [62]
P
∀
(SLin) pspace (89(10)) pspace [62] pspace [62]
P
∃
(DetSLin) in ptime (89(5)) in ptime [62] in ptime [62]
P
∀
(DetSLin) in ptime (89(11)) in ptime [62] in ptime [62]
Table 7.2: Overview of complexity results for patternbased schemas. All re
sults, unless indicated otherwise, are completeness results. Theorem numbers
for the new results are given between brackets.
Reg and Lin, the universal semantics is exponentially more succinct than the
existential one when translating into (singletype) extended DTDs and ordi
nary DTDs. Furthermore, our results show that the general class of pattern
based schemas is illsuited to serve as a frontend for XML Schema due to
the inherent exponential or double exponential size increase after translation.
Only when resorting to SLin patterns, there are translations only requiring
polynomial size increase. Fortunately, the practical study in [77] shows that
the sort of typing used in XSDs occurring in practice can be described by such
patterns. Our results further show that the expressive power of the existential
and the universal semantics coincide for Reg and SLin, albeit a translation
can not avoid a double exponential size increase in general in the former case.
For linear patterns the expressiveness is incomparable. Finally, as listed in
Table 7.2, we study the complexity of the simpliﬁcation problem: given a
patternbased schema, is it equivalent to a DTD?
Outline. In Section 7.1, we introduce some additional deﬁnitions concerning
schema languages, and patternbased schemas. In Section 7.2, 7.3, and
7.4, we study patternbased schemas with regular, linear, and strongly linear
expressions, respectively.
7.1 Preliminaries and Basic Results
In this section, we introduce patternbased schemas, an alternative charac
terization of singletype EDTDs, recall some known theorems, and introduce
some additional terminology for describing succinctness results.
116 Succinctness of PatternBased Schema Languages
7.1.1 PatternBased Schema Languages
We recycle the following deﬁnitions from [62].
Deﬁnition 72. A patternbased schema P is a set {(r
1
, s
1
), . . . , (r
m
, s
m
)}
where all r
i
, s
i
are regular expressions.
Each pair (r
i
, s
i
) of a patternbased schema represents a schema rule. We
also refer to the r
i
and s
i
as the vertical and horizontal regular expressions,
respectively. There are two semantics for patternbased schemas.
Deﬁnition 73. A tree t is existentially valid with respect to a patternbased
schema P if, for every node v of t, there is a rule (r, s) ∈ P such that
ancstr(v) ∈ L(r) and chstr(v) ∈ L(s). In this case, we write P =
∃
t.
Deﬁnition 74. A tree t is universally valid with respect to a patternbased
schema P if, for every node v of t, and each rule (r, s) ∈ P it holds that
ancstr(v) ∈ L(r) implies chstr(v) ∈ L(s). In this case, we write P =
∀
t.
Denote by P
∃
(t) = {v ∈ Dom(t)  ∃(r, s) ∈ P, ancstr(v) ∈ L(r)∧chstr(v) ∈
L(s)} the set of nodes in t that are existentially valid. Denote by P
∀
(t) = {v ∈
Dom(t)  ∀(r, s) ∈ P, ancstr(v) ∈ L(r) ⇒ chstr(v) ∈ L(s)} the set of nodes in
t that are universally valid.
We denote the set of Σtrees which are existentially and universally valid
with respect to P by T
Σ
∃
(P) and T
Σ
∀
(P), respectively. We often omit Σ if it
is clear from the context what the alphabet is.
When for every string w ∈ Σ
∗
there is a rule (r, s) ∈ P such that w ∈ L(r),
then we say that P is complete. Further, when for every pair (r, s), (r
′
, s
′
) ∈ P
of diﬀerent rules, L(r) ∩ L(r
′
) = ∅, then we say that P is disjoint.
In some proofs, we make use of unary trees, which can be represented as
strings. In this context, we abuse notation and write for instance w ∈ T
∃
(P)
meaning that the unary tree which w represents is existentially valid with
respect to P. Similarly, we refer to the last position of w as the leaf of w.
We conclude this section with two basic results on patternbased schemas.
Lemma 75. For a patternbased schema P, a tree t and a string w,
1. t ∈ T
∀
(P) if and only if for every node v of t, v ∈ P
∀
(t).
2. if w ∈ T
∀
(P) then for every preﬁx w
′
of w and every nonleaf node v of
w
′
, v ∈ P
∀
(w
′
).
3. t ∈ T
∃
(P) if and only if for every node v of t, v ∈ P
∃
(t).
4. if w ∈ T
∃
(P) then for every preﬁx w
′
of w and every nonleaf node v of
w
′
, v ∈ P
∃
(w
′
).
7.1. Preliminaries and Basic Results 117
Proof. (1,3) These are in fact just a restatement of the deﬁnition of universal
and existential satisfaction and are therefore trivially true.
(2) Consider any nonleaf node v
′
of w
′
. Since w
′
is a preﬁx of w, there
must be a node v of w such that ancstr
w
(v) = ancstr
w
′
(v
′
) and chstr
w
(v) =
chstr
w
′
(v
′
). By Lemma 75(1), v ∈ P
∀
(w) and thus v
′
∈ P
∀
(w).
(4) The proof of (2) carries over literally for the existential semantics.
Lemma 76. For any complete and disjoint patternbased schema P, T
∃
(P) =
T
∀
(P).
Proof. We show that if P is complete and disjoint, then for any node v of
any tree t, v ∈ P
∃
(t) if and only if v ∈ P
∀
(t). The lemma then follows from
Lemma 75(1) and (3). First, suppose v ∈ P
∃
(t). Then, there is a rule (r, s) ∈ P
such that ancstr(v) ∈ L(r) and chstr(v) ∈ L(s), and by the disjointness of P,
ancstr(v) / ∈ L(r
′
) for any other vertical expression r
′
in P. It thus follows that
v ∈ P
∀
(t). Conversely, suppose v ∈ P
∀
(t). By the completeness of P there is
at least one rule (r, s) such that ancstr(v) ∈ L(r) and thus chstr(v) ∈ L(s).
It follows that v ∈ P
∃
(t).
7.1.2 Characterizations of (Extended) DTDs
We start by introducing another schema formalism equivalent to singletype
EDTDs. An automatonbased schema D over vocabulary Σ is a tuple (A, λ),
where A = (Q, q
0
, δ, F) is a DFA and λ is a function mapping states of A to
regular expressions. A tree t is accepted by D if for every node v of t, where
q ∈ Q is the state such that q
0
⇒
A,ancstr(v)
q, chstr(v) ∈ L(λ(q)). Because
the set of ﬁnal states F of A is not used, we often omit F and represent A as a
triple (Q, q
0
, δ). Automatonbased schemas are implicit in [77] and introduced
explicitly in [76].
Remark 77. Because DTDs and EDTDs only deﬁne tree languages in which
every tree has the same root element, we implicitly assume that this is also
the case for automatonbased schemas and the patternbased schemas deﬁned
next. Whenever we translate among patternbased schemas, we drop this as
sumption. Obviously, this does not inﬂuence any of the results of this chapter.
Lemma 78. Any automatonbased schema D can be translated into an equiv
alent singletype EDTD D
′
in time at most quadratic in the size of D, and
vice versa.
Proof. Let D = (A, λ), with A = (Q, q
0
, δ), be an automatonbased schema.
We start by making A complete. That is, we add a sink state q to Q and
for every pair q ∈ Q, a ∈ Σ, for which there is no transition (q, a, q
′
) ∈ δ,
118 Succinctness of PatternBased Schema Languages
t
1
v
1
∈ T
t
2
v
2
∈ T
⇒
t
1
v
2
∈ T
Figure 7.1: Closure under labelguarded subtree exchange
we add (q, a, q ) to δ. Further, λ(q ) = ∅. Construct D
′
= (Σ, Σ
′
, d, s
i
, µ)
as follows. Let s
i
be such that s is the root symbol of any tree deﬁned by
D and (q
0
, s, q
i
) ∈ δ. Let Q ∪ {q } = {q
0
, . . . , q
n
} for some n ∈ N, then
Σ
′
= {a
i
 a ∈ Σ ∧ q
i
∈ Q} and µ(a
i
) = a. Finally, d(a
i
) = λ(q
i
), where any
symbol a ∈ Σ is replaced by a
j
when (q
i
, a, q
j
) ∈ δ. Since A is complete, a
j
is guaranteed to exist and since A is a DFA a
j
is uniquely deﬁned. For the
time complexity of the algorithm, we see that the number of types in D
′
can
never be exceeded by the number of transitions in A. Then, to every type one
regular expression from D
′
is assigned which yields a quadratic algorithm.
Conversely, let D = (Σ, Σ
′
, d, s, µ) be a singletype EDTD. The equivalent
automatonbased schema D = (A, λ) with A = (Q, q
0
, δ) is constructed as
follows. Let Q = Σ
′
, q
0
= s, and for a
i
, b
j
∈ Σ
′
, (a
i
, b, b
j
) ∈ δ if µ(b
j
) = b and
b
j
occurs in d(a
i
). Note that since D is a singletype EDTD, A is guaranteed
to be deterministic. Finally, for any type a
i
∈ Σ
′
, λ(a
i
) = µ(d(a
i
)).
A regular tree language T is closed under labelguarded subtree exchange if
it has the following property: if two trees t
1
and t
2
are in T , and there are two
nodes v
1
in t
1
and v
2
in t
2
with the same label, then t
1
[v
1
← subtree
t
2
(v
2
)] is
also in T . This notion is graphically illustrated in Figure 7.1.
Lemma 79 ([88]). A regular tree language is deﬁnable by a DTD if and only
if it is closed under labelguarded subtree exchange.
An EDTD D = (Σ, Σ
′
, d, s
d
, µ) is trimmed if for for every a
i
∈ Σ
′
, there
exists a tree t ∈ L(d) and a node u ∈ Dom(t) such that lab
t
(u) = a
i
.
Lemma 80 ([77]). 1. For every EDTD D, a trimmed EDTD D
′
, with
L(D) = L(D
′
), can be constructed in time polynomial in the size of
D.
2. Let D be a trimmed EDTD. For any type a
i
∈ Σ
′
and any string
w ∈ L(d(a
i
)) there exists a tree t ∈ L(d) which contains a node v with
lab
t
(v) = a
i
and chstr
t
(v) = w.
7.1. Preliminaries and Basic Results 119
7.1.3 Theorems
We use the following theorem of Glaister and Shallit [45].
Theorem 81 ([45]). Let L ⊆ Σ
∗
be a regular language and suppose there
exists a set of pairs M = {(x
i
, w
i
)  1 ≤ i ≤ n} such that
• x
i
w
i
∈ L for 1 ≤ i ≤ n; and
• x
i
w
j
/ ∈ L for 1 ≤ i, j ≤ n and i = j.
Then any NFA accepting L has at least n states.
We make use of the following results on transformations of regular expres
sions.
Theorem 82. 1. Let r
1
, . . . , r
n
, s
1
, . . . , s
m
be regular expressions. A regu
lar expression r, with L(r) =
¸
i≤n
L(r
i
)\
¸
i≤m
L(s
i
), can be constructed
in time double exponential in the sum of the sizes of all r
i
, s
j
, i ≤ n,
j ≤ m.
2. For any regular expressions r and alphabet ∆ ⊆ Σ, an expression r
−
,
such that L(r
−
) = L(r) ∩ ∆
∗
, can be constructed in time linear in the
size of r.
Proof. (1) First, for every i ≤ n, construct an NFA A
i
, such that L(r
i
) =
L(A
i
), which can be done in polynomial time according to Theorem 2(4).
Then, let A be the DFA accepting
¸
i≤n
L(A
i
) obtained from the A
i
by deter
minization followed by a product construction. For k the size of the largest
NFA, this can be done in time O(2
k·n
). For every i ≤ m, construct an NFA
B
i
, with L(s
i
) = B
i
, and let B
i
be the DFA accepting
¸
i≤m
L(B
i
) again ob
tained from the B
i
by means of determinization and a product construction.
Similarly, B can also be computed in time exponential in the size of the input.
Then, compute the DFA B
′
for the complement of B by making B complete
and exchanging ﬁnal and nonﬁnal states in B, which can be done in time
polynomial in the size of B. Then, the DFA C accepts L(A) ∩ L(B
′
) and
can again be obtained by a product construction on A and B
′
which requires
polynomial time in the sizes of A and B
′
. Therefore, C is of exponential size
in function of the input. Finally, r, with L(r) =
¸
i≤n
L(r
i
) \
¸
i≤m
L(s
i
) can
be obtained from C in time exponential in the size of C (Theorem 2(1)) and
thus yields a double exponential algorithm in total.
(2) The algorithm proceeds in two steps. First, replace every symbol a / ∈ ∆
in r by ∅. Then, use the following rewrite rules on subexpressions of r as often
as possible: ∅
∗
= ε, ∅s = s∅ = ∅, and ∅ + s = s + ∅ = s. This gives us r
−
which is equal to ∅ or does not contain ∅ at all, with L(r
−
) = L(r) ∩ ∆
∗
.
120 Succinctness of PatternBased Schema Languages
7.1.4 Succinctness
As the focus in this chapter is on succinctness results we introduce some ad
ditional notation to allow for a succinct notation of such results.
For a class S and S
′
of representations of schema languages, and F a class
of functions from N to N, we write S
F
→ S
′
if there is an f ∈ F such that for
every s ∈ S there is an s
′
∈ S
′
with L(s) = L(s
′
) which can be constructed in
time f(s). This also implies that s
′
 ≤ f(s). By L(s) we mean the set of
trees deﬁned by s.
We write S
F
⇒ S
′
if S
F
→ S
′
and there is an f ∈ F, a monotonically
increasing function g : N → N and an inﬁnite family of schemas s
n
∈ S with
s
n
 ≤ g(n) such that the smallest s
′
∈ S
′
with L(s) = L(s
′
) is at least of size
f(g(n)). By poly, exp and 2exp we denote the classes of functions
¸
k,c
cn
k
,
¸
k,c
c2
n
k
and
¸
k,c
c2
2
n
k
, respectively.
Further, we write S → S
′
if there exists an s ∈ S such that for every
s
′
∈ S
′
, L(s
′
) = L(s). In this case we also write S
F
→ S
′
and S
F
⇒ S
′
whenever
S
F
→ S
′
and S
F
⇒ S
′
, respectively, hold for those elements in S which do have
an equivalent element in S
′
.
7.2 Regular PatternBased Schemas
In this section, we study the full class of patternbased schemas which we de
note by P
∃
(Reg) and P
∀
(Reg). The results are shown in Theorem 83. Notice
that the translations among schemas with diﬀerent semantics, and the trans
lation from a patternbased schema under universal semantics to an EDTD
are double exponential, whereas the translation from a schema under existen
tial semantics to an EDTD is “only” exponential. Essentially all these double
exponential lower bounds are due to the fact that in these translations one
necessarily has to apply operations, such as intersection and complement, on
regular expressions, which, as shown in Chapter 3, yields double exponential
lower bounds. In the translation from a patternbased schema under existen
tial semantics to an EDTD such operations are not necessary which allows for
an easier translation.
Theorem 83. 1. P
∃
(Reg)
2exp
⇒ P
∀
(Reg);
2. P
∃
(Reg)
exp
⇒ EDTD;
3. P
∃
(Reg)
exp
⇒ EDTD
st
;
4. simplification for P
∃
(Reg) is exptimecomplete;
7.2. Regular PatternBased Schemas 121
5. P
∃
(Reg)
exp
→ DTD;
6. P
∀
(Reg)
2exp
⇒ P
∃
(Reg);
7. P
∀
(Reg)
2exp
⇒ EDTD;
8. P
∀
(Reg)
2exp
⇒ EDTD
st
;
9. simplification for P
∀
(Reg) is exptimecomplete; and,
10. P
∀
(Reg)
2exp
⇒ DTD.
Proof. (1) We ﬁrst show P
∃
(Reg)
2exp
→ P
∀
(Reg). Let P = {(r
1
, s
1
), . . . ,
(r
n
, s
n
)}. We show that we can construct a complete and disjoint pattern
based schema P
′
such that T
∃
(P) = T
∃
(P
′
) in time double exponential in the
size of P. By Lemma 76, T
∃
(P
′
) = T
∀
(P
′
) and thus T
∃
(P) = T
∀
(P
′
).
For any nonempty set C ⊆ {1, . . . , n}, denote by r
C
the regular expression
which deﬁnes the language
¸
i∈C
L(r
i
)\
¸
1≤i≤n,i / ∈C
L(r
i
) and by r
∅
the expres
sion deﬁning Σ
∗
\
¸
1≤i≤n
L(r
i
). That is, r
C
deﬁnes any word w which is deﬁned
by all vertical expressions contained in C but is not deﬁned by any vertical ex
pression not contained in C. Denote by s
C
the expression deﬁning the language
¸
i∈C
L(s
i
). Then, P
′
= {(r
∅
, ∅)} ∪ {(r
C
, s
C
)  C ⊆ {1, . . . , n} ∧ C = ∅}. Here,
P
′
is disjoint and complete. We show that T
∃
(P) = T
∃
(P
′
). By Lemma 75(3),
it suﬃces to prove that for any node v of any tree t, v ∈ P
∃
(t) if and only if
v ∈ P
′
∃
(t):
• v ∈ P
∃
(t) ⇒ v ∈ P
′
∃
(t): Let C = {i  ancstr(v) ∈ L(r
i
)}. Since
v ∈ P
∃
(t), C = ∅ and there is an i ∈ C with chstr(v) ∈ L(s
i
). But then,
by deﬁnition of r
C
and s
C
, ancstr(v) ∈ L(r
C
) and chstr(v) ∈ L(s
C
),
and thus v ∈ P
′
∃
(t).
• v ∈ P
′
∃
(t) ⇒ v ∈ P
∃
(t): Let C ⊆ {1, . . . , n} be the unique set for
which ancstr(v) ∈ L(r
C
) and chstr(v) ∈ L(s
C
), and choose some i ∈ C
for which chstr(v) ∈ L(s
i
). By deﬁnition of s
C
, such an i must exist.
Then, ancstr(v) ∈ L(r
i
) and chstr(v) ∈ L(s
i
), from which it follows
that v ∈ P
∃
(t).
We conclude by showing that P
′
can be constructed from P in time double
exponential in the size of P. By Theorem 82(1), the expressions r
C
can be
constructed in time double exponential in the size of the r
i
and s
i
. The ex
pressions s
C
can easily be constructed in linear time by taking the disjunction
122 Succinctness of PatternBased Schema Languages
of the right expressions. So, any rule (r
C
, s
C
) requires at most double expo
nential time to construct, and we must construct an exponential number of
these rules, which yields and algorithm of double exponential time complexity.
To show that P
∃
(Reg)
2exp
⇒ P
∀
(Reg), we slightly extend Theorem 15.
Lemma 84. For every n ∈ N, there is a regular expressions r
n
of size linear
in n such that any regular expression r deﬁning Σ
∗
\ L(r
n
) is of size at least
double exponential in r. Further, r
n
has the property that for any string
w / ∈ L(r
n
), there exists a string u such that wu ∈ L(r
n
).
Proof. Let n ∈ N. By Theorem 15, there exists a regular expression s
n
of
size linear in n over an alphabet Σ such that any regular expression deﬁning
Σ
∗
\ L(s
n
) must be of size at least double exponential in n. Let Σ
a
= Σ⊎{a}.
Deﬁne r
n
= s
n
+ Σ
∗
a
a as all strings which are deﬁned by s
n
or have a as last
symbol. First, note that r
n
satisﬁes the extra condition: for every w / ∈ L(r
n
),
wa ∈ L(r
n
). We show that any expression r deﬁning the complement of r
n
must be of size at least double exponential in n. This complement consists
of all strings which do not have a as last symbol and are not deﬁned by s
n
.
But then, the expression s which deﬁnes L(r) ∩Σ
∗
deﬁnes exactly L(s
n
) \ Σ
∗
,
the complement of L(s
n
). Furthermore, by Theorem 15, s must be of size at
least double exponential in n and by Theorem 82(2), s can be computed from
r in time linear in the size of r. It follows that r must also be of size at least
double exponential in n.
Now, let n ∈ N and let r
n
be a regular expression over Σ satisfying the
conditions of Lemma 84. Then, deﬁne P
n
= {(r
n
, ε), (Σ
∗
, Σ)}. Here, T
∃
(P
n
)
deﬁnes all unary trees w for which w ∈ L(r
n
).
Let P be a patternbased schema with T
∃
(P
n
) = T
∀
(P). Deﬁne U = {r 
(r, s) ∈ P ∧ ε / ∈ L(s)} as the set of vertical regular expressions in P whose
corresponding horizontal regular expression does not contain the empty string.
Finally, let r be the disjunction of all expressions in U. We now show that
L(r) = Σ
∗
\ L(r
n
), thereby proving that the size of P must be at least double
exponential in n.
First, let w / ∈ L(r
n
) and towards a contradiction suppose w / ∈ L(r). Then,
w / ∈ T
∃
(P
n
) = T
∀
(P). By Lemma 84, there exists a string u such that wu ∈
L(r
n
), and thus wu ∈ T
∃
(P
n
) by deﬁnition of P
n
and so wu ∈ T
∀
(P). By
Lemma 75(2), for every nonleaf node v of w, v ∈ P
∀
(w). As w is not deﬁned
by any expression in U, for any rule (r
′
, s
′
) ∈ P with w ∈ L(r
′
) it holds that
ε ∈ L(s
′
), and thus for the leaf node v of w, v ∈ P
∀
(w). So, by Lemma 75(1),
w ∈ T
∀
(P) which leads to the desired contradiction.
Conversely, suppose w ∈ L(r
′
), for some r
′
∈ U, and again towards a
contradiction suppose w ∈ L(r
n
). Then, w ∈ T
∃
(P) = T
∀
(P). But, since
7.2. Regular PatternBased Schemas 123
w ∈ L(r
′
), and by deﬁnition of U for the rule (r
′
, s
′
) in P it holds that ε / ∈ L(s
′
).
It follows that the leaf node v of w is not in P
∀
(w). Therefore, w / ∈ T
∀
(P) by
Lemma 75(1), which again gives us the desired contradiction. This concludes
the proof of Theorem 83(1).
(23) We ﬁrst show P
∃
(Reg)
exp
→ EDTD
st
, which implies P
∃
(Reg)
exp
→ EDTD.
Thereto, let P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}. We construct an automatonbased
schema D = (A, λ) such that L(D) = T
∃
(P). By Lemma 78, D can then be
translated into an equivalent singletype EDTD in polynomial time and the
theorem follows. First, construct for every r
i
a DFA A
i
= (Q
i
, q
i
, δ
i
, F
i
), such
that L(r
i
) = L(A
i
). Then, A = (Q
1
×· · · ×Q
n
, (q
1
, . . . , q
n
), δ) is the product
automaton for A
1
, . . . , A
n
. Finally, λ((q
1
, . . . , q
n
)) =
¸
i≤n,q
i
∈F
i
L(s
i
), and
λ((q
1
, . . . , q
n
)) = ∅ if none of the q
i
are accepting states for their automaton.
Here, if m is the size of the largest vertical expression in P, then A is of
size O(2
m·n
). Furthermore, an expression for
¸
i≤n,q
i
∈F
i
L(s
i
) is simply the
disjunction of these s
i
and can be constructed in linear time. Therefore, the
total construction can be carried out in exponential time.
Further, P
∃
(Reg)
exp
⇒ EDTD already holds for a restricted version of pattern
based schemas, which is shown in Theorem 85(2). The latter then implies
P
∃
(Reg)
exp
⇒ EDTD
st
.
(4) For the upper bound, we combine a number of results of Kasneci and
Schwentick [62] and Martens et. al [77]. In the following, an NTA(NFA) is a
nondeterministic tree automaton where the transition relation is represented
by an NFA. Recall also that DTD(NFA) is a DTD where content models are
deﬁned by NFAs.
Given a patternbased schema P, we ﬁrst construct an NTA(NFA) A
P
with L(A
P
) = T
∃
(P), which can be done in exponential time (Proposition 3.3
in [62]). Then, Martens et. al. [77] have shown that given any NTA(NFA) A
P
it is possible to construct, in time polynomial in the size of A
P
, a DTD(NFA)
D
P
such that L(A
P
) ⊆ L(D
P
) is always true and L(A
P
) = L(D
P
) holds if and
only if L(A
P
) is deﬁnable by a DTD. Summarizing, D
P
is of size exponential
in P, T
∃
(P) ⊆ L(D
P
) and T
∃
(P) is deﬁnable by a DTD if and only if T
∃
(P) =
L(D
P
).
Now, construct another NTA(NFA) A
¬P
which deﬁnes the complement of
T
∃
(P). This can again be done in exponential time (Proposition 3.3 in [62]).
Since T
∃
(P) ⊆ L(D
P
), T
∃
(P) = L(D
P
) if and only if L(D
P
) ∩ L(A
¬P
) = ∅.
Here, D
P
and A
¬P
are of size at most exponential in the size of P, and testing
the nonemptiness of their intersection can be done in time polynomial in the
size of D
P
and A
¬P
. This gives us an exptime algorithm overall.
For the lower bound, we reduce from satisfiability of patternbased
schemas, which is exptimecomplete [62]. Let P be a patternbased schema
124 Succinctness of PatternBased Schema Languages
over the alphabet Σ, deﬁne Σ
P
= {a, b, c, e} ⊎ Σ, and deﬁne the pattern
based schema P
′
= {(a, b + c), (ab, e), (ac, e), (abe, ε), (ace, ε)} ∪ {(acer, s) 
(r, s) ∈ P}. We show that T
∃
(P
′
) is deﬁnable by a DTD if and only if P is
not existentially satisﬁable. Since exptime is closed under complement, the
theorem follows.
If T
∃
(P) = ∅, then the following DTD d deﬁnes T
∃
(P
′
): d(a) = b + c,
d(b) = e, d(c) = e, d(e) = ε.
Conversely, if there exists some tree t ∈ T
∃
(P), suppose towards a con
tradiction that there exists a DTD D such that L(D) = T
∃
(P
′
). Then,
a(b(e)) ∈ L(D), and a(c(e(t))) ∈ L(D). Since every DTD is closed under
labelguarded subtree exchange (Lemma 79), a(b(e(t))) ∈ L(D) also holds,
but a(b(e(t))) / ∈ T
∃
(P
′
) which yields the desired contradiction.
(5) First, P
∃
(Reg) → DTD already holds for a restricted version of pattern
based schemas (Theorem 89(6)). We show P
∃
(Reg)
exp
→ DTD.
Simply translating the DTD(NFA), obtained in the previous proof, into a
normal DTD by means of state elimination would give us a double exponential
algorithm. Therefore, we use the following similar approach which does not
need to translate regular expressions into NFAs and back. First, construct
a singletype EDTD D
1
such that L(D
1
) = T
∃
(P). This can be done in
exponential time according to Theorem 83(3). Then, use the polynomial time
algorithm of Martens et al [77], to construct an equivalent DTD D. In this
algorithm, all expressions of D deﬁne unions of the language deﬁned by the
expressions in D
1
. This can, of course, be done by taking the disjunction of
expressions in D
1
. In total, D is constructed in exponential time.
(6) We ﬁrst show P
∀
(Reg)
2exp
→ P
∃
(Reg). We take the same approach as
in the proof of Theorem 83(1), but have to make some small changes. Let
P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}, and for any nonempty set C ⊆ {1, . . . , n} let r
C
be the regular expression deﬁning
¸
i∈C
L(r
i
)\
¸
1≤i≤n,i / ∈C
L(r
i
). Let r
∅
deﬁne
Σ
∗
\
¸
i≤n
L(r
i
) and let s
C
be the expression deﬁning the language
¸
i∈C
L(s
i
).
Deﬁne P
′
= {(r
∅
, Σ
∗
)} ∪ {(r
C
, s
C
 C ⊆ {1, . . . , n} ∧ C = ∅}. Here, P
′
is disjoint and complete and, by the same argumentation as in the proof of
Theorem 83(1), can be constructed in time double exponential in the size of
P
′
. So, by Lemma 76, T
∃
(P
′
) = T
∀
(P
′
). We show that T
∀
(P) = T
∀
(P
′
) from
which T
∀
(P) = T
∃
(P
′
) then follows. By Lemma 75(1), it suﬃces to prove that
for any node v of any tree t, v ∈ P
∀
(t) if and only if v ∈ P
′
∀
(t):
• v ∈ P
∀
(t) ⇒ v ∈ P
′
∀
(t): Let C = {i  ancstr(v) ∈ L(r
i
)}. If C = ∅, then
ancstr(v) ∈ L(r
∅
) and the horizontal regular expression Σ
∗
allows every
childstring. Because of the disjointness of P
′
no other vertical regular
expression in P
′
can deﬁne ancstr(v) and thus v ∈ P
′
∀
(t). If C = ∅,
7.2. Regular PatternBased Schemas 125
since v ∈ P
∀
(t), for all i ∈ C, chstr(v) ∈ L(s
i
). But then, by deﬁnition
of r
C
and s
C
, ancstr(v) ∈ L(r
C
) and chstr(v) ∈ L(s
C
), combined with
the disjointness of P
′
gives v ∈ P
′
∀
(t).
• v ∈ P
′
∀
(t) ⇒ v ∈ P
∀
(t): Let C ⊆ {1, . . . , n} be the unique set for
which (r
C
, s
C
) ∈ P
′
, ancstr(v) ∈ L(r
C
) and chstr(v) ∈ L(s
C
). Since
v ∈ P
′
∀
(t) and by the disjointness and completeness of P
′
there indeed
exists exactly one such set. If C = ∅, then ancstr(v) is not deﬁned by
any vertical expression in P and thus v ∈ P
∀
(t). If C = ∅, then for
all i ∈ C, ancstr(v) ∈ L(r
i
) and chstr(v) ∈ L(s
i
), and for all i / ∈ C,
ancstr(v) / ∈ L(r
i
). It follows that v ∈ P
∀
(t).
We now show that P
∀
(Reg)
2exp
⇒ P
∃
(Reg). Let n ∈ N. According to
Theorem 20(3), there exist a linear number of regular expressions r
1
, . . . , r
m
of size linear in n such that any regular expression deﬁning
¸
i≤m
L(r
i
) must
be of size at least double exponential in n. For brevity, deﬁne K =
¸
i≤m
L(r
i
).
Deﬁne P
n
over the alphabet Σ
a
= Σ ⊎ {a}, for a / ∈ Σ, as P
n
= {(a, r
i
) 
i ≤ m} ∪ {(ab, ε)  b ∈ Σ} ∪ {(b, ∅)  b ∈ Σ}. That is, T
∀
(P
n
) contains all trees
a(w), where w ∈ K.
Let P be a patternbased schema with T
∀
(P
n
) = T
∃
(P). For an expression
s, denote by s
−
the expression deﬁning all words in L(s) ∩ Σ
∗
. According
to Theorem 82(2), s
−
can be constructed from s in linear time. Deﬁne U =
{s
−
 (r, s) ∈ P ∧ a ∈ L(r)} as the set of horizontal regular expressions whose
corresponding vertical regular expressions contains the string a. Finally, let
r
K
be the disjunction of all expressions in U. We now show that L(r
K
) = K,
thereby proving that the size of P must be at least double exponential in n.
First, let w ∈ K. Then, t = a(w) ∈ T
∀
(P
n
) = T
∃
(P). Therefore, by
Lemma 75(3), the root node v of t is in P
∃
(t). It follows that there must be a
rule (r, s) ∈ P, with a ∈ L(r) and w ∈ L(s). Now w ∈ Σ
∗
implies w ∈ L(s
−
),
and thus, by deﬁnition of U and r
K
, w ∈ L(r
K
).
Conversely, suppose w ∈ L(s
−
) for some s
−
∈ U. We show that t = a(w) ∈
T
∃
(P) = T
∀
(P
n
), which implies that w ∈ K. By Lemma 75(3), it suﬃces to
show that every node v of t is in P
∃
(t). For the root node v of t, we know that
chstr(v) = w ∈ L(s
−
), and by deﬁnition of U, that ancstr(v) = a ∈ L(r),
where r is the corresponding vertical expression for s. Therefore, v ∈ P
∃
(t).
All other nodes v are leaf nodes with chstr(v) = ε and ancstr(v) = ab, where
b ∈ Σ since w ∈ L(s
−
). To show that any node with these child and ancestor
strings must be in P
∃
(t), note that for every symbol b ∈ Σ there exists a string
w
′
∈ K such that w
′
contains a b. Otherwise b is useless and can be removed
from Σ. Then, t
′
= a(w
′
) ∈ T
∀
(P
n
) = T
∃
(P) and thus there is a leaf node v
′
in t
′
for which ancstr(v
′
) = ab and chstr(v
′
) = ε. Since, by Lemma 75(3)
126 Succinctness of PatternBased Schema Languages
v
′
∈ P
∃
(t
′
), also any leaf node v of t with ancstr(v) = ab is in P
∃
(t). It follows
that t ∈ T
∃
(P) = T
∀
(P
n
).
(78) We ﬁrst show P
∀
(Reg)
2exp
→ EDTD
st
, which implies P
∀
(Reg)
2exp
→ EDTD.
Thereto, let P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}. We construct an automatonbased
schema D = (A, λ) such that L(D) = T
∀
(P). By Lemma 78, D can then be
translated into an equivalent singletype EDTD and the theorem follows. We
construct A in exactly the same manner as in the proof of Theorem 83(3). For
λ, let λ((q
1
, . . . , q
n
)) =
¸
i≤n,q
i
∈F
i
L(s
i
), and λ((q
1
, . . . , q
n
)) = Σ
∗
if none of the
q
i
are accepting states for their automaton. We already know that A can be
constructed in exponential time, and by Theorem 20(1) a regular expression
for λ((q
1
, . . . , q
n
)) =
¸
i≤n,q
i
∈F
i
L(s
i
) can be constructed in double exponential
time. It follows that the total construction can be done in double exponential
time.
Further, P
∀
(Reg)
2exp
⇒ EDTD already holds for a restricted version of
patternbased schemas, which is shown in Theorem 85(7). The latter implies
P
∀
(Reg)
2exp
⇒ EDTD
st
.
(9) The proof is along the same lines as that of Theorem 83(4).
(10) First, P
∀
(Reg) → DTD already holds for a restricted version of
patternbased schemas (Theorem 89(12)).
We ﬁrst show P
∀
(Reg)
2exp
→ DTD. Notice that the DTD(NFA) D con
structed in the above proof, conform the proof of Theorem 83(4), is con
structed in time exponential in the size of P. To obtain an actual DTD, we
only have to translate the NFAs in D into regular expressions, which can be
done in exponential time (Theorem 2(1)). This yields a total algorithm of
double exponential time complexity.
Finally, P
∀
(Reg)
2exp
⇒ DTD already holds for a more restricted version of
patternbased schemas, which is shown in Theorem 85(10).
7.3 Linear PatternBased Schemas
In this section, following [62], we restrict the vertical expressions to XPath
expressions using only descendant and child axes. For instance, an XPath
expression \\a\\b\c captures all nodes that are labeled with c, have b as
parent and have an a as ancestor. This corresponds to the regular expression
Σ
∗
aΣ
∗
bc.
Formally, we call an expression linear if it is of the formw
0
Σ
∗
· · · w
n−1
Σ
∗
w
n
,
with w
0
, w
n
∈ Σ
∗
, and w
i
∈ Σ
+
for 1 ≤ i < n. A patternbased schema is
linear if all its vertical expressions are linear. Denote the classes of linear
7.3. Linear PatternBased Schemas 127
schemas under existential and universal semantics by P
∃
(Lin) and P
∀
(Lin),
respectively.
Theorem 85 lists the results for linear schemas. The complexity of simpli
fication improves slightly, pspace instead of exptime. Further, we show
that the expressive power of linear schemas under existential and univer
sal semantics becomes incomparable, but that the complexity of translating
to DTDs and (singletype) EDTDs is in general not better than for regular
patternbased schemas.
Theorem 85. 1. P
∃
(Lin) → P
∀
(Lin);
2. P
∃
(Lin)
exp
⇒ EDTD;
3. P
∃
(Lin)
exp
⇒ EDTD
st
;
4. simplification for P
∃
(Lin) is pspacecomplete;
5. P
∃
(Lin)
exp
→ DTD;
6. P
∀
(Lin) → P
∃
(Lin);
7. P
∀
(Lin)
2exp
⇒ EDTD;
8. P
∀
(Lin)
2exp
⇒ EDTD
st
;
9. simplification for P
∀
(Lin) is pspacecomplete; and,
10. P
∀
(Lin)
2exp
⇒ DTD.
Proof. (1) We ﬁrst prove the following simple lemma. Given an alphabet Σ,
and a symbol b ∈ Σ, denote Σ \ {b} by Σ
b
.
Lemma 86. There does not exist a set of linear regular expression r
1
, . . . , r
n
such that
¸
1≤i≤n
L(r
i
) is an inﬁnite language and
¸
1≤i≤n
L(r
i
) ⊆ L(Σ
∗
b
).
Proof. Suppose to the contrary that such a list of linear expressions does
exist. Then, one of these expressions must contain Σ
∗
because otherwise
¸
1≤i≤n
L(r
i
) would be a ﬁnite language. However, if an expression contains
Σ
∗
, then it also deﬁnes words containing b, which gives us the desired contra
diction.
Now, let P = {(Σ
∗
bΣ
∗
, ε), (Σ
∗
, Σ)}. Then, T
∃
(P) deﬁnes all unary trees
containing at least one b. Suppose that P
′
is a linear schema such that T
∃
(P) =
T
∀
(P
′
). Deﬁne U = {r  (r, s) ∈ P
′
and ε / ∈ L(s)} as the set of all vertical
128 Succinctness of PatternBased Schema Languages
regular expressions in P
′
whose horizontal regular expressions do not contain
the empty string. We show that the union of the expressions in U deﬁnes an
inﬁnite language and is a subset of Σ
∗
b
, which by Lemma 86 proves that such
a schema P
′
can not exist.
First, to show that the union of these expressions deﬁnes an inﬁnite lan
guage, suppose that it does not. Then, every expression r ∈ U is of the form
r = w, for some string w. Let k be the length of the longest such string w.
Now, a
k+1
b ∈ T
∃
(P) = T
∀
(P
′
) and thus by Lemma 75(2) every nonleaf node
v of a
k+1
is in P
′
∀
(a
k+1
). Further, a
k+1
/ ∈ L(r) for all vertical expressions in U
and thus the leaf node of a
k+1
is also in P
′
∀
(a
k+1
). But then, by Lemma 75(1),
a
k+1
∈ T
∀
(P
′
) which leads to the desired contradiction.
Second, let w ∈ L(r), for some r ∈ U, we show w ∈ Σ
∗
b
. Towards a
contradiction, suppose w / ∈ Σ
∗
b
, which means that w contains at least one b and
thus w ∈ T
∃
(P) = T
∀
(P
′
). But then, for the leaf node v of w, ancstr(v) = w ∈
L(r), and by deﬁnition of U, chstr(v) = ε / ∈ L(s), where s is the corresponding
horizontal expression for r. Then, v / ∈ P
′
∀
(w) and thus by Lemma 75(1),
w / ∈ T
∀
(P
′
), which again gives the desired contradiction.
(23) First, P
∃
(Lin)
exp
→ EDTD
st
follows immediately from Theorem 83(3). We
show P
∃
(Lin)
exp
⇒ EDTD, which then implies both statements. Thereto, we
ﬁrst characterize the expressive power of EDTDs over unary tree languages.
Lemma 87. For any EDTD D for which L(D) is a unary tree language, there
exists an NFA A such that L(D) = L(A). Moreover, A can be computed from
D in time linear in the size of D.
Proof. Let D = (Σ, Σ
′
, d, s, µ) be an EDTD, such that L(D) is a unary tree
language. Then, deﬁne A = (Q, q
0
, δ, F) as Q = {q
0
} ∪ Σ
′
, δ = {(q
0
, s, s)} ∪
{(a, µ(b), b)  a, b ∈ Σ
′
∧ b ∈ L(d(a))}, and F = {a  a ∈ Σ
′
∧ ε ∈ d(a)}.
Now, let n ∈ N. Deﬁne Σ
n
= {$, #
1
, #
2
} ∪
¸
1≤i≤n
{a
0
i
, a
1
i
, b
0
i
, b
1
i
} and
K
n
= {#
1
a
i
1
1
a
i
2
2
· · · a
i
n
n
$b
i
1
1
b
i
2
2
· · · b
i
n
n
#
2
 i
k
∈ {0, 1}, 1 ≤ k ≤ n}. It is not hard
to see that any NFA deﬁning K
n
must be of size at least exponential in n.
Indeed deﬁne M = {(x, w)  xw ∈ K
n
∧x = n+1} which is of size exponential
in n, and satisﬁes the conditions of Theorem 81. Then, by Lemma 87, every
EDTD deﬁning the unary tree language K
n
must also be of size exponential
in n. We conclude the proof by giving a patternbased schema P
n
, such that
T
∃
(P
n
) = K
n
, which is of size linear in n. It contains the following rules:
• #
1
→ a
0
1
+a
1
1
• For any i < n:
– #
1
Σ
∗
a
0
i
→ a
0
i+1
+a
1
i+1
7.3. Linear PatternBased Schemas 129
– #
1
Σ
∗
a
1
i
→ a
0
i+1
+a
1
i+1
– #
1
Σ
∗
a
0
i
Σ
∗
b
0
i
→ b
0
i+1
+b
1
i+1
– #
1
Σ
∗
a
1
i
Σ
∗
b
1
i
→ b
0
i+1
+b
1
i+1
• #
1
Σ
∗
a
0
n
→ $
• #
1
Σ
∗
a
1
n
→ $
• #
1
Σ
∗
$ → b
0
1
+b
1
1
• #
1
Σ
∗
a
0
n
Σ
∗
b
0
n
→ #
2
• #
1
Σ
∗
a
1
n
Σ
∗
b
1
n
→ #
2
• #
1
Σ
∗
#
2
→ ε
(4) For the lower bound, we reduce from universality of regular expressions.
That is, deciding for a regular expression r whether L(r) = Σ
∗
. The latter
problem is known to be pspacecomplete [102]. Given r over alphabet Σ,
let Σ
P
= {a, b, c, d} ⊎ Σ, and deﬁne the patternbased schema P = {(a, b +
c), (ab, e), (ac, e), (abe, Σ
∗
), (ace, r)} ∪ {(abeσ, ε), (aceσ, ε)  σ ∈ Σ}. We show
that there exists a DTD D with L(D) = T
∃
(P) if and only if L(r) = Σ
∗
.
If L(r) = Σ
∗
, then the following DTD d deﬁnes T
∃
(P): d(a) = b + c,
d(b) = e, d(c) = e, d(e) = Σ
∗
, and d(σ) = ε for every σ ∈ Σ.
Conversely, if L(r) Σ
∗
, we show that T
∃
(P) is not closed under label
guarded subtree exchange. From Lemma 79, it then follows that T
∃
(P) is not
deﬁnable by a DTD. Let w, w
′
be strings such that w / ∈ L(r) and w
′
∈ L(r).
Then, a(b(e(w))) ∈ L(D), and a(c(e(w
′
))) ∈ L(D) but a(c(e(w))) / ∈ T
∃
(P).
For the upper bound, we again make use of the closure under labelguarded
subtree exchange property of DTDs. Observe that T
∃
(P), which is a regular
tree language, is not deﬁnable by any DTD if and only if there exist trees
t
1
, t
2
∈ T
∃
(P) and nodes v
1
and v
2
in t
1
and t
2
, respectively, with lab
t
1
(v
1
) =
lab
t
2
(v
2
), such that the tree t
3
= t
1
[v
1
← subtree
t
2
(v
2
)] is not in T
∃
(P). We
refer to such a tuple (t
1
, t
2
) as a witness to the DTDundeﬁnability of T
∃
(P),
or simply a witness tuple.
Lemma 88. If there exists a witness tuple (t
1
, t
2
) for a linear schema P, then
there also exists a witness tuple (t
′
1
, t
′
2
) for P, where t
′
1
and t
′
2
are of depth
polynomial in the size of P.
Proof. We make use of techniques introduced by Kasneci and Schwentick [62].
When P, P
′
are two linear schemas, they stated that if there exists a tree t
with t ∈ T
∃
(P) but t / ∈ T
∃
(P
′
), then there exists a tree t
′
of depth polynomial
with the same properties. In particular, they obtained the following property.
130 Succinctness of PatternBased Schema Languages
Let P be a linear patternbased schema and t a tree. Then, to every node
v of t, a vector F
t
P
(v) over N can be assigned with the following properties:
• along a path in a tree, F
t
P
(v) can take at most polynomially many values
in the size of P;
• if v
′
is a child of v, then F
t
P
(v
′
) can be computed from F
t
P
(v) and the
label of v
′
in t; and
• v ∈ P
∃
(t) can be decided solely on the value of F
t
P
(v) and chstr(v).
Based on these properties it is easy to see that if there exists a tree t which
existentially satisﬁes P, then there exists a tree t
′
of polynomial depth which
existentially satisﬁes P. Indeed, t
′
can be constructed from t by searching
for nodes v and v
′
of t such that v
′
is a descendant of v, lab
t
(v) = lab
t
(v
′
)
and F
t
P
(v) = F
t
P
(v
′
), and replacing the subtree rooted at v by the one rooted
at v
′
. By applying this rule as often as possible, we get a tree which is still
existentially valid with respect to P and where no two nodes on a path in the
tree have the same vector and label and which thus is of polynomial depth.
We will also use this technique, but have to be a bit more careful in the
replacements we carry out. Thereto, let (t
1
, t
2
) be a witness tuple for P
and ﬁx nodes v
1
and v
2
of t
1
and t
2
, respectively, such that t
3
, deﬁned as
t
1
[v
1
← subtree
t
2
(v
2
)], is not in T
∃
(P). Since t
3
/ ∈ T
∃
(P), by Lemma 75(3),
there must be some node v
3
of t
3
with v
3
/ ∈ P
∃
(t
3
). Furthermore, v
3
must
occur in the subtree under v
2
inherited from t
2
. Indeed, every node v not in
that subtree, has the same vector and childstring as its corresponding node
in t
1
, and since t
1
∈ T
∃
(P) also v ∈ P
∃
(t
1
) and thus v ∈ P
∃
(t
3
). So, ﬁx some
node v
3
, with v
3
/ ∈ P
∃
(t
3
), occurring in t
2
. Then, we can partition the trees t
1
and t
2
, and thereby also t
3
, in ﬁve diﬀerent parts as follows:
1. t
1
[v
1
← ()]: the tree t
1
without the subtree under v
1
;
2. subtree
t
1
(v
1
): the subtree under v
1
in t
1
;
3. t
2
[v
2
← ()]: the tree t
2
without the subtree under v
2
4. subtree
t
2
(v
2
)[v
3
← ()]: the subtree under v
2
in t
2
, without the subtree
under v
3
;
5. subtree
t
2
(v
3
): the subtree under v
3
in t
2
;
This situation is graphically illustrated in Figure 7.2.
Now, let t
′
1
and t
′
2
be the trees obtained from t
1
and t
2
by repeating the
following as often as possible: Search for two nodes v, v
′
such that v is an
ancestor of v
′
, v and v
′
are not equal to v
1
, v
2
or v
3
, v and v
′
occur in the same
7.3. Linear PatternBased Schemas 131
t
1
v
1
t
1
t
2
v
2
v
3
t
2
t
1
v
2
v
3
t
3
Figure 7.2: The ﬁve diﬀerent areas in t
1
and t
2
.
part of t
1
or t
2
, lab(v) = lab(v
′
) and F
t
1
P
(v) = F
t
1
P
(v
′
) (or F
t
2
P
(v) = F
t
2
P
(v
′
) if
v and v
′
both occur in t
2
). Then, replace v by the subtree under v
′
.
Observe that, by the properties of F, any path in one of the ﬁve parts
of t
′
1
and t
′
2
can have at most a polynomial depth, and thus t
′
1
and t
′
2
are
of at most a polynomial depth. Furthermore, t
′
1
, t
′
2
∈ T
∃
(P) still holds and
the original nodes v
1
, v
2
and v
3
still occur in t
′
1
and t
′
2
. Therefore, for t
′
3
=
t
′
1
[v
1
← subtree
t
′
2
(v
2
)], F
t
′
3
P
(v
3
) = F
t
3
P
(v
3
) and chstr
t
′
3
(v
3
) = chstr
t
3
(v
3
). But
then, v
3
/ ∈ P
∃
(t
′
3
), which by Lemma 75(3) gives us t
′
3
/ ∈ T
∃
(P). So, (t
′
1
, t
′
2
) is a
witness tuple in which t
′
1
and t
′
2
are of at most polynomial depth.
Now, using Lemma 88, we show that the problem is in pspace. We simply
guess a witness tuple (t
1
, t
2
) and check in polynomial space whether it is a
valid witness tuple. If it is, T
∃
(P) is not deﬁnable by a DTD. If T
∃
(P) is
deﬁnable by a DTD, there does not exist a witness tuple for P. Since pspace
is closed under complement, the theorem follows.
By Lemma 88, it suﬃces to guess trees of at most polynomial depth. There
fore, we guess t
1
and t
2
in depthﬁrst and lefttoright fashion, maintaining
for each tree and each level of the trees, the sets of states the appropriate
automata can be in. Here, t
1
and t
2
are guessed simultaneously and indepen
dently. That is, for each guessed symbol, we also guess whether it belongs to
t
1
or t
2
. At some point in this procedure, we guess that we are now at the
nodes v
1
and v
2
of t
1
and t
2
. From that point we maintain a third list of
states of automata, which are initiated by the values of these of t
1
, but the
subsequent subtree take the values of t
2
. If in the end, t
1
and t
2
are accepted,
but the third tree is not, then (t
1
, t
2
) is a valid witness for P.
(5) First, P
∃
(Lin) → DTD already holds for a restricted version of pattern
based schemas (Theorem 89(6)). Then, P
∃
(Lin)
exp
→ DTD follows immediately
from Theorem 83(5).
(6) Let Σ = {a, b, c} and deﬁne P = {(Σ
∗
bΣ
∗
c, b)}. Then, T
∀
(P) contains all
trees in which whenever a c labeled node v has a b labeled node as ancestor,
132 Succinctness of PatternBased Schema Languages
chstr(v) must be b. We show that any linear schema P
′
deﬁning all trees in
T
∀
(P) under existential semantics, must also deﬁne trees not in T
∀
(P).
Suppose there does exist a linear schema P
′
such that T
∀
(P) = T
∃
(P
′
).
Deﬁne w
ℓ
= a
ℓ
c for ℓ ≥ 1 and notice that w
ℓ
∈ T
∀
(P) = T
∃
(P
′
). Let (r, s) ∈ P
′
be a rule matching inﬁnitely many leaf nodes of the strings w
ℓ
. There must
be at least one as P
′
contains a ﬁnite number of rules. Then, ε ∈ L(s) must
hold and r is of one of the following forms:
1. a
n
1
Σ
∗
a
n
2
Σ
∗
· · · Σ
∗
a
n
k
c
2. a
n
1
Σ
∗
a
n
2
Σ
∗
· · · Σ
∗
a
n
k
cΣ
∗
3. a
n
1
Σ
∗
a
n
2
Σ
∗
· · · Σ
∗
a
n
k
Σ
∗
where k ≥ 2 and n
k
≥ 0.
Choose some N ∈ N with N ≥ P
′
 and deﬁne the unary trees t
1
=
a
N
ba
N
cb and t
2
= a
N
ba
N
c. Obviously, t
1
∈ T
∀
(P), and t
2
/ ∈ T
∀
(P). Then,
t
1
∈ T
∃
(P
′
) and since t
2
is a preﬁx of t
1
, by Lemma 75(4), every nonleaf node v
of t
2
is in P
′
∃
(t
2
). Finally, for the leaf node v of t
2
, ancstr(v) ∈ L(r) for any of
the three expressions given above and ε ∈ L(s) for its corresponding horizontal
expression. Then, v ∈ P
′
∃
(t
2
), and thus by Lemma 75(3), t
2
∈ T
∃
(P
′
) which
completes the proof.
(78) First, P
∀
(Lin)
2exp
→ EDTD
st
follows immediately from Theorem 83(3).
We show P
∀
(Lin)
2exp
⇒ EDTD, which then implies both statements.
Let n ∈ N. According to Theorem 20(3), there exist a linear number of
regular expressions r
1
, . . . , r
m
of size linear in n such that any regular expres
sion deﬁning
¸
i≤m
L(r
i
) must be of size at least double exponential in n. Set
K =
¸
i≤m
L(r
i
).
Next, we deﬁne P
n
over the alphabet Σ ⊎ {a} as P
n
= {(a, r
i
)  i ≤
m} ∪{(ab, ε)  b ∈ Σ} ∪{(b, ∅)  b ∈ Σ}. That is, T
∀
(P
n
) deﬁnes all trees a(w),
for which w ∈ K.
Let D = (Σ, Σ
′
, d, a, µ) be an EDTD with T
∀
(P) = L(D). By Lemma 80(a),
we can assume that D is trimmed. Let a → r be the single rule in D for the root
element a. Let r
K
be the expressions deﬁning µ(L(r)). Since D is trimmed, it
follows from Lemma 80(2) that r
K
cannot contain an a. But then, L(r
K
) = K,
which proves that the size of D must be at least double exponential in n.
(9) The proof is along the same lines as that of Theorem 85(4).
(10) First, P
∀
(Lin) → DTD already holds for a restricted version of pattern
based schemas (Theorem 89(12)). Then, P
∀
(Lin)
2exp
→ DTD follows immedi
ately from Theorem 83(10). For P
∀
(Lin)
2exp
⇒ DTD, let n ∈ N. In the proof
7.4. Strongly Linear PatternBased Schemas 133
of Theorem 85(7) we have deﬁned a linear patternbased schema P
n
of size
polynomial in n for which any EDTD D
′
with T
∀
(P
n
) = L(D
′
) must be of size
at least double exponential in n. Furthermore, every DTD is an EDTD and
the language T
∀
(P
n
) is deﬁnable by a DTD. It follows that any DTD D with
T
∀
(P
n
) = L(D) must be of size at least double exponential in n.
7.4 Strongly Linear PatternBased Schemas
In [77], it is observed that the type of a node in most realworld XSDs only
depends on the labels of its parents and grand parents. To capture this idea,
following [62], we say that a regular expression is strongly linear if it is of the
form w or Σ
∗
w, where w is nonempty. A patternbased schema is strongly
linear if it is disjoint and all its vertical expressions are strongly linear. Denote
the class of all strongly linear patternbased schemas under existential and
universal semantics by P
∃
(SLin) and P
∀
(SLin), respectively.
In [62], all horizontal expressions in a strongly linear schema are also re
quired to be deterministic, as is the case for DTDs and XML Schema. The
latter requirement is necessary to get ptime satisfiability and inclusion
which would otherwise be pspacecomplete for arbitrary regular expressions.
This is also the case for the simplification problem studied here, but not for
the various translation problems. Therefore, we distinguish between strongly
linear schemas, as deﬁned above, and strongly linear schemas where all hori
zontal expressions must be deterministic, which we call deterministic strongly
linear schemas and denote by P
∃
(DetSLin) and P
∀
(DetSLin).
Theorem 89 shows the results for (deterministic) strongly linear pattern
based schemas. First, observe that the expressive power of these schemas
under existential and universal semantics again coincides. Further, all con
sidered problems become tractable, which makes strongly linear schemas very
interesting from a practical point of view.
Theorem 89. 1. P
∃
(SLin)
poly
→ P
∀
(SLin) and
P
∃
(DetSLin)
poly
→ P
∀
(DetSLin);
2. P
∃
(SLin)
poly
→ EDTD and P
∃
(DetSLin)
poly
→ EDTD;
3. P
∃
(SLin)
poly
→ EDTD
st
and P
∃
(DetSLin)
poly
→ EDTD
st
;
4. simplification for P
∃
(SLin) is pspacecomplete;
5. simplification for P
∃
(DetSLin) is in ptime;
6. P
∃
(SLin)
poly
→ DTD and P
∃
(DetSLin)
poly
→ DTD;
134 Succinctness of PatternBased Schema Languages
7. P
∀
(SLin)
poly
→ P
∃
(SLin) and P
∀
(DetSLin)
poly
→ P
∃
(DetSLin);
8. P
∀
(SLin)
poly
→ EDTD and P
∀
(DetSLin)
poly
→ EDTD;
9. P
∀
(SLin)
poly
→ EDTD
st
and P
∀
(DetSLin)
poly
→ EDTD
st
;
10. simplification for P
∀
(SLin) is pspacecomplete;
11. simplification for P
∀
(DetSLin) is in ptime; and,
12. P
∀
(SLin)
poly
→ DTD and P
∀
(DetSLin)
poly
→ DTD.
Proof. (1) We ﬁrst show P
∃
(SLin)
poly
→ P
∀
(SLin). The key of this proof lies
in the following lemma:
Lemma 90. For each ﬁnite set R of disjoint strongly linear expressions, a
ﬁnite set S of disjoint strongly linear regular expressions can be constructed
in polynomial time such that
¸
s∈S
L(s) = Σ
∗
\
¸
r∈R
L(r).
Before we prove this lemma, we show how it implies the theorem. For
P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}, let S be the set of strongly linear expressions
for R = {r
1
, . . . , r
n
} satisfying the conditions of Lemma 90. Set P
′
= P ∪
¸
s∈S
{(s, ∅)}. Here, T
∃
(P) = T
∃
(P
′
) and since P
′
is disjoint and complete it
follows from Lemma 76 that T
∃
(P
′
) = T
∀
(P
′
). This gives us T
∃
(P) = T
∀
(P
′
).
By Lemma 90, the set S is polynomial time computable and therefore, P
′
is
too.
Further, note that the regular expressions in P
′
are copies of these in P.
Therefore, P
∀
(DetSLin)
poly
→ P
∃
(DetSLin) also holds. We ﬁnally give the
proof of Lemma 90.
Proof. For R a set of strongly linear regular expressions, let Suﬃx(R) =
¸
r∈R
Suﬃx(r). Deﬁne U as the set of strings aw, a ∈ Σ, w ∈ Σ
∗
, such
that w ∈ Suﬃx(R), and aw / ∈ Suﬃx(R). Deﬁne V as Suﬃx(R) \
¸
r∈R
L(r).
We claim that S =
¸
u∈U
{Σ
∗
u} ∪
¸
v∈V
{v} is the desired set of regular ex
pressions. For instance, for R = {Σ
∗
abc, Σ
∗
b, bc} we have U = {bbc, cbc, ac, cc,
a} and V = {c} which gives us S = {Σ
∗
bbc, Σ
∗
cbc, Σ
∗
ac, Σ
∗
cc, Σ
∗
a, c}.
It suﬃces to show that, given R: (1) S is ﬁnite and polynomial time
computable; (2) the expressions in S are pairwise disjoint; (3)
¸
r∈R
L(r) ∩
¸
s∈S
L(s) = ∅; and, (4)
¸
r∈R∪S
L(r) = Σ
∗
.
We ﬁrst show (1). Every r ∈ R is of the form w or Σ
∗
w, for some w. Then,
for r there are only w suﬃxes in L(r) which can match the deﬁnition of U
or V . When a string w
′
, with w
′
 > w, is a suﬃx in L(r) then r must be of
the form Σ
∗
w and thus for every a ∈ Σ, aw is also a suﬃx in L(r), and thus
7.4. Strongly Linear PatternBased Schemas 135
aw / ∈ U. Further, w
′
/ ∈ V . So, the number of strings in U and V is bounded
by the number of rules in R times the length of the strings w occurring in the
expressions in R, times the number of alphabet symbols, which is a polynomial.
Obviously, we can also compute these strings in polynomial time.
For (2), we must check that the generated expressions are all pairwise
disjoint. First, every expression generated by V deﬁnes only one string, so
two expressions generated by V always have an empty intersection. For an
expression Σ
∗
aw generated by U and an string w
′
in V , suppose that their
intersection is nonempty and thus w
′
∈ L(Σ
∗
aw). Then, aw must be a suf
ﬁx of w
′
and we know by deﬁnition of V that w
′
∈ Suﬃx(R). But then,
also aw ∈ Suﬃx(R) which contradicts the deﬁnition of U. Third, suppose
that two expressions Σ
∗
aw, Σ
∗
a
′
w
′
generated by U have a nonempty inter
section. Then, aw must be a suﬃx of a
′
w
′
(or the other way around, but
that is perfectly symmetrical), and since aw = a
′
w
′
, aw must be a suﬃx of
w
′
. But w
′
∈ Suﬃx(R) and thus aw ∈ Suﬃx(R) must also hold, which again
contradicts the deﬁnition of U.
For (3), The strings in V are explicitly deﬁned such that their intersection
with
¸
r∈R
L(r) is empty. For the expression generated by U, observe that
they only deﬁne words which have suﬃxes that can not be suﬃxes of any
word deﬁned by any expression in R. Therefore,
¸
r∈R
L(r) ∩
¸
s∈S
L(s) = ∅.
Finally, we show (4). Let w / ∈ L(r), for any r ∈ R. We show that there
exists an s ∈ S, such that w ∈ L(s). If w ∈ V , we are done. So assume
w / ∈ V . Let w = a
1
· · · a
k
. Now, we go from left to right through w and
search for the rightmost l ≤ k + 1 such that w
l
= a
l
· · · a
k
∈ Suﬃx(R), and
w
l−1
= a
l−1
· · · a
k
/ ∈ Suﬃx(R). When l = k + 1, w
l
= ε. Then, w is accepted
by the expression Σ
∗
a
l−1
· · · a
k
, which by deﬁnition must be generated by U.
It is only left to show that there indeed exists such an index l for w. Thereto,
note that if l = k + 1, then it is easy to see that w
l
= ε is a suﬃx of every
string accepted by every r ∈ R. Conversely, if l = 1 we show that w
l
= w can
not be a suﬃx of any string deﬁned by any r ∈ R. Suppose to the contrary
that w ∈ Suﬃx(r), for some r ∈ R. Let r be w
r
or Σ
∗
w
r
. If w is a suﬃx
of w
r
, then w is accepted by an expression generated by V , which case we
already ruled out. If w is not a suﬃx of w
r
, then r must be of the form Σ
∗
w
r
and w
r
must be a suﬃx of w. But then, w ∈ L(r), which also contradicts
our assumptions. So, we can only conclude that w
1
/ ∈ Suﬃx(R). So, given
that w
k+1
∈ Suﬃx(R), and w
1
/ ∈ Suﬃx(R), we are guaranteed to ﬁnd some l,
1 < l ≤ k +1, such that w
l
∈ Suﬃx(R), and w
l−1
/ ∈ Suﬃx(R). This concludes
the proof of Lemma 90.
(7) For P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}, let S = {r
′
1
, · · · , r
′
m
} be the set of strongly
linear expressions for R = {r
1
, . . . , r
n
} satisfying the conditions of Lemma 90.
136 Succinctness of PatternBased Schema Languages
Then, deﬁne P
′
= {(r
1
, s
1
), . . . , (r
n
, s
n
), (r
′
1
, Σ
∗
), . . . , (r
′
m
, Σ
∗
)}. Here, T
∀
(P) =
T
∀
(P
′
) and since P
′
is disjoint and complete it follows from Lemma 76 that
T
∃
(P
′
) = T
∀
(P
′
). This gives us T
∀
(P) = T
∃
(P
′
). By Lemma 90, the set S is
polynomial time computable and therefore, P
′
is too.
Further, note that the regular expressions in P
′
are copies of these in P.
Therefore, P
∃
(DetSLin)
poly
→ P
∀
(DetSLin) also holds.
(23),(89) We show P
∃
(SLin)
poly
→ EDTD
st
. Since deterministic strongly
linear schemas are a subset of stronglylinear schemas, since singletype EDTDs
are a subset of EDTDs and since we can translate a stronglylinear schema
with universal semantics into an equivalent one with existential semantics in
polynomial time (Theorem 89(7)), all other results follow.
Given P, we construct an automatonbased schema D = (A, λ) such that
L(D) = T
∃
(P). By Lemma 78, we can then translate D into an equivalent
singletype EDTD in polynomial time. Let P = {(r
1
, s
1
), . . . , (r
n
, s
n
)}. We
deﬁne D such that when A is in state q after reading w, λ(q) = s
i
if and
only if w ∈ L(r
i
) and λ(q) = ∅ otherwise. The most obvious way to construct
A is by constructing DFAs for the vertical expressions and combining these
by a product construction. However, this would induce an exponential blow
up. Instead, we construct A in polynomial time in a manner similar to the
construction used in Proposition 5.2 in [62].
First, assume that every r
i
is of the form Σ
∗
w
i
. We later extend the
construction to also handle vertical expressions of the form w
i
. Deﬁne S = {w 
w ∈ Preﬁx(w
i
), 1 ≤ i ≤ n}. Then, A = (Q, q
0
, δ) is deﬁned as Q = S ∪ {q
0
},
and for each a ∈ Σ,
• δ(q
0
, a) = a if a ∈ S, and δ(q
0
, a) = q
0
otherwise; and
• for each w ∈ S, δ(w, a) = w
′
, where w
′
is the longest suﬃx of wa in S,
and δ(w, a) = q
0
if no string in S is a suﬃx of wa.
For the deﬁnition of λ, let λ(q
0
) = ∅, and for all w ∈ S, λ(w) = s
i
if
w ∈ L(r
i
) and λ(w) = ∅ if w / ∈ L(r
i
) for all i ≤ n. Note that since the vertical
expression are disjoint, λ is welldeﬁned.
We prove the correctness of our construction using the following lemma
which can easily be proved by induction on the length of u.
Lemma 91. For any string u = a
1
· · · a
k
,
1. if q
0
⇒
A,u
q
0
, then no suﬃx of u is in S; and
2. if q
0
⇒
A,u
w, for some w ∈ S, then w is the biggest element in S which
is a suﬃx of u.
7.4. Strongly Linear PatternBased Schemas 137
3. q
0
⇒
A,u
q, with λ(q) = ∅, if and only if u / ∈ L(r
i
), for any i ≤ n; and
4. q
0
⇒
A,u
w, w ∈ S, with λ(w) = s
i
, if and only if u ∈ L(r
i
).
To show that L(D) = T
∃
(P), it suﬃces to prove that for any tree t, a node
v ∈ P
∃
(t) if and only if chstr(v) ∈ L(λ(q)) for q ∈ Q such that q
0
⇒
A,ancstr(v)
q.
First, suppose v ∈ P
∃
(t). Then, for some i ≤ n, ancstr(v) ∈ L(r
i
) and
chstr(v) ∈ L(s
i
). By Lemma 91(4), and the deﬁnition of λ, q
0
⇒
ancstr(v)
q,
with λ(q) = s
i
. But then, chstr(v) ∈ L(λ(q)).
Conversely, suppose that for q such that q
0
⇒
A,ancstr(v)
q, chstr(v) ∈
L(λ(q)) holds. Then, by Lemma 91(4), there is some i such that ancstr(v) ∈
L(r
i
), and by the deﬁnition of λ, chstr(v) ∈ L(s
i
). It follows that v ∈ P
∃
(t).
We have now shown that the construction is correct when all expressions
are of the form Σ
∗
w. We sketch the extension to the full class of strongly linear
expressions. Assume w.l.o.g. that there exists some m such that for i ≤ m,
r
i
= Σ
∗
w
i
and for i > m, r
i
= w
i
. Deﬁne S = {w  w ∈ Preﬁx(w
i
)∧1 ≤ i ≤ m}
in the same manner as above, and S
′
= {w  w ∈ Preﬁx(w
i
) ∧ m < i ≤ n}.
Deﬁne A = (Q, q
′
0
, δ), with Q = {q
0
, q
′
0
} ∪ S ∪ S
′
. Note that the elements of
S and S
′
need not be disjoint. Therefore, we denote the states corresponding
to elements of S
′
by primes, for instance ab ∈ S
′
corresponds to the state
a
′
b
′
. Then, for any symbol a ∈ Σ, δ(q
′
0
, a) = a
′
if a ∈ S
′
; δ(q
′
0
, a) = a if
a / ∈ S
′
∧a ∈ S; and δ(q
′
0
, a) = q
0
otherwise. For a string w ∈ S
′
, δ(w
′
, a) = w
′
a
′
if wa ∈ S
′
, δ(w
′
, a) is the longest suﬃx of wa in S if it exists and wa / ∈ S
′
,
and δ(w
′
, a) = q
0
otherwise. The transition function for q
0
and the states
introduced by S remains the same. So, we have added a subautomaton to
A which starts by checking whether w = w
i
, for some i > m, much like a
suﬃxtree, and switches to the normal operation of the original automaton if
this is not possible anymore.
Finally, the deﬁnition of λ again remains the same for q
0
and the states
introduced by S. Further, λ(q
′
0
) = ∅, and λ(w
′
) = r
i
if w ∈ L(r
i
) for some i,
1 ≤ i ≤ n, and λ(w
′
) = ∅ otherwise. The previous lemma can be extended
for this extended construction and the correctness of the construction follows
thereof.
(4),(10) This follows immediately from Theorem 85(4) and (9). The upper
bound carries over since every strongly linear schema is also a linear schema.
For the lower bound, observe that the schema used in the proofs of Theo
rem 85(4) and (9) is strongly linear.
(5),(11) We give the proof for the existential semantics. By Theorem 89(7)
the result carries over immediately to the universal semantics.
The algorithm proceeds in a number of steps. First, construct an automaton
based schema D
1
such that L(D
1
) = T
∃
(P). By Theorem 89(3) this can be
138 Succinctness of PatternBased Schema Languages
done in polynomial time. Furthermore, the regular expressions in D
1
are
copies of the horizontal expressions in P and are therefore also deterministic.
Then, translate D
1
into a singletype EDTD D
2
= (Σ, Σ
′
, d
2
, a, µ), which by
Lemma 78 can again be done in ptime and also maintains the determinism of
the used regular expressions. Then, we trim D
2
which can be done in polyno
mial time by Lemma 80(1) and also preserves the determinism of the expres
sions in D
2
. Finally, we claim that L(D
2
) = T
∃
(P) is deﬁnable by a DTD if and
only if for every two types a
i
, a
j
∈ Σ
′
it holds that L(µ(d(a
i
))) = L(µ(d(a
j
))).
Since all regular expressions in D
2
are deterministic, this can be tested in
polynomial time. We ﬁnally prove the above claim:
First, suppose that for every pair of types a
i
, a
j
∈ Σ
′
it holds that µ(d
2
(a
i
))
= µ(d
2
(a
j
)). Then, consider the DTD D = (Σ, d, s), where d(a) = µ(d
2
(a
i
))
for some a
i
∈ Σ
′
. Since all regular expression µ(d
2
(a
i
)), with µ(a
i
) = a, are
equivalent, it does not matter which type we choose. Now, L(D) = L(D
2
)
which shows that L(D
2
) is deﬁnable by a DTD.
Conversely, suppose that there exist types a
i
, a
j
∈ Σ
′
such that µ(L(d(a
i
)))
= µ(L(d(a
j
))). We show that L(D
2
) is not closed under ancestorguarded
subtree exchange. From Lemma 79 it then follows that L(D
2
) is not deﬁnable
by a DTD. Since µ(L(d(a
i
))) = µ(L(d(a
j
))), there exists a string w such that
w ∈ µ(L(d(a
i
))) and w / ∈ µ(L(d(a
j
))) or w / ∈ µ(L(d(a
i
))) and w ∈ µ(L(d(a
j
))).
We consider the ﬁrst case, the second is identical. Let t
1
∈ L(d
2
) be a tree
with some node v with lab
t
1
(v) = a
i
and chstr
t
1
(v) = w
′
where µ(w
′
) =
w. Further, let t
2
∈ L(d
2
) be a tree with some node u with lab
t
2
(u) = a
j
.
Since D
2
is trimmed, t
1
and t
2
must exist by Lemma 80(2). Now, deﬁne
t
3
= µ(t
2
)[u ← µ(subtree
t
1
(v))] which is obtained from µ(t
1
) and µ(t
2
) by
labelguarded subtree exchange. Because D
2
is a singletype EDTD, it must
assign the type a
j
to node u in t
3
. However, chstr
t
3
(u) = w / ∈ µ(L(d(a
j
)))
and thus t
3
/ ∈ L(D
3
). This shows that D
2
is not closed under labelguarded
subtree exchange.
(6),(12) We ﬁrst show that P
∀
(DetSLin) → DTD and then P
∃
(SLin)
poly
→
DTD. Since deterministic stronglylinear schemas are a subset of strongly
linear schemas and since we can translate a stronglylinear schema with univer
sal semantics into an equivalent one with existential semantics in polynomial
time (Theorem 89(7)), all other results follow.
First, to show that P
∀
(DetSLin) → DTD, let Σ
P
= {a, b, c, d, e, f} and
P = {(a, b + c), (ab, d), (ac, d), (abd, ε), (acd, f), (acdf, ε)}. Here, a(b(d)) ∈
T
∀
(P) and a(c(d(f))) ∈ T
∀
(P) but a(b(d(f))) / ∈ T
∀
(P). Therefore, T
∀
(P)
is not closed under ancestorguarded subtree exchange and by Lemma 79 is
not deﬁnable by a DTD.
To show that P
∃
(SLin)
poly
→ DTD, note that the algorithm in the above
7.4. Strongly Linear PatternBased Schemas 139
proof also works when the horizontal regular expressions are not deterministic.
The total algorithm then becomes pspace, because we have to test equivalence
of regular expressions. However, the DTD D is still constructed in polynomial
time, which completes this proof.
8
Optimizing XML Schema
Languages
As mentioned before, XML is the lingua franca for data exchange on the
Internet [1]. Within applications or communities, XML data is usually not
arbitrary but adheres to some structure imposed by a schema. The presence
of such a schema not only provides users with a global view on the anatomy
of the data, but far more importantly, it enables automation and optimiza
tion of standard tasks like (i) searching, integration, and processing of XML
data (cf., e.g., [28, 69, 73, 109]); and, (ii) static analysis of transformations
(cf., e.g., [5, 56, 74, 87]). Decision problems like equivalence, inclusion and
nonemptiness of intersection of schemas, hereafter referred to as the basic de
cision problems, constitute essential building blocks in solutions for the just
mentioned optimization and static analysis problems. Additionally, the basic
decision problems are fundamental for schema minimization (cf., e.g., [25, 78]).
Because of their widespread applicability, it is therefore important to estab
lish the exact complexity of the basic decision problems for the various XML
schema languages.
The most common schema languages are DTD, XML Schema [100], and
Relax NG [22] and can be modeled by grammar formalisms [85]. In partic
ular, DTDs correspond to contextfree grammars with regular expressions at
righthand sides, while Relax NG is usually abstracted by extended DTDs
(EDTDs) [88] or equivalently, unranked tree automata [16], deﬁning the regu
lar unranked tree languages. XML Schema is usually abstracted by singletype
142 Optimizing XML Schema Languages
shop → regular
∗
& discountbox
∗
regular → cd
discountbox → cd
10,12
price
cd → artist & title & price
Figure 8.1: A sample schema using the numerical occurrence and interleave
operators. The schema deﬁnes a shop that sells CDs and oﬀers a special price
for boxes of 10–12 CDs.
EDTDs. As detailed in [75], the relationship between schema formalisms and
grammars provides direct upper and lower bounds for the complexity of the
basic decision problems.
A closer inspection of the various schema speciﬁcations reveals that the
above abstractions in terms of grammars with regular expressions is too coarse.
Indeed, in addition to the conventional regular expression operators like con
catenation, union, and Kleenestar, the XML Schema and the Relax NG spec
iﬁcation allow two other operators as well. The XML Schema speciﬁcation
allows to express counting or numerical occurrence constraints which deﬁne
the minimal and maximal number of times a regular construct can be repeated.
Relax NG allows unordered concatenation through the interleaving operator,
which is also present in XML Schema in a restricted form. Finally, both DTD
and XML Schema require expressions to be deterministic.
We illustrate these additional operators in Figure 8.1. Although the new
operators can be expressed by the conventional regular operators, they can
not do so succinctly (see Chapter 4), which has severe implications on the
complexity of the basic decision problems.
The goal of this chapter is to study the impact of these counting and inter
leaving operators on the complexity of the basic decision problems for DTDs,
XSDs, and Relax NG. As observed in Section 8.1, the complexity of inclu
sion and equivalence of RE(#, &) expressions (and subclasses thereof) carries
over to DTDs and singletype EDTDs. Therefore, the results in Chapter 5,
concerning (subclasses) of RE(#, &) and Section 6.7, concerning deterministic
expressions with counting, immediately give us complexity bounds for the ba
sic decision problems for a wide range of subclasses of DTDs and singletype
EDTDs. For EDTDs, this immediate correspondence does not hold anymore.
Therefore, we study the complexity of the basic decision problems for EDTDs
extended with counting and interleaving operators. Finally, we revisit the
simpliﬁcation problem introduced in [77] for schemas with RE(#, &) expres
sions. This problem is deﬁned as follows: given an extended DTD, can it be
rewritten into an equivalent DTD or a singletype EDTD?
Outline. In Sections 8.1 and 8.2, we establish the complexity of the basic
8.1. Complexity of DTDs and SingleType EDTDs 143
decision problems for DTDs and singletype EDTDs, and EDTDs, respectively.
We discuss simpliﬁcation in Section 8.3.
8.1 Complexity of DTDs and SingleType EDTDs
In this section, we establish the complexity of the basic decision problems
for DTDs and singletype EDTDs with extended regular expressions. This is
mainly done by showing that the results on extended regular expressions of
Section 5.2 often carry over literally to DTDs and singletype EDTDs.
We call a complexity class C closed under positive reductions if the following
holds for every O ∈ C. Let L
′
be accepted by a deterministic polynomialtime
Turing machine M with oracle O (denoted L
′
= L(M
O
)). Let M further have
the property that L(M
A
) ⊆ L(M
B
) whenever L(A) ⊆ L(B). Then L
′
is also
in C. For a more precise deﬁnition of this notion we refer the reader to [54].
For our purposes, it is suﬃcient that important complexity classes like ptime,
np, conp, pspace, and expspace have this property, and that every such class
contains ptime. The following two propositions have been shown to hold for
standard regular expressions in [75]. However, their proofs carry over literally
to all subclasses of RE(#, &).
Proposition 92 ([75]). Let R be a subclass of RE(#, &) and let C be a
complexity class closed under positive reductions. Then the following are
equivalent:
(a) inclusion for R expressions is in C.
(b) inclusion for DTD(R) is in C.
(c) inclusion for EDTD
st
(R) is in C.
The corresponding statement holds for equivalence.
Clearly, any lower bound on the complexity of equivalence and inclu
sion of a class of regular expressions R also carries over to DTD(R) and
EDTD
st
(R). Hence the results on the equivalence and inclusion problem
of Section 5.2 and Section 6.7 all carry over to DTDs and singletype EDTDs.
For instance, equivalence and inclusion for EDTD
st
(#), EDTD
st
(#,&),
and DTD(&) are expspacecomplete, while these problems are in pspace for
EDTD
st
(DRE
#
S
).
The previous proposition can be generalized to intersection of DTDs as
well.
144 Optimizing XML Schema Languages
Proposition 93 ([75]). Let R be a subclass of RE(#, &) and let C be a
complexity class which is closed under positive reductions. Then the following
are equivalent:
(a) intersection for R expressions is in C.
(b) intersection for DTD(R) is in C.
Hence, also the results on intersection of Section 5.2 carry over to DTDs,
and any lower bound on intersection for R expressions carries over to
DTD(R) and EDTD
st
(R). Hence, for example intersection for DTD(#,&)
and DTD(DRE
#
W
) are pspacecomplete. The only remaining problem is in
tersection for singletype EDTDs. Although the above proposition does not
hold for singletype EDTDs, we can still establish complexity bounds for them.
Indeed, intersection for EDTD
st
(RE) is exptimehard and in the next sec
tion we will see that even for EDTD(#,&) intersection remains in exptime.
It immediately follows that intersection for EDTD
st
(#), EDTD
st
(&), and
EDTD
st
(#,&) is also exptimecomplete.
8.2 Complexity of Extended DTDs
We next consider the complexity of the basic decision problems for EDTDs
with numerical occurrence constraints and interleaving. Here, we do not con
sider deterministic expressions as EDTDs are the abstraction of Relax NG,
which does not require expressions to be deterministic. As the basic decision
problems are exptimecomplete for EDTD(RE), the straightforward approach
of translating every RE(#, &) expression into an NFA and then applying the
standard algorithms gives rise to a double exponential time complexity. By
using the NFA(#, &) introduced in Section 5.1, we can do better: expspace
for inclusion and equivalence, and, more surprisingly, exptime for inter
section.
Theorem 94. (1) equivalence and inclusion for EDTD(#,&) are in ex
pspace;
(2) equivalence and inclusion for EDTD(#) and EDTD(&) are expspace
hard; and,
(3) intersection for EDTD(#,&) is exptimecomplete.
Proof. (1) We show that inclusion is in expspace. The upper bound for
equivalence then immediately follows.
First, we introduce some notation. For an EDTD D = (Σ, Σ
′
, d, s, µ),
we will denote elements of Σ
′
, i.e., types, by τ. We denote by (D, τ) the
8.2. Complexity of Extended DTDs 145
EDTD D with start symbol τ. We deﬁne the depth of a tree t, denoted by
depth(t), as follows: if t = ε, then depth(t) = 0; and if t = σ(t
1
· · · t
n
), then
depth(t) = max{depth(t
i
)  i ∈ {1, . . . , n}} + 1.
Suppose that we have two EDTDs D
1
= (Σ, Σ
′
1
, d
1
, s
1
, µ
1
) and D
2
=
(Σ, Σ
′
2
, d
2
, s
2
, µ
2
). We provide an expspace algorithm that decides whether
L(D
1
) ⊆ L(D
2
). As expspace is closed under complement, the theorem fol
lows. The algorithm computes a set E of pairs (C
1
, C
2
) ∈ 2
Σ
′
1
× 2
Σ
′
2
where
(C
1
, C
2
) ∈ E if and only if there exists a tree t such that C
j
= {τ ∈ Σ
′
j
 t ∈
L((D
j
, τ))} for each j = 1, 2. That is, every C
j
is the set of types that can be
assigned by D
j
to the root of t. Or when viewing D
j
as a tree automaton, C
j
is the set of states that can be assigned to the root in a run on t. Therefore,
we say that t is a witness for (C
1
, C
2
). Notice that t ∈ L(D
1
) (respectively,
t ∈ L(D
2
)) if s
1
∈ C
1
(respectively, s
2
∈ C
2
). Hence, L(D
1
) ⊆ L(D
2
) if and
only if there exists a pair (C
1
, C
2
) ∈ E with s
1
∈ C
1
and s
2
∈ C
2
.
We compute the set E in a bottomup manner as follows:
1. Initially, set E
1
:= {(C
1
, C
2
)  ∃a ∈ Σ, τ
1
∈ Σ
′
1
, τ
2
∈ Σ
′
2
such that
µ
1
(τ
1
) = µ
2
(τ
2
) = a, and for i = 1, 2, C
i
= {τ ∈ Σ
′
i
 ε ∈ d
i
(τ) ∧ µ
i
(τ) =
a}}.
2. For every k > 1, E
k
is the union of E
k−1
and the pairs (C
1
, C
2
) for which
there are a ∈ Σ, n ∈ N and a string (C
1,1
, C
2,1
) · · · (C
1,n
, C
2,n
) in E
∗
k−1
such that
C
j
= {τ ∈ Σ
′
j
 µ
j
(τ) = a, ∃b
j,1
∈ C
j,1
, . . . , b
j,n
∈ C
j,n
with b
j,1
· · · b
j,n
∈ d
j
(τ)}, for each j = 1, 2.
Let E := E
ℓ
for ℓ = 2
Σ
′
1

· 2
Σ
′
2

. The algorithm then accepts when there is a
pair (C
1
, C
2
) ∈ E with s
1
∈ C
1
and s
2
∈ C
2
and rejects otherwise.
We argue that the algorithm is correct. As E
k
⊆ E
k+1
, for every k, it
follows that E
ℓ
= E
ℓ+1
. Hence, the algorithm computes the largest set of
pairs. The following lemma then shows that the algorithm decides whether
L(D
1
) ⊆ L(D
2
). The lemma can be proved by induction on k.
Lemma 95. For every k ≥ 1, (C
1
, C
2
) ∈ E
k
if and only if there exists a
witness tree for (C
1
, C
2
) of depth at most k.
It remains to show that the algorithm can be carried out using exponen
tial space. Step (1) reduces to a linear number of tests ε ∈ L(r), for some
RE(#, &) expressions r which is in ptime by [64]. Step (3) can be carried out
in exponential time, since the size of E is exponential in the input. For step
(2), it suﬃces to argue that, when E
k−1
is known, it is decidable in expspace
146 Optimizing XML Schema Languages
whether a pair (C
1
, C
2
) is in E
k
. As there are only an exponential number
of such possible pairs, the result follows. To this end, we need to verify that
there exists a string W = (C
1,1
, C
2,1
) · · · (C
1,n
, C
2,n
) in E
∗
k−1
such that for each
j = 1, 2,
(A) for every τ ∈ C
j
, there exist b
j,1
∈ C
j,1
, . . . , b
j,n
∈ C
j,n
with b
j,1
· · · b
j,n
∈
d
j
(τ); and,
(B) for every τ ∈ Σ
′
j
\ C
j
, there do not exist b
j,1
∈ C
j,1
, . . . , b
j,n
∈ C
j,n
with
b
j,1
· · · b
j,n
∈ d
j
(τ).
Assume that Σ
′
1
∩ Σ
′
2
= ∅. Let, for each j = 1, 2 and τ ∈ Σ
′
j
, N(τ) be the
NFA(#, &) accepting d
j
(τ). Intuitively, we guess the string W one symbol at
a time and compute the set of reachable conﬁgurations Γ
τ
for each N(τ).
Initially, Γ
τ
is the singleton set containing the initial conﬁguration of N(τ).
Suppose that we have guessed a preﬁx (C
1,1
, C
2,1
) · · · (C
1,m−1
, C
2,m−1
) of W
and that we guess a new symbol (C
1,m
, C
2,m
). Then, we compute the set
Γ
′
τ
= {γ
′
 ∃b ∈ C
j,m
, γ ∈ Γ
τ
such that γ ⇒
N(τ),b
γ
′
} and set Γ
τ
to Γ
′
τ
. Each
set Γ
′
τ
can be computed in exponential space from Γ
τ
. We accept (C
1
, C
2
) when
for every τ ∈ Σ
′
j
, τ ∈ C
j
if and only if Γ
τ
contains an accepting conﬁguration.
(2) It is shown by Mayer and Stockmeyer [79] and Meyer and Stockmeyer [83]
that equivalence and inclusion are expspacehard for RE(&)s and RE(#),
respectively. Hence, equivalence and inclusion are also expspacehard for
EDTD(&) and EDTD(#).
(3) The lower bound follows from [99]. We argue that the problem is in
exptime. Thereto, let, for each i ∈ {1, . . . , n}, D
i
= (Σ, Σ
′
i
, d
i
, s
i
, µ
i
) be an
EDTD(#,&). We assume w.l.o.g. that the sets Σ
′
i
are pairwise disjoint. We
also assume that the start type s
i
never appears at the righthand side of a
rule. Finally, we assume that no derivation tree consists of only the root.
For each type τ ∈ Σ
′
i
, let N(τ) denote an NFA(#, &) for d
i
(τ). According
to Theorem 42, N(τ) can be computed from d
i
(τ) in polynomial time. We
provide an alternating polynomial space algorithm that guesses a tree t and
accepts if t ∈ L(D
1
) ∩ · · · ∩ L(D
n
). As apspace = exptime [19], this proves
the theorem.
We guess t node by node in a topdown manner. For every guessed node
v, the following information is written on the tape of the Turing machine: for
every i ∈ {1, . . . , n}, the triple c
i
= (τ
i
v
, τ
i
p
, γ
i
) where τ
i
v
is the type assigned
to v by grammar D
i
, τ
i
p
is the type of the parent assigned by D
i
, and γ
i
is the
current conﬁguration N(τ
i
p
) is in after reading the string formed by the left
siblings of v. In the following, we say that τ ∈ Σ
′
i
is an atype when µ
i
(τ) = a.
The algorithm proceeds as follows:
8.3. Simpliﬁcation 147
1. As for each grammar the types of the roots are given, we start by guessing
the ﬁrst child of the root. That is, we guess an a ∈ Σ, and for each
i ∈ {1, . . . , n}, we guess an atype τ
i
and write the triple c
i
= (τ
i
, s
i
, γ
i
s
)
on the tape where γ
i
s
is the start conﬁguration of N(s
i
).
2. For i ∈ {1, . . . , n}, let c
i
= (τ
i
, τ
i
p
, γ
i
) be the triples on the tape. The
algorithm now universally splits into two parallel branches as follows:
(a) Downward extension: When for every i, ε ∈ d
i
(τ
i
) then the
current node can be a leaf node and the branch accepts. Otherwise,
guess an a ∈ Σ and for each i, guess an atype θ
i
. Replace every
c
i
by the triple (θ
i
, τ
i
, γ
i
s
) and proceed to step (2). Here, γ
i
s
is the
start conﬁguration of N(τ
i
).
(b) Extension to the right: For every i ∈ {1, . . . , n}, compute a
conﬁguration γ
′i
for which γ
i
⇒
N(τ
i
p
),τ
i γ
′i
. When every γ
′i
is a
ﬁnal conﬁguration, then we do not need to extend to the right
anymore and the algorithm accepts. Otherwise, guess an a ∈ Σ
and for each i, guess an atype θ
i
. Replace every c
i
by the triple
(θ
i
, τ
i
, γ
′i
) and proceed to step (2).
We argue that the algorithm is correct. If the algorithm accepts, we have
guessed a tree t and, for every i ∈ {1, . . . , n}, a tree t
′
i
with µ
i
(t
′
i
) = t and
t
′
i
∈ L(d
i
). Therefore, t ∈
¸
n
i=1
L(D
i
). For the other direction, suppose
that there exists a tree t ∈
¸
n
i=1
L(D
i
) and t is minimal in the sense that no
subtree t
0
of t is in
¸
n
i=1
L(D
i
). Then, there is a run of the above algorithm
that guesses t and guesses trees t
′
i
with µ
i
(t
′
i
) = t. The tree t must be minimal
since the algorithm stops extending the tree as soon as possible.
The algorithm obviously uses only polynomial space.
8.3 Simpliﬁcation
In this section, we study the simpliﬁcation problem a bit more broadly than
before. Given an EDTD, we are interested in knowing whether it has an
equivalent EDTD of a restricted type, i.e., an equivalent DTD or singletype
EDTD. In [77], this problem was shown to be exptimecomplete for EDTDs
with standard regular expressions. We revisit this problem in the context of
RE(#, &).
We need a bit of terminology. We say that a tree language L is closed under
ancestorguarded subtree exchange if the following holds. Whenever for two
trees t
1
, t
2
∈ L with nodes u
1
∈ Dom(t
1
) and u
2
∈ Dom(t
2
), ancstr
t
1
(u
1
) =
ancstr
t
2
(u
2
) implies t
1
[u
1
← subtree
t
2
(u
2
)] ∈ L.
We recall the following theorem from [77]:
148 Optimizing XML Schema Languages
Theorem 96 (Theorem 7.1 in [77]). Let L be a tree language deﬁned by an
EDTD. Then the following conditions are equivalent.
(a) L is deﬁnable by a singletype EDTD.
(b) L is closed under ancestorguarded subtree exchange.
We are now ready for the following theorem.
Theorem 97. Given an EDTD(#,&), deciding whether it is equivalent to an
EDTD
st
(#,&) or DTD(#,&) is expspacecomplete.
Proof. We ﬁrst show that the problem is hard for expspace. We use a reduc
tion from equivalence of RE(#), which is expspacecomplete [83].
Let r
1
, r
2
be RE(#) expressions over Σ and let b and s be two symbols
not occurring in Σ. By deﬁnition L(r
j
) = ∅, for j = 1, 2. Deﬁne D =
(Σ ∪ {b, s}, Σ ∪ {s, b
1
, b
2
}, d, s, µ) as the EDTD with the following rules:
s → b
1
b
2
b
1
→ r
1
b
2
→ r
2
,
where for every τ ∈ Σ ∪ {s}, µ(τ) = τ, and µ(b
1
) = µ(b
2
) = b. We claim that
D is equivalent to a singletype DTD or a DTD if and only if L(r
1
) = L(r
2
).
Clearly, if r
1
is equivalent to r
2
, then D is equivalent to the DTD (and therefore
also to a singletype EDTD)
s → bb
b → r
1
.
Conversely, suppose that there exists an EDTD
st
which deﬁnes the language
L(D). Towards a contradiction, assume that r
1
is not equivalent to r
2
. So,
there exists a string w
1
such that w
1
∈ L(r
1
) and w
1
/ ∈ L(r
2
), or w
1
/ ∈ L(r
1
)
and w
1
∈ L(r
2
). We only consider the ﬁrst case, the second is identical. Now,
let w
2
be a string in L(r
2
) and consider the tree t = s(b(w
1
)b(w
2
)). Clearly, t
is in L(D). However, the tree t
′
= s(b(w
2
)b(w
1
)) obtained from t by switching
its left and right subtree is not in L(D). According to Theorem 96, every tree
language deﬁned by a singletype EDTD is closed under such an exchange of
subtrees. So, this means that L(D) cannot be deﬁned by an EDTD
st
, which
leads to the desired contradiction.
We now proceed with the upper bounds. The following algorithms are
along the same lines as the EXPTIME algorithms in [77] for the simpliﬁcation
problem without numerical occurrence or interleaving operators. We ﬁrst give
an expspace algorithm which decides whether an EDTD is equivalent to an
8.3. Simpliﬁcation 149
EDTD
st
. Let D = (Σ, Σ
′
, d, s, µ) be an EDTD. Intuitively, we compute an
EDTD
st
D
0
= (Σ, Σ
′
0
, d
0
, s, µ
0
) which is the closure of D under the singletype
property. The EDTD
st
D
0
has the following properties:
(a) Σ
′
0
is in general exponentially larger than Σ
′
;
(b) the RE(#, &) expressions in the deﬁnition of d
0
are only polynomially
larger than the RE(#, &) expressions in the deﬁnition of d;
(c) L(D) ⊆ L(D
0
); and,
(d) L(D
0
) = L(D) ⇔ D is equivalent to a EDTD
st
.
Hence, D is equivalent to an EDTD
st
if and only if L(D
0
) ⊆ L(D).
We ﬁrst show how D
0
can be constructed. We can assume w.l.o.g. that,
for each type a
i
∈ Σ
′
, there exists a tree t
′
∈ L(d) such that a
i
is a label in t
′
.
Indeed, every useless type can be removed from D in a simple preprocessing
step. Then, for a string w ∈ Σ
∗
and a ∈ Σ let types(wa) be the set of all types
a
i
∈ Σ
′
, for which there is a tree t and a tree t
′
∈ L(d) with µ(t
′
) = t, and a
node v in t such that ancstr
t
(v) = wa and the type of v in t
′
is a
i
. We show
how to compute types(wa) in exponential time. To this end, we enumerate
all sets types(w). Let s = c
1
. Initially, set W := {c}, Types(c) := {c
1
} and
R := {{c
1
}}. Repeat the following until W becomes empty:
(1) Remove a string wa from W.
(2) For every b ∈ Σ, let Types(wab) contain all b
i
for which there exists an
a
j
in Types(wa) and a string in d(a
j
) containing b
i
. If Types(wab) is not
empty and not already in R, then add it to R and add wab to W.
Since we add every set only once to R, the algorithm runs in time exponential
in the size of D. Moreover, we have that Types(w) = types(w) for every w,
and that R = Σ
′
0
.
For each a ∈ Σ, let types(D, a) be the set of all nonempty sets types(wa),
with w ∈ Σ
∗
. Clearly, each types(D, a) is ﬁnite. We next deﬁne D
0
=
(Σ, Σ
′
0
, d
0
, s, µ
0
). Its set of types is Σ
′
0
:=
¸
a∈Σ
types(D, a). Note that s ∈ Σ
′
0
.
For every τ ∈ types(D, a), set µ
0
(τ) = a. In d
0
, the righthand side of the rule
for each types(wa) is the disjunction of all d(a
i
) for a
i
∈ types(wa), with each
b
j
in d(a
i
) replaced by types(wab).
We show that properties (a)–(d) hold. Since Σ
′
0
⊆ 2
Σ
′
, we immediately
have that (a) holds. The RE(#, &) expressions that we constructed in D
0
are
unions of a linear number of RE(#, &) expressions in D, but have types in
2
Σ
′
rather than in Σ
′
. Hence, the size of the RE(#, &) expressions in D
0
is at
150 Optimizing XML Schema Languages
most quadratic in the size of D. Finally, we note that it has been shown in
Theorem 7.1 in [77] that (c) and (d) also hold.
It remains to argue that it can be decided in expspace that L(D
0
) ⊆
L(D). A direct application of the expspace algorithm in Theorem 94(1)
leads to a 2expspace algorithm to test whether L(D
0
) ⊆ L(D), due to the
computation of C
1
. Indeed, the algorithm remembers, given the EDTDs D
0
=
(Σ, Σ
′
0
, d
0
, s
0
, µ
0
) and D = (Σ, Σ
′
, d, s, µ), all possible pairs (C
1
, C
2
) such that
there exists a tree t with C
1
= {τ ∈ Σ
′
0
 t ∈ L((D
0
, τ))} and C
2
= {τ ∈
Σ
′
 t ∈ L((D, τ))}. It then accepts if there exists such a pair (C
1
, C
2
) with
s
0
∈ C
1
and s ∈ C
2
. However, when we use nondeterminism, notice that it is
not necessary to compute the entire set C
1
. Indeed, as we only test whether
there exist elements in C
1
in the entire course of the algorithm, we can adapt
the algorithm to compute pairs (c
1
, C
2
), where c
1
is an element of C
1
, rather
than the entire set. Since nexpspace = expspace, we can use this adaption
to test whether L(D
0
) ⊆ L(D) in expspace.
Finally, we give the algorithm which decides whether an EDTD D = (Σ, Σ
′
,
d, s, µ) is equivalent to a DTD. We compute a DTD (Σ, d
0
, s
d
) which is equiv
alent to D if and only if L(D) is deﬁnable by a DTD. Thereto, let for each
a
i
∈ Σ
′
, r
a,i
be the expression obtained from d(a
i
) by replacing each symbol
b
j
in d(a
i
) by b. For every a ∈ Σ, deﬁne d
0
(a) =
¸
a
i
∈Σ
′ r
a,i
. Again, it is
shown in [77] that L(D) = L(d
0
) if and only if L(D) is deﬁnable by a DTD.
By Theorem 94(1) and since d
0
is of size polynomial in the size of D, this can
be tested in expspace.
9
Handling NonDeterministic
Regular Expressions
At several places already in this thesis we have discussed the requirement in
DTD and XML Schema of regular expressions to be deterministic. Unfortu
nately, the Unique Particle Attribution (UPA) constraint, as determinism is
called in XML Schema, is a highly nontransparent one. The sole motivation
for this restriction is backward compatibility with SGML, a predecessor of
XML, where it was introduced for the reason of fast unambiguous parsing of
content models (without lookahead) [103]. Sadly this notion of unambiguous
parsing is a semantic rather than a syntactic one, making it diﬃcult for de
signers to interpret. Speciﬁcally, the XML Schema speciﬁcation mentions the
following deﬁnition of UPA:
A content model must be formed such that during validation of an
element information item sequence, the particle component con
tained directly, indirectly or implicitly therein with which to at
tempt to validate each item in the sequence in turn can be uniquely
determined without examining the content or attributes of that
item, and without any information about the items in the remain
der of the sequence. [italics mine]
In most books (c.f. [103]), the UPA constraint is usually explained in terms
of a simple example rather than by means of a clear syntactical deﬁnition.
The latter is not surprising as to date there is no known easy syntax for de
terministic regular expressions. That is, there are no simple rules a user can
152 Handling NonDeterministic Regular Expressions
apply to deﬁne only (and all) deterministic regular expressions. So, when after
the schema design process, one or several content models are rejected by the
schema checker on account of being nondeterministic, it is very diﬃcult for
nonexpert
1
users to grasp the source of the error and almost impossible to
rewrite the content model into an admissible one. The purpose of the present
chapter is to investigate methods for transforming nondeterministic expres
sions into concise and readable deterministic ones deﬁning either the same
language or constituting good approximations. We propose the algorithm
supac (Supportive UPA Checker) which can be incorporated in a responsive
XSD tester which in addition to rejecting XSDs violating UPA also suggests
plausible alternatives. Consequently, the task of designing an XSD is relieved
from the burden of the UPA restriction and the user can focus on designing
an accurate schema. In addition, our algorithm can serve as a plugin for any
model management tool which supports export to XML Schema format [6].
Deterministic regular expressions were investigated in a seminal paper by
Br¨ uggemannKlein and Wood [17]. They show that deciding whether a given
regular expression is deterministic can be done in quadratic time. In addition,
they provide an algorithm, that we call bkw
dec
, to decide whether a regular
language can be represented by a deterministic regular expression. bkw
dec
runs in time quadratic in the size of the minimal deterministic ﬁnite automaton
and therefore in time exponential in the size of the regular expression. We
prove in this chapter that the problem is hard for pspace thereby eliminating
much of the hope for a theoretically tractable algorithm. We tested bkw
dec
on a large and diverse set of regular expressions and observed that it runs
very fast (under 200ms for expressions with 50 alphabet symbols). It turns
out that, for many expressions, the corresponding minimal DFA is quite small
and far from the theoretical worstcase exponential size increase. In addition,
we observe that bkw
dec
is ﬁxedparameter tractable in the maximal number
of occurrences of the same alphabet symbol. As this number is very small
for the far majority of realworld regular expressions [9], applying bkw
dec
in
practice should never be a problem.
Deciding existence of an equivalent deterministic regular expression or ef
fectively constructing one, are entirely diﬀerent matters. Indeed, while the
decision problem is in exptime, Br¨ uggemannKlein and Wood [17] provide
an algorithm, which we will call bkw, which constructs deterministic regular
expressions whose size can be double exponential in the size of the original
expression. In this chapter, we measure the size of an expression as the total
number of occurrences of alphabet symbols. The ﬁrst exponential size increase
stems from creating the minimal deterministic automaton A
r
equivalent to the
1
In formal language theory.
153
given nondeterministic regular expression r. The second one stems from trans
lating the automaton into an expression. Although it is unclear whether this
double exponential size increase can be avoided, examples are known for which
a single exponential blowup is necessary [17]. We deﬁne an optimized version
of bkw, called bkwopt, which optimizes the second step in the algorithm
and can produce exponentially smaller expressions than bkw. Unfortunately,
the obtained expressions can still be very large. For instance, as detailed in
the experiments section, for input expressions of size 15, bkw and bkwopt
generate equivalent deterministic expressions of average size 1577 and 394,
respectively. To overcome this, we propose the algorithm grow. The idea
underlying this algorithm is that small deterministic regular expressions cor
respond to small Glushkov automata [17]: indeed, every deterministic regular
expression r can be translated in a Glushkov automaton with as many states
as there are alphabet symbols in r. Therefore, when the minimal automaton
A
r
is not Glushkov, grow tries to extend A
r
such that it becomes Glushkov.
To translate the Glushkov automaton into an equivalent regular expression,
we use the existing algorithm rewrite [9]. Our experiments show that when
grow succeeds in ﬁnding a small equivalent deterministic expression its size
is always roughly that of the input expression. In this respect, it is greatly
superior to bkw and bkwopt. Nevertheless, its success rate is inversely
proportional to the size of the input expression (we refer to Section 9.4.2 for
details).
Next, we focus on the case when no equivalent deterministic regular expres
sion can be constructed for a given nondeterministic regular expression and
an adequate superapproximation is needed. We start with a fairly negative
result: we show that there is no smallest superapproximation of a regular
expression r within the class of deterministic regular expressions. That is,
whenever L(r) L(s), and s is deterministic, then there is a deterministic
expression s
′
with L(r) L(s
′
) L(s). We therefore measure the proximity
between r and s relative to the strings up to a ﬁxed length. Using this measure
we can compare the quality of diﬀerent approximations. We consider three al
gorithms. The ﬁrst one is an algorithm of Ahonen [3] which essentially repairs
bkw by adding edges to the minimal DFA whenever it gets stuck. The sec
ond algorithm operates like the ﬁrst one but utilizes grow rather than bkw
to generate the corresponding deterministic regular expression. The third
algorithm, called shrink, merges states, thereby generalizing the language,
until a regular language is obtained with a corresponding concise determinis
tic regular expression. For the latter, we again make use of grow, and of the
algorithm koatokore of [8] which transforms automata to concise regular
expressions. In our experimental study, we show in which situation which of
the algorithms works best.
154 Handling NonDeterministic Regular Expressions
Finally, based on the experimental assessment, we propose the algorithm
supac (supportive UPA checker) for handling nondeterministic regular ex
pressions. supac makes use of several of the aforementioned algorithms and
can be incorporated in a responsive XSD checker to automatically deal with
the UPA constraint.
Related Work. Although XML is accepted as the de facto standard for data
exchange on the Internet and XML Schema is widely used, fairly little atten
tion has been devoted to the study of deterministic regular expressions. We
already mentioned the seminal paper by BruggemannKlein and Wood [17].
Computational and structural properties were addressed by Martens, Neven,
and Schwentick [75]. They show that testing nonemptiness of an arbitrary
number of intersections of deterministic regular expressions is pspacecomplete.
Bex et al. investigated algorithms for the inference of regular expressions from
a sample of strings in the context of DTDs and XML Schemas [8, 9, 11, 12].
From this investigation resulted two algorithms: rewrite [9] which trans
forms automata with n states to equivalent expressions with n alphabet sym
bols and fails when no such expression exists; and the algorithm koatokore
[8, 9] which operates as rewrite with the diﬀerence that it always returns a
concise expression at the expense of generalizing the language when no equiv
alent concise expression exists.
Deciding determinism of expressions containing numerical occurrences was
studied by Kilpel¨ ainen and Tuhkanen [66]. The complexity of syntactic sub
classes of the deterministic regular expressions with counting has also been
considered [43, 44]. In the context of streaming the notion of determinism and
kdeterminism was used in [68] and [20].
The presented work would clearly beneﬁt from algorithms for regular ex
pression minimization. To the best of our knowledge, no such (eﬃcient) al
gorithms exist for deterministic regular expressions, for which minimization is
in np. Only a sound and complete rewriting system is available for general
regular expressions [95], for which minimization is pspacecomplete.
Outline. The outline of the chapter is as follows. In Section 9.1, we discuss
the complexity of deciding determinism of the underlying regular language.
In Section 9.2 and 9.3, we discuss the construction of equivalent and super
approximations of regular expressions, respectively. We present an experimen
tal validation of our algorithms in Section 9.4. We outline a supportive UPA
checker in Section 9.5.
9.1. Deciding Determinism 155
9.1 Deciding Determinism
The ﬁrst step in creating a responsive UPA checker is testing whether L(r) is
deterministic. Br¨ uggemannKlein and Wood obtained an exptime algorithm
(in the size of the regular expression) which we will refer to as bkw
dec
:
Theorem 98 ([17]). Given a regular expression r, the algorithm bkw
dec
de
cides in time quadratic in the size of the minimal DFA corresponding to r
whether L(r) is deterministic.
We show that, unless pspace = ptime, there is no hope for a theoretically
tractable algorithm.
Theorem 99. Given a regular expression r, the problem of deciding whether
L(r) is deterministic is pspacehard.
Proof. We reduce from the corridor tiling problem, introduced in Sec
tion 5.1. However, we restrict ourselves to those tiling instances for which
there exists at most one correct corridor tiling. Notice that we can assume
this without loss of generality: From the master reduction from Turing Ma
chine acceptance to corridor tiling in [104], it follows that the number
of correct tilings of the constructed tiling system is precisely the number of
accepting runs of the Turing Machine on its input word. As the acceptance
problem for polynomial space bounded Turing Machines is already pspace
complete for deterministic machines, we can assume w.l.o.g. that the input
instance of corridor tiling has at most one correct corridor tiling.
Now, let T be a tiling instance for which there exists at most one correct
tiling. We construct a regular expression r, such that L(r) is deterministic
if and only if there does not exist a corridor tiling for T. Before giving the
actual deﬁnition of r, we give the language it will deﬁne and show this is
indeed deterministic if and only if corridor tiling for T is false. We encode
corridor tilings by a string in which the diﬀerent rows are separated by the
symbol $, that is, by strings of the form
$R
1
$R
2
$ · · · $R
m
$
in which each R
i
represents a row and is therefore in X
n
. Moreover, R
1
is the
bottom row and R
n
is the top row.
Then, let Σ = X ⊎ {a, $, #} and for a symbol b ∈ Σ, let Σ
b
denote Σ \
{b}. Then, L(r) = Σ
∗
\ {w
1
#w
2
 w
1
encodes a valid tiling for T and w
2
∈
Σ
∗
#
Σ
a,#
Σ
#
}. First, if there does not exist a valid tiling for T, then L(r) = Σ
∗
and thus L(r) is deterministic. Conversely, if there does exist a valid corridor
tiling for T, then by our assumption, there exists exactly one. A DFA for L(r)
156 Handling NonDeterministic Regular Expressions
is graphically illustrated in Figure 9.1. Notice that this DFA is the minimal
DFA if and only if w
1
exists. By applying the algorithm of Br¨ uggemann
Klein and Wood (Algorithm 4), it is easily seen that L(r) is not deterministic.
Indeed, Algorithm 4 gets immediately stuck in line 17, where it sees that the
gates in the orbit consisting of the three rightmost states in Figure 9.1 are not
all ﬁnal states. Hence, this minimal DFA does not satisfy the orbit property.
Our regular expression r now consists of the disjunction of the following
regular expressions:
2
• Σ
∗
#Σ
∗
#Σ
∗
+ Σ
∗
#
: This expression detects strings that do not have ex
actly one occurrence of #.
• Σ
∗
#Σ?: This expression detects strings that have # as last or second to
last symbol.
• Σ
∗
#
Σ
∗
aΣ: This expression detects strings that have a as second to last
symbol.
• Σ
∗
aΣ
∗
#Σ
∗
: This expression detects strings that have an a before the
#sign.
• Σ
$
Σ
∗
+ Σ
∗
Σ
$
#Σ
∗
: This expression detects strings that do not have a
$sign as the ﬁrst or last element of their encoding.
• Σ
∗
$Σ
[0,n−1]
$
$Σ
∗
#Σ
∗
+Σ
∗
$Σ
[n+1,n+1]
$
Σ
∗
$
$Σ
∗
#Σ
∗
: This expression detects
all string in which a row in the tiling encoding is too short or too long.
• Σ
∗
x
1
x
2
Σ
∗
#Σ
∗
, for every x
1
, x
2
∈ X, (x
1
, x
2
) / ∈ H: These expressions
detect all violations of horizontal constraints in the tiling encoding.
• Σ
∗
x
1
Σ
n
x
2
Σ
∗
#Σ
∗
, for every x
1
, x
2
∈ X, (x
1
, x
2
) / ∈ V : These expressions
detect all violations of vertical constraints in the tiling encoding.
• Σ
i+1
Σ
b
i
Σ
∗
#Σ
∗
for every 1 ≤ i < n: These expressions detect all tilings
which do not have b as the bottom row in the tiling encoding.
• Σ
∗
Σ
t
i
Σ
n−i
#Σ
∗
for every 1 ≤ i < n: These expressions detect all tilings
which do not have t as the top row in the tiling encoding.
Finally, it is easily veriﬁed that L(r) is deﬁned correctly.
It is unclear whether the problem itself is in pspace. Simply guessing
a deterministic regular expression s and testing equivalence with r does not
work as the size of s can be exponential in the size of r (see also Theorem 102).
Next, we address the problem from the viewpoint of parameterized com
plexity [34], where an additional parameter k is extracted from the input r.
Then, we say that a problem is ﬁxed parameter tractable if there exists a com
putable function f and a polynomial g such that the problem can be solved in
2
Notice that r itself does not have to be deterministic.
9.2. Constructing Deterministic Expressions 157
start
w
1
Σ
$
#
Σ
#
Σ
a,#
a
#
Σ
#
#
Σ
a,#
a
#
Σ
Figure 9.1: DFA for L(r) in the proof of Theorem 99.
time f(k) · g(r). Intuitively, this implies that, if k is small and f is reason
able, the problem is eﬃciently solvable. We now say that an expression r is a
koccurrence regular expression (kORE) if every alphabet symbol occurs at
most k times in r. For example, ab(a
∗
+c) is a 2ORE because a occurs twice.
Proposition 100. Let r be a kORE. The problem of deciding whether L(r)
is deterministic is ﬁxed parameter tractable with parameter k. Speciﬁcally, its
complexity is O(2
k
2
r
2
).
Proof. Let r be a kORE. By applying a Glushkov construction (see, e.g.,
[17]), followed by a subset construction, it is easy to see that the resulting
DFA has at most 2
k
· Σ states. By Theorem 98, the result follows.
This result is not only of theoretical interest. It has already often been
observed that the vast majority of regular expressions occurring in practice
are kOREs, for k = 1, 2, 3 (see, e.g., [77]). Hence, this result implies that in
practice the problem can be solved in polynomial time.
Corollary 101. For any ﬁxed k, the problem of deciding whether the language
deﬁned by a kORE is deterministic is in ptime.
9.2 Constructing Deterministic Expressions
Next, we focus on constructing equivalent deterministic regular expressions.
Unfortunately, the following result by Br¨ uggemannKlein and Wood already
rules out a truly eﬃcient conversion algorithm:
Theorem 102 ([17]). For any n ∈ N, there exists a regular expression r
n
of
size O(n) such that any deterministic regular expression deﬁning L(r
n
) is of
size at least 2
n
.
158 Handling NonDeterministic Regular Expressions
Algorithm 2 Algorithm grow, with pool size P and expansion size E.
Require: P, E ∈ N, minimal DFA A = (Q, Σ, δ, q
0
, F) with L(A) determin
istic,
Ensure: Det. reg. exp. s with L(s) = L(A), if successful
for i = 0 to E do
2: Generate at most P nonisomorphic DFAs B s.t.
L(B) = L(A) and B has Q +i states
4: for each such B do
if rewrite (B) succeeds then
6: return rewrite (B)
if i > E then fail
9.2.1 Growing automata
We ﬁrst present grow as Algorithm 2, which is designed to produce con
cise deterministic expressions. The idea underlying this algorithm is that the
Glushkov construction [17] transforms small deterministic regular expressions
to small deterministic automata with as many states as there are alphabet
symbols in the expression. The minimization algorithm eliminates some of
these states, complicating the inverse Glushkovrewriting from DFA to deter
ministic regular expression. By expanding the minimal automaton, growtries
to recuperate the eliminated states. The algorithm rewrite of [9] succeeds
when the modiﬁed automaton can be obtained from a deterministic regular
expression by the Glushkov construction and assembles this expression upon
success. As their are many DFAs equivalent to the given automaton A, we
only enumerate nonisomorphic expansions of A up to a given number of ex
tra states E. However, the number of generated nonisomorphic DFAs can
explode quickly. Therefore, the algorithm is also given a given pool size P
which restricts the number of DFAs of each size which are generated.
Nonwithstanding the harsh brute force ﬂavor of grow, we show in our
experimental study that the algorithm can be quite eﬀective.
9.2.2 Enumerating Automata
The grow algorithm described above uses an enumeration algorithm as a
subroutine. In this section, we describe this algorithm which, given a min
imal DFA M and a size k, eﬃciently enumerates all nonisomorphic DFAs
equivalent to M of size at most k. Thereto, we ﬁrst provide some deﬁnitions.
Throughout this section, we assume that any DFA is trimmed, i.e., contains
no useless transitions. For a DFA A, we always denote its set of states by Q
A
,
transitions by δ
A
, initial state by q
A
, and set of ﬁnal states by F
A
. Finally, we
9.2. Constructing Deterministic Expressions 159
assume that the set of states Q of any DFA consists of an initial segment of
the integers. That is, if Q = n, then Q = {0, . . . , n −1}.
Let A and B be DFAs. Then, A and B are isomorphic, denoted A
∼
= B,
if there exists a bijection f : Q
A
→ Q
B
, such that (1) f(q
A
) = q
B
, (2) for all
q ∈ Q
A
, we have q ∈ F
A
if and only if f(q) ∈ F
B
, and (3) for all q, p ∈ Q
A
and a ∈ Σ, it holds that (q, a, p) ∈ Q
A
if and only if (f(q), a, f(p)) ∈ Q
B
. We
then also say that f witnesses A
∼
= B. Notice that, because we are working
with DFAs, there is at most one such bijection which, moreover, can easily be
constructed. Indeed, we must set f(q
A
) = q
B
. Then, because q
A
has at most
one outgoing transition for each alphabet symbol this, in turn, deﬁnes f(q)
for all q to which q
A
has an outgoing transition. By continuing recursively in
this manner, one obtains the unique bijection witnessing A
∼
= B
Further, we deﬁne a function bf : Q
A
→ [1, Q
A
] which assigns to every
state q ∈ Q
A
its number in a breadthﬁrst traversal of A, starting at q
A
. Here,
we ﬁx an order on the alphabet symbols (say, lexicographic ordering), and
require that in the breadthﬁrst traversal of the DFA transitions are followed
in the order of the symbol by which they are labeled. This ensures that
bf is unambiguously deﬁned. We then say that A is canonical if bf(i) = i,
for all i ∈ Q
A
(recall that the state labels are always an initial part of the
natural numbers). That is, A is canonical if and only if the names of its
states correspond to their number in the breadthﬁrst traversal of A. The
latter notion of canonical DFAs essentially comes from [4], although there it
is presented in terms of a string representation. The following simple lemma
is implicit in their paper.
Lemma 103. 1. Let A be a DFA. Then, there exists a canonical DFA B
such that A
∼
= B.
2. Let A, B be canonical DFAs. Then, A = B if and only if A
∼
= B.
Now, let M be a minimal DFA, and k an integer and deﬁne S = {A 
L(A) = L(M), Q
A
 ≤ k, A is canonical}. Then, by the above lemma, S con
tains exactly one isomorphic copy of any DFA A equivalent to M of size at
most k. Hence, given M and k, the goal of this section reduces to give an
algorithm which computes the set S.
To do so eﬃciently, we ﬁrst need some insight in the connection between
a minimal DFA and its larger equivalent versions. If A is a DFA, and q ∈ Q
A
,
recall that A
q
denotes the DFA A in which q is the initial state. The following
lemma then states a few basic facts about minimal DFAs (see, e.g., [55]).
Lemma 104. Let A be a DFA, and M its equivalent minimal DFA.
• For all q, p ∈ Q
M
, if q = p, then L(M
q
) = L(M
p
).
160 Handling NonDeterministic Regular Expressions
• For all q ∈ Q
A
, there exists a (unique) p ∈ Q
M
, such that L(A
q
) =
L(M
p
).
Let minimal : Q
A
→ Q
M
be the function which takes the state q ∈ Q
A
to
p ∈ Q
M
, when L(A
q
) = L(M
p
). According to the above lemma, this function
is welldeﬁned. We also say that q is a copy of p. Hence, for each state p ∈ Q
M
,
there is at least one copy in A, although there can be more. Further, for any
p
′
∈ Q
M
, and a ∈ Σ, if (p, a, p
′
) ∈ δ
M
, then there exists a q
′
∈ Q
A
, such
that minimal(q
′
) = p
′
, and (q, a, q
′
) ∈ Q
A
. That is, any q ∈ Q
A
is a copy
of minimal(q) ∈ Q
M
, and, furthermore, q’s outgoing transitions are copies
of those of minimal(q), again to copies of the appropriate target states in M.
Using this knowledge, Algorithm 3 enumerates the desired automata. For ease
of exposition, it is presented as a nondeterministic algorithm in which each
trace only outputs one DFA. It can easily be modiﬁed, using recursion, to an
algorithm outputting all these automata, and hence the desired set.
Theorem 105. Given a minimal DFA M, and k ≥ Q
M
, the set S of all
automata returned by enumerate, consists of pairwise nonisomorphic DFAs
of size at most k. Further, for any DFA B of size at most k and equivalent to
M, there is a B
′
∈ S such that B
∼
= B
′
.
Proof. We ﬁrst argue that given M, and k ≥ Q
M
, enumerate returns the
set S = {A  Q
A
 < k, L(A) = L(M), A is canonical}. Due to Lemma 103,
the set S satisﬁes the criteria of the theorem.
Let us ﬁrst show that enumerate only produces automata in this set S, i.e.
any DFA A generated by enumerate has Q
A
 < k, is equivalent to M, and is
canonical. The reason that A is canonical is because the enumeration explicitly
follows the breadthﬁrst traversal of the automaton it is generating. It is
equivalent to M because for any state q ∈ Q
A
, we maintain the corresponding
state minimal(q) ∈ Q
M
and give q outgoing transitions to copies of the states
where minimal(q) also has transitions to. This guarantees that for all q ∈
Q
A
, we have L(A
q
) = L(M
minimal(q)
), and, in particular L(A) = L(A
q
a
) =
L(M
q
M
) = L(M). Finally, Q
A
 < k, due to the test on Line 12, which
quarantees that never more than k states will be generated.
Conversely, we must show that any automaton in S is generated by enu
merate, i.e. any canonical A equivalent to M with at most k states is gen
erated by enumerate. Let A be such a DFA. By Lemma 104 we know it
consists of a number of copies of each state of M which are linked to each
other in a manner consistent with M. Given this information, enumerate
follows all possible breadthﬁrst paths generating such DFAs, and hence also
generates A.
9.2. Constructing Deterministic Expressions 161
Algorithm 3 Algorithm enumerate.
Require: Minimal DFA M = (Q
M
, δ
M
, q
M
, F
M
), and k ≥ Q
M

Ensure: Canonical DFA A, with L(A) = L(M) and Q
A
 ≤ k.
Q
A
, δ
A
, F
A
← ∅
2: Queue P ← ∅
q
A
← 0
4: Push q
A
onto P
Q
A
← Q
A
⊎ {q
A
}
6: Set minimal(q
A
) = q
M
while P = ∅ do
8: q ← P.Pop
p ← minimal(q)
10: for a ∈ Σ do
if ∃p
′
∈ Q
M
: (p, a, p
′
) ∈ δ
M
then
12: Choose q
′
nondeterministically such that either q
′
= Q
A
 (if Q
A
 <
k), or such that q
′
< Q
A
 and minimal(q
′
) = p
′
.
If no such q
′
exists: fail.
14: if q
′
= Q
A
 then
Push q
′
onto P
16: Set minimal(q
′
) = p
′
Q
A
← Q
A
⊎ q
′
18: δ
A
← δ
A
⊎ (q, a, q
′
)
if p
′
∈ F
M
then
20: F
A
← F
A
∪ q
′
return A
We conclude by noting that, by using a careful implementation, the algo
rithm actually runs in time linear in the total size of the output. Thereto,
the algorithm should never fail, but this can be accomplished by maintaining
some additional information (concerning the number of states which must still
be generated) which allows to avoid nondeterministic choices which will lead
to failure.
9.2.3 Optimizing the BKWAlgorithm
Next, we discuss Br¨ uggemannKlein and Woods bkw algorithm and then
present a few optimizations to generate smaller expressions.
First, we need some terminology. Given a DFA A, a symbol a is A
consistent if there is a unique state w(a) in A such that all ﬁnal states of
A have an atransition to w(a). We call w(a) the witness state for a. A set S
is Aconsistent if each element in S is Aconsistent. The Scut of A, denoted
162 Handling NonDeterministic Regular Expressions
Algorithm 4 The bkwAlgorithm.
Require: Minimal DFA A = (Q, Σ, δ, q
0
, F)
Ensure: Det. reg. exp. s with L(s) = L(A)
if A has only one state q and no transitions then
2: if q is ﬁnal then return ε
else return ∅
4: else if A has precisely one orbit then
S ← Aconsistent symbols
6: if S = ∅ then fail
else return bkw(A
S
) ·
¸
a∈S
a · bkw(A
w(a)
S
)
∗
8: else
if A has the orbit property then
10: for all a s.t. Orbit(q
0
) has outgoing atransition do
q
a
← unique target state of these atransitions
12: A
q
0
← orbit automaton of q
0
if A
q
0
contains a ﬁnal state then
14: return bkw(A
q
0
) ·
¸
a∈Σ
(a · bkw(A
q
a
))
?
else
16: return bkw(A
q
0
) ·
¸
a∈Σ
(a · bkw(A
q
a
))
else fail
by A
S
, is the automaton obtained from A by removing, for each a ∈ S, all
atransitions that leave a ﬁnal state of A. Given a state q of A, A
q
is the
automaton obtained from A by setting its initial state to q and restricting its
state set to the states reachable from q. For a state q, the orbit of q, denoted
Orbit(q), is the strongly connected component of A that contains q. We call
q a gate of Orbit(q) if q is ﬁnal, or q is the source of a transition that has a
target outside Orbit(q).
We say that A has the orbit property if, for every pair of gates q
1
, q
2
in the
same orbit the following properties hold:
1. q
1
is ﬁnal if and only if q
2
is ﬁnal; and,
2. for all a ∈ Σ and states q outside the orbit of q
1
and q
2
, there is a
transition (q
1
, a, q) if and only if there is a transition (q
2
, a, q).
Given a state q of A, the orbit automaton of q, denoted by A
q
, is obtained
from A by restricting its state set to Orbit(q), setting its initial state to q and
by making the gates of Orbit(q) its ﬁnal states.
The bkwAlgorithm is then given as Algorithm 4. For a regular expression
r, the algorithm is called with the minimal complete DFA A accepting L(r)
9.2. Constructing Deterministic Expressions 163
and then recursively constructs an equivalent deterministic expression when
one exists and fails otherwise. Algorithm 4 can fail in two places: (1) in
line 6, when the set of Aconsistent symbols is empty and (2) in line 17, if
A does not have the orbit property. Notice that, if A has the orbit property,
the unique state q
a
on line 11 can always be found. The correctness proof is
nontrivial and can be found in [17]. bkw runs in time double exponential
in the size of the nondeterministic regular expression. The ﬁrst exponential
arises from converting the given regular expression to a DFA, the second one
from branching in the lines 7, 14, and 16. The generated expressions can
therefore be quite large. As Algorithm 4 was not designed with conciseness of
regular expressions in mind, we therefore propose three optimizations resulting
in smaller expressions.
To this end, by slight abuse of notation, let ﬁrst(A) denote the set {a 
∃w ∈ Σ
∗
, aw ∈ L(A)}, i.e., the set of possible ﬁrst symbols in a string in L(A).
We adapt the lines 7, 14, and 16 in Algorithm 4 in the way described below
and refer to the modiﬁed algorithm as bkwopt.
line 7 Now, A consists of one orbit and S is the set of Aconsistent symbols:
• If L(A
S
) = L(A
w(a)
S
) for all a ∈ S, ε ∈ L(A
S
), and ﬁrst(A
S
) ∩ S = ∅,
then return ((S +ε) · bkw(A
S
))
∗
.
• Else, partition S into equivalence classes S
1
, . . . , S
n
where for a, b ∈ S,
a is equivalent to b if and only if w(a) = w(b). Furthermore, let, for
each i ∈ {1, . . . , n}, a
i
be an arbitrary but ﬁxed element from S
i
. Then,
return bkw(A
S
) ·
¸
1≤i≤n
S
i
· bkw(A
w(a
i
)
S
i
)
∗
.
line 14 and 16 If A consists of more than one orbit, we can view A as an
acyclic DFA when considering every orbit as an atomic subautomaton. We
therefore deﬁne the acyclic DFA summary automaton summ(A) of A where
every state corresponds to a unique orbit. As these automata are usually quite
small, we subsequently apply grow to obtain a concise regular expression over
an alphabet consisting of Σsymbols and orbit identiﬁers. We then replace
each orbit identiﬁer by its corresponding, recursively obtained, deterministic
expression.
Before deﬁning summary automata formally, we present an example. Fig
ure 9.2(a) illustrates a DFA A with three orbits: {1}, {2, 3, 4}, and {5}. Orbit
{2, 3, 4} has two possible entry points: states 2 (with an atransition) and 3
(with the b and ctransitions). For each such entry point we have a state
in the summary automaton. Figure 9.2(b) presents the summary automaton
summ(A).
164 Handling NonDeterministic Regular Expressions
1 start
2
3
4 5
a
b, c
d
e
f
g
(a) DFA A.
O
1
start
O
2
O
3
O
5
(a, 2)
(b, 3), (c, 3)
(g, 5)
(g, 5)
(b) summ(A)
Figure 9.2: A DFA and its summary automaton.
start
· · ·
· · ·
a
1
a
2
a
3
a
4
a
5
a
6
a
n
−
5
a
n
−
4
a
n
−
3
a
n
−
2
a
n
−
1
a
n
Figure 9.3: Class of DFAs for which our optimization improves exponentially
over the bkw algorithm.
Formally, let A = (Q, δ, q
0
, F) over alphabet Σ. Then we deﬁne summ(A)
as a DFA (Q
s
, δ
s
, q
s
0
, F
s
) over alphabet Σ
s
⊆ Σ × Q. In particular, for each
transition (q
1
, a, q
2
) ∈ δ where Orbit(q
1
) = Orbit(q
2
), we have (a, q
2
) ∈ Σ
s
.
The state set Q
s
is deﬁned as {O
q
 there is a transition (p, a, q) ∈ δ for
p outside Orbit(q)}. Furthermore, we deﬁne q
s
0
:= O
q
0
, F
s
:= {O
p
∈ Q
s

Orbit(p) ∩ F = ∅}, and (O
q
1
, (a, q
2
), O
q
2
) ∈ δ
s
if and only if Orbit(q
1
) =
Orbit(q
2
) and there exists a q
′
1
∈ Orbit(q
1
) such that (q
′
1
, a, q
2
) ∈ δ. Notice
that, if A is a DFA fulﬁlling the orbit property, all outgoing transitions of each
orbit go to the same witness state. Therefore, summ(A) is also a DFA.
To ﬁnd a small regular expression for the multiple orbits case, we make use
of the deterministic expressions r
q
which are obtained by applying bkwopt
recursively to the orbit automata A
q
. We run grow on summ(A) to ﬁnd a
small deterministic expression for L(summ(A)). If we ﬁnd one, we obtain the
deterministic expression for L(A) by replacing each symbol (a, q) by a · r
q
.
Notice that this optimization potentially generates exponentially smaller
regular expressions than the bkw algorithm. Consider the family of DFAs of
Figure 9.3. The summary DFAs for these automata are equal to the DFAs
themselves. While the bkw algorithm would essentially unfold this DFA and
9.3. Approximating Deterministic Regular Expressions 165
return a regular expression of size at least 2
n
, grow would return the expres
sion (a
1
a
3
+a
2
a
4
) · · · (a
n−3
a
n−1
+a
n−2
a
n
), which is linear in n.
It is shown in [53] that there are acyclic DFAs whose smallest equivalent
regular expression is of superpolynomial size: Ω(n
log n
) for n the size of the
DFA. As acyclic DFAs deﬁne ﬁnite languages and ﬁnite languages are deter
ministic, the result transfers to deterministic regular expressions. Hence, it is
impossible for grow to always return a (small) result. Therefore, when grow
does not ﬁnd a solution we just apply one nonoptimized step of the bkw al
gorithm (i.e., return line 14/16 of Algorithm 4). However, in our experiments
we noticed that this almost never happened (less than 1% of the total calls to
grow did not return an expression).
9.3 Approximating Deterministic Regular Expres
sions
When the regular language under consideration is not deterministic, we can
make it deterministic at the expense of generalizing the language. First, we
show that there is no best approximation.
9.3.1 Optimal Approximations
An expression s is a deterministic superapproximation of an expression r when
L(r) ⊆ L(s) and s is deterministic. In the sequel we will just say approxima
tion rather than superapproximation. Then, we say that s is an optimal
deterministic approximation of r, if L(r) ⊆ L(s), and there does not exist a
deterministic regular expression s
′
such that L(r) ⊆ L(s
′
) L(s). That is, an
approximation is optimal if there does not exist another one which is strictly
better. Unfortunately, we can show that no such optimal approximation exists:
Theorem 106. Let r be a regular expression, such that L(r) is not deter
ministic. Then, there does not exist an optimal deterministic approximation
of r.
Proof. We show in Lemma 107 below that for every deterministic language L
and string w ∈ L, the language L \ {w} is also deterministic. Now, suppose,
towards a contradiction, that an optimal deterministic approximation s exists.
Then, since L(r) is not deterministic, L(r) L(s) and thus there exists some
string w with w ∈ L(s) but w / ∈ L(r). But then, for the language L
w
= L(s) \
{w}, we have that L(r) ⊆ L
w
and, by Lemma 107, L
w
is also deterministic.
This gives us the desired contradiction.
166 Handling NonDeterministic Regular Expressions
Lemma 107. For every deterministic language L and string w ∈ L, the
language L \ {w} is also deterministic.
Proof. By u we denote the length of string u. We deﬁne the preﬁxlanguage
L
≤w
= {u ∈ L  u < w}∪{u ∈ Σ
∗
 u = w, ∃v.uv ∈ L}. As L
≤k
is a ﬁnite
language, one can easily construct a deterministic regular expression for it con
sisting of nested disjunctions. For instance, the set L
≤w
= {aab, ab, baa, bba}
can be deﬁned by aa(b +ε) +b(aa+ba). Denote the resulting expression by r.
Further, we note that deterministic regular languages are closed under deriva
tives [17]. Speciﬁcally, for a string u, the uderivative of a regular expression
s is a regular expression s
′
that deﬁnes the set {v  uv ∈ L(s)}. For every
string u ∈ L
≤w
with u = w and u = w, let r
u
be a deterministic expression
deﬁning the uderivative of L. Notice that the uderivative can be ε. Now
for u = w, let r
w
deﬁne the wderivative of L minus ε. It is shown in [17]
(Theorem D.4) that, for every deterministic regular language L
′
, L
′
− {ε} is
also deterministic. Hence, r
w
can also be written as a deterministic regular
expression. Now, the expression deﬁning L \ {w} is obtained by adding, for
each u ∈ L
≤w
with u = w, r
u
after the last symbol of u in r.
As ﬁnite languages are always deterministic, Theorem 106 implies that ev
ery approximation deﬁnes inﬁnitely more strings than the original expression.
Furthermore, one can prove analogously that an optimal underapproximation
of a nondeterministic regular expression r also does not exist. That is, a
deterministic regular expression s such that L(s) ⊆ L(r) for which there is no
deterministic s
′
with L(s) L(s
′
) ⊆ L(r).
9.3.2 Quality of the approximation
Motivated by the above discussion, we will compare sizes of regular languages
by only comparing strings up to a predeﬁned length.
Thereto, for an expression r and a natural number ℓ, let L
ℓ
(r) be the subset
of strings in L(r) with length exactly ℓ. For regular expressions r and s with
L(r) ⊆ L(s), deﬁne the proximity between r and s, as
proximity(r, s) :=
1
k
k
¸
ℓ=1
L
ℓ
(r) + 1
L
ℓ
(s) + 1
for k = max{2r +1, 2s +1}. The proximity is always a value between 0 and
1. When the proximity is close to 1, the size of the sets L
ℓ
(s) \ L
ℓ
(r) is small,
and the quality of approximation is excellent.
Although the above measure provides us with a tool to compare proximity
of regular languages, we cannot simply search for a deterministic expression
9.3. Approximating Deterministic Regular Expressions 167
which performs best under this measure. It is also important that the obtained
expression is small and thus readable. Indeed, a user might well favor an ex
pression which is understandable, but constitutes only a rough approximation,
over one which is very close to the original expression, but is completely in
comprehensible due to its large size.
In conclusion, a valid approximation s for an expression r, is a deter
ministic expression constituting a good tradeoﬀ between (1) a large value for
proximity(r, s) and (2) a small size s. The heuristics in the following section
will try to construct approximations which ﬁt these requirements.
9.3.3 Ahonen’s Algorithm
In the previous section, we have seen that the bkwalgorithm will translate
a DFA into a deterministic expression, and will fail if no such equivalent de
terministic expression exists. Ahonen’s algorithm [3] is a ﬁrst method that
constructs a deterministic regular expression at the expense of generalizing the
target language. It essentially runs the bkw
dec
algorithm, the decision variant
of bkw which does not produce an output expression (cf. Theorem 98), until
it fails, and subsequently repairs the DFA by adding transitions, making states
ﬁnal, or merging states, in such a manner that bkw
dec
can continue. In the
end, a DFA is produced deﬁning a deterministic language. The corresponding
deterministic regular expression is then obtained by running bkw. Ahonen’s
algorithm for obtaining a DFA deﬁning a deterministic language is presented
in Algorithm 5.
3
By ahonenbkw we then denote the application of bkw on
the result of ahonen.
ahonen proceeds by merging states. We explain in more detail how we can
merge two states in a DFA A = (Q, δ, q
0
, F). For an example, see Figure 9.4(c)
and 9.4(d), where states 2 and 4 are merged into a new state {2, 4}. For ease
of exposition, we assume that states in Q are sets. Initially, all sets in Q are
singletons (e.g., {2}, {4}) and by merging such states we obtain nonsingletons
(e.g., {2, 4}). Let q
1
and q
2
be the two states to be merged into a new state
q
M
:= q
1
∪q
2
. We denote by Q
new
the state set of A after this merge operation
(analogously for δ
new
,q
0
new
, and F
new
). We assume that q
1
, q
2
∈ Q and q
M
/ ∈ Q.
Then, Q
new
:= (Q ∪ {q
M
}) \ {q
1
, q
2
}. Furthermore, q
M
∈ F
new
if and only if
q
1
∈ F or q
2
∈ F. Analogously, q
0
new
is the unique set p ∈ Q
new
such that q
0
∈
p. The transitions are adapted by removing all transitions containing q
1
or q
2
,
but redirecting all these transitions to q
M
. For instance, when (q, a, q
1
) ∈ δ, we
have (q, a, q
M
) ∈ δ
new
. As long as the obtained automaton is not deterministic,
we choose nondeterministic transitions (p, a, q
1
) and (p, a, q
2
) and continue
3
The algorithm we present slightly diﬀers from Ahonen’s original algorithm as the original
algorithm is slightly incorrect. We brieﬂy discuss this at the end of this section.
168 Handling NonDeterministic Regular Expressions
Algorithm 5 An adaptation of Ahonen’s repair algorithm: the ahonen al
gorithm.
Require: DFA A = (Q, Σ, δ, q
0
, F)
Ensure: DFA B such that L(A) ⊆ L(B)
S ← Aconsistent symbols
2: if A has only one state q then
if q is ﬁnal then return ε
4: else return ∅
else if A has precisely one orbit then
6: if S = ∅ then
Choose an a s.t. (q, a, q
1
) ∈ δ for q ﬁnal
8: for all p ∈ F do
add (p, a, q
1
) to δ
10: if (p, a, q
2
) ∈ δ for q
2
= q
1
then
Merge(q
1
,q
2
)
12: S ← {a}
ForceOrbitProperty (A
S
)
14: for each orbit O of A
S
do
Choose an arbitrary q ∈ O
16: ahonen ((A
S
)
q
)
merging states until it is deterministic again. We denote this recursive merging
procedure in Algorithms 5 and 6 by Merge.
ahonen repairs the automaton A in two possible instances where bkw
dec
gets stuck. If A has one orbit but no Aconsistent symbols, ahonen simply
chooses a symbol a and adds transitions to force Aconsistency. If A has
more than one orbit, but does not fulﬁll the orbit property, then ahonen calls
ForceOrbitProperty (Algorithm 6) to add ﬁnal states and transitions until
A fulﬁlls it.
Example 108. We illustrate the algorithm on the regular expression (aba +
a)
+
b. The minimal DFA A is depicted in Figure 9.4(a). As there are no
Aconsistent symbols, we have that S = ∅. As there are three orbits ({1},
{2, 3, 4}, and {5}), the algorithm immediately calls ForceOrbitProperty on A,
where 4 is made ﬁnal and transition (3, b, 5) is added (Figure 9.4(b)). In the
next recursive level, we call ahonen for every orbit of A. We only consider the
nontrivial orbit {2, 3, 4} here, with its orbit automaton A
2
in Figure 9.4(c). As
there are no A
2
consistent symbols and A
2
has only one orbit, we recursively
merge states 2 and 4. (Line 7 gives us a choice of which transition to take,
but any choice would lead us to the merging of 2 and 4.) After the merge, a is
9.3. Approximating Deterministic Regular Expressions 169
Algorithm 6 The ForceOrbitProperty procedure.
Require: DFA A
Ensure: DFA B such that L(A) ⊆ L(B)
for each orbit O of A do
18: Let g
1
, . . . , g
k
be the gates of K
if there exists a gate g
i
∈ F then
20: F ← F ∪ {g
1
, . . . , g
k
}
for each ordered pair of gates (g
i
, g
j
) do
22: while there is an a s.t. (g
i
, a, q) ∈ δ
for q outside Orbit(g
i
) and (g
j
, a, q) / ∈ δ do
24: Add (g
j
, a, q) to δ
while (g
j
, a, q
′
) ∈ δ for q
′
= q do
26: Merge(q,q
′
)
A
2
consistent. We therefore call ForceOrbitproperty on the {a}cut of A
2
as
in Figure 9.4(e). Here, we discover that there are only two trivial orbits left,
and the algorithm ends.
It remains to return from the recursion and construct the resulting DFA
of the algorithm. Plugging the DFA from Figure 9.4(d) into the DFA from
Figure 9.4(b) results in Figure 9.4(f). Notice that this automaton is non
deterministic. Therefore, we have to merge states 3 and 5 in order to restore
determinism. The ﬁnal DFA obtained by the algorithm is in Figure 9.4(g).
Notice that this is a nonminimal DFA deﬁning the language a(a +b)
∗
.
It remains to discuss Ahonen’s original algorithm [3]. It is essentially the
same as the one we already presented, with a slightly diﬀerent ForceOrbit
Property (see Algorithm 7). We did not succeed in recovering the actual
implementation of Ahonen. As the algorithm presented by Ahonen is slightly
incorrect, we had to mend it in order to make a fair comparison. In particular,
we observed the following: (a) The iftest on l.9 should not be there. Oth
erwise, the output will certainly not always be an automaton that fulﬁlls the
orbit property. (b) The iftest on l.8 should, in our opinion, be some kind of
forloop. Currently, the algorithm does not necessarily choose the same a for
each pair of gates (g
i
, g
j
). We do believe, however, that these diﬀerences are
merely typos and that Ahonen actually intended to present this new version.
Several additional remarks should be made about the original paper [3]:
(i) it does not prove that the input DFA for Algorithm 5 is transformed
into an automaton that can always be converted into a deterministic regular
expression; (ii) it does not formally explain how states of the automaton should
be merged; we have chosen the most reasonable deﬁnition; and (iii) it does
170 Handling NonDeterministic Regular Expressions
1 start 2 3 4 5
a
a
b a
a
b
(a) Minimal DFA A for (aba + a)
+
b.
1 start 2 3 4 5
a
a
b a
b
a
b
(b) Extra ﬁnal state and transition.
2 start 3 4
a
b a
a
(c) Orbit automaton A
2
.
2,4
start 3
a
b
a
(d) Merge 2,4.
2,4
start 3
b
(e) {a}cut.
1 start
2,4
3
5
a
a
b
a
b b
(f) Reconstruct.
1 start
2,4 3,5
a
a
b
a
b
(g) Merge 3,5.
Figure 9.4: Example run of the adapted Ahonen’s algorithm.
not explain how the output DFA should be reconstructed when going back up
in the recursion. Therefore, we had to make some assumptions. For example,
we assume that, when recombining orbits into a large automaton, we have a
transition (q
1
, a, q
2
) if and only if there were subsets q
′
1
⊆ q
1
and q
′
2
⊆ q
2
such
that the original automaton had a transition (q
′
1
, a, q
′
2
). (See, for example, the
9.3. Approximating Deterministic Regular Expressions 171
Algorithm 7 The original ForceOrbitProperty.
ForceOrbitProperty(A = (Q, Σ, δ, q
0
, F))
2: for each orbit C of A do
Let g
1
, . . . , g
k
be the gates of K
4: if there exists a gate g
i
∈ F then
F ← F ∪ {g
1
, . . . , g
k
}
6: for each ordered pair of gates (g
i
, g
j
) do
if there is an a s.t. (g
i
, a, q) ∈ δ
8: for q outside Orbit(g
i
) and (g
j
, a, q) / ∈ δ then
if (q
j
, a, q
′
) ∈ δ for q
′
= q then
10: Add (g
j
, a, q) to δ
Merge q and q
′
transition ({2, 4}, b, {5}) in Figure 9.4(f), which is there because the original
automaton in Figure 9.4(b) had a transition ({4}, b, {5}).)
With respect to remark (i), we noticed in our experiments that Ahonen’s
algorithm sometimes indeed outputs a DFA that cannot be converted into an
equivalent deterministic expression. If this happens, we reiterate Ahonen’s
algorithm to the thus far constructed DFA until the resulting DFA deﬁnes a
deterministic language.
9.3.4 Ahonen’s Algorithm Followed by Grow
Ahonen’s algorithm ahonenbkw relies on bkw to construct the correspond
ing deterministic expression. However, as we know, bkw generates very large
expressions. We therefore consider the algorithm ahonengrow which runs
grow on the DFA resulting from ahonen.
9.3.5 Shrink
As a ﬁnal approach, we present shrink. The latter algorithm operates on
statelabeled ﬁnite automata instead of standard DFAs. Recall that for state
labeled automata, the function symbol : Q → Σ associates each state with
its corresponding symbol. For instance, the automaton in Figure 9.4(a) is
statelabeled and has symbol(2) = a, symbol(3) = b, symbol(4) = a, and
symbol(5) = b; symbol(1) is undeﬁned as 1 does not have incoming transitions.
We note that from any ﬁnite automaton, we can easily construct an equiv
alent statelabeled automaton by duplicating states which have more than one
symbol on their incoming transitions. In particular, from a minimal DFA, one
can thus construct a minimal statelabeled DFA.
172 Handling NonDeterministic Regular Expressions
Algorithm 8 The shrink algorithm with pool size P.
Require: Minimal statelabeled DFA A, P ∈ N
Ensure: Array of det. reg. exp. s with L(A) ⊆ L(s)
Pool ← {A}
2: BestArray ← empty array of A −Σ elements
while Pool is not empty do
4: j ← 1
for each B ∈ Pool do
6: for each pair of states q
1
, q
2
of B with symbol(q
1
) = symbol(q
2
) do
B
j
← Merge(B,q
1
,q
2
)
8: j ← j + 1
Pool ← Rank(P,{B
1
, . . . , B
j−1
})
10: for each B
k
∈ Pool do
r
k,1
← koatokore(B
k
)
12: r
k,2
← grow(B
k
)
for each ℓ, Σ ≤ ℓ ≤ A do
14: BestArray[ℓ] ← the deterministic regexp r of size ℓ from BestArray[ℓ]
and all r
k,x
with maximal value proximity(A, r)
16: return BestArray
The philosophy behind shrink rather opposes the one behind grow: it
tries to reduce the number of states of the input automaton by merging pairs
of states with the same label, until every state has a diﬀerent label.The result
of shrink is an array containing deterministic expressions for which the lan
guage proximity to the target language is maximal among the deterministic
expressions of the same size.
shrink is presented in Algorithm 8. The call to Merge(B, q
1
, q
2
) is the
one we explained in Section 9.3.3 and operates on the DFA B. Rank(P,{B
1
,
. . . , B
j
}) is a ranking procedure that selects the P “best” automata from
the set {B
1
, . . . , B
j
}. Thereto, we say that an automaton B
i
is better than
B
j
when L(B
i
) is deterministic but L(B
j
) is nondeterministic. Otherwise, if
L(B
i
) and L(B
j
) are either both deterministic or both nondeterministic, B
i
is better than B
j
if and only if proximity(B
i
, A) > proximity(B
j
, A). That is,
we favour automata which deﬁne deterministic languages (as we are looking
for deterministic languages), and when no distinction is made in this manner,
we favour those automata which form the best approximation of the original
language.
koatokore is an algorithm of [8] which transforms a statelabeled au
tomaton A into a (possibly nondeterministic) expression r such that L(A) ⊆
9.4. Experiments 173
L(r). Further r contains one symbol for every labeled state in A. As we are
only interested in deterministic expressions, we discard the result of koato
kore when it is nondeterministic. However, if every state of A is labeled with
a diﬀerent symbol, then the resulting expression also contains every symbol
only once, and hence is deterministic. As shrink will always generate au
tomata which have this property, shrink is thus guaranteed to always output
at least one deterministic approximation.
Notice that shrink always terminates: the automata in Pool become
smaller in each iteration and when they have Σ states left, no more merges
can be performed.
9.4 Experiments
In this section we validate our approach by means of an experimental analy
sis. All experiments were performed using a prototype implementation writ
ten in Java executed on a Pentium M 2 GHz with 1GB RAM. As the XML
Schema speciﬁcation forbids ambiguous content models, it is diﬃcult to ex
tract realworld expressions violating UPA from the Web. We therefore test
our algorithms on a sizable and diverse set of generated regular expressions.
To this end, we apply the synthetic regular expression generator used in [8]
to generate 2100 nondeterministic regular expressions. From this set, 1200
deﬁne deterministic languages while the others do not. We utilize three pa
rameters to obtain a versatile sample. The ﬁrst parameter is the size of the
expressions (number of occurrences of alphabet symbols) and ranges from 5 to
50. The second parameter is the average number of occurrences of alphabet
symbols in the expression, denoted by κ. That is, when the size of r is n, then
κ(r) = n/Char(r), where Char(r) is the set of diﬀerent alphabet symbols
occurring in r. For instance, when r = a(a +b)
+
acab, then κ(r) = 7/3 = 2.3.
In our sample, κ ranges from 1 to 5. At ﬁrst glance, the maximum value of 5
for κ might seem small. However, the latter value must not be confused with
the maximum number of occurrences of a single alphabet symbol, which in our
sample ranges from 1 to 10. Finally, the third parameter measures how much
the language of a generated expression overlaps with Σ
∗
, which is measured
by proximity(r, Σ
∗
). The expressions are generated in such a way that the
parameter covers the complete spectrum uniformly from 0 to 1.
9.4.1 Deciding Determinism
As a sanity check, we ﬁrst ran the algorithm bkw
dec
on the real world deter
ministic expressions obtained in the study [10] (which are all deterministic).
On average they were decided to deﬁne a deterministic regular language within
174 Handling NonDeterministic Regular Expressions
35 milliseconds. This outcome is not very surprising as κ for each of these
expressions is close to 1 (cfr. Proposition 100). We then ran the algorithm
bkw
dec
on each of the 2100 expressions and were surprised that on average
no more than 50 milliseconds were needed, even for the largest expressions of
size 50. Upon examining these expressions more closely, we discovered that
all of them have small corresponding minimal DFAs: on average 25 states
or less. Apparently random regular expressions do not suﬀer much from the
theoretical worst case exponential size increase when translated into DFAs.
9.4.1.1 Discussion
Although the problem of deciding determinism is theoretically intractable
(Theorem 99), in practice, there does not seem to be a problem so we can
safely use bkw
dec
as a basic building block of supac (Algorithm 9).
9.4.2 Constructing Deterministic Regular Expressions
In this section, we compare the deterministic regular expressions generated
by the three algorithms: bkw, bkwopt, and grow. We point out that the
comparison with bkw is not a fair one, as the latter was not deﬁned with
eﬃciency in mind.
Table 9.1 depicts the average sizes of the expressions generated by the three
methods (again size refers to number of symbol occurrences), with the average
running times in brackets. Input size refers to the size of the input regular
expressions. Here, the poolsize and depth of groware 100 and 5, respectively.
We note that for every expression individually the output of bkw is always
larger than that of bkwopt and that grow, when it succeeds, always gives
the smallest expression. Due to the exponential nature of the bkw algorithm,
both bkw and bkwopt can not be used for input expressions of size larger
than 20.
4
For smaller input expressions, bkwopt is better than bkw, but
still returns expressions which are in general too large to be easily interpreted.
In strong contrast, when it succeeds, grow produces very concise expressions,
roughly the size of the input expression.
It remains to discuss the eﬀectiveness of grow. In Table 9.2, we give
the success rates and the average running times for various sizes of input
expressions and for several values for poolsize and depth. It is readily seen
that the success rate of grow is inversely proportional to the input size,
starting at 90% for input size 5, but deteriorating to 20% for input size 25.
Further, Table 9.2 also shows that increasing the poolsize or depth only has
4
For expressions of size 20, bkw already returned expressions of size 560.000.
9.4. Experiments 175
input size bkw bkwopt grow
5 9 (< 0.1) 7 (< 0.1) 3 (< 0.1)
10 216 (< 0.1) 95 (0.1) 6 (0.2)
15 1577 (0.2) 394 (0.6) 9 (0.6)
20 / / 12 (1.5)
2530 / / 13 (4.0)
3550 / / 23 (19.6)
Table 9.1: Average output sizes and running times (in brackets, in seconds)
of bkw, bkwopt and grow on expressions of diﬀerent input size.
input size (d:5,p:20) (5,100) (10,20) (10,100)
5 89 (< 0.1) 89 (< 0.1) 89 (< 0.1) 89 (< 0.1)
10 66 (< 0.1) 68 (0.2) 68 (0.1) 70 (0.5)
15 43 (0.1) 46 (0.6) 44 (0.3) 47 (1.6)
20 31 (0.3) 33 (1.5) 31 (0.8) 33 (3.8)
2530 21 (0.8) 21 (4.0) 21 (1.8) 21 (9.1)
3550 7 (3.9) 8 (19.6) 7 (8.3) 8 (43.7)
Table 9.2: Success rates (%) and average running times (in brackets, in sec
onds) of grow for diﬀerent values of the depth (d) and poolsize (p) parame
ters.
a minor impact on the success rate of grow, but a bigger inﬂuence on its
running time.
9.4.2.1 Discussion
grow is the preferred method to run in a ﬁrst try. When it gives a result it
is always a concise one. Its success rate is inversely proportional to the size
of the input expressions and quite reasonable for expressions up to size 20.
Should grow fail, it is not a real option to try bkwopt as on expressions of
that size it never produces a reasonable result. In that case, the best option
is to look for a concise approximation (as implemented in Algorithm 9).
9.4.3 Approximating Deterministic Regular Expressions
We now compare the algorithms ahonenbkw, shrink, and ahonengrow.
Note that ahonenbkw and ahonengrow return a single approximation,
whereas shrink returns a set of expressions (with a tradeoﬀ between size and
proximity). To simplify the discussion, we take from the output of shrink the
expression with the best proximity, disregarding the size of the expressions.
176 Handling NonDeterministic Regular Expressions
input size ahonenbkw ahonengrow shrink
5 0.73 (100%) 0.71 (75%) 0.75 (100%)
10 0.81 (100%) 0.79 (56%) 0.78 (100%)
15 0.84 (100%) 0.88 (40%) 0.79 (100%)
20 / 0.89 (18%) 0.76 (100%)
2530 / 0.89 (8%) 0.71 (100%)
3550 / 0.75 (4%) 0.68 (100%)
Table 9.3: Quality of approximations of ahonenbkw, ahonengrow, and
shrink (closer to one is better). Success rates in brackets.
input size ahonenbkw ahonengrow shrink
5 8 (100%) 3 (75%) 3 (100%)
10 28 (100%) 6 (56%) 6 (100%)
15 73 (100%) 8 (40%) 8 (100%)
20 / 11 (18%) 10 (100%)
2530 / 11 (8%) 13 (100%)
3550 / 14 (4%) 18 (100%)
Table 9.4: Average output sizes of ahonenbkw, ahonengrow, and shrink.
Success rates in brackets.
This is justiﬁed as all expressions returned by shrink are concise by deﬁnition.
In a practical scenario, however, the choice in tradeoﬀ between proximity and
conciseness can be left to the user.
Table 9.3 then shows the average proximity(r, s) where r is the input ex
pression and s is the expression produced by the algorithm. As ahonengrow
is not guaranteed to produce an expression, the success rates are given in
brackets. In contrast, ahonenbkw and shrink always return a determinis
tic expression. Further, as ahonenbkw uses bkw as a subroutine its use is
restricted to input expressions of size 15.
We make several observations concerning Table 9.3. First, we see that the
succes rate of ahonengrow is inversely proportional to the size of the input
expression. This is to be expected, as grow is not very successful on large
input expressions or automata. But, as ahonenbkw is also not suited for
larger input expressions, only shrink produces results in this segment.
Concerning the quality of the approximations, we only compare ahonen
bkw with shrink because ahonengrow, when it succeeds, returns an ex
pression equivalent to ahonenbkw which consequently possesses the same
proximity. In Table 9.3, we observe that ahonenbkw returns on average
slightly better approximations than shrink. Also in absolute values, ahonen
9.5. SUPAC: Supportive UPA Checker 177
Algorithm 9 Supportive UPA Checker supac.
Require: regular expression r
Ensure: deterministic reg. exp. s with L(r) ⊆ L(s)
if r is deterministic then return r;
2: else if L(r) is deterministic then
if grow(r) succeeds then return grow(r)
4: else return best from bkwopt(r) and shrink(r)
else return best from ahonengrow(r) and shrink(r)
bkw returns in roughly 2/3th of the cases the best approximation with respect
to proximity, and shrink in the other 1/3th. We further observe that the qual
ity of the approximations of shrink only slightly decreases but overall remains
fairly good, with 0.68 for expressions of size 50.
Table 9.4 shows the average output sizes of the diﬀerent algorithms. Here,
we see the advantage of ahonengrow over ahonenbkw. When an ex
pression is returned by ahonengrow, it is much more concise than (though
equivalent to) the output of ahonenbkw. Furthermore, it has a small chance
on success for those sizes of expressions on which ahonenbkw is not feasible
anymore. Also shrink can be seen to always return very concise expressions.
Finally, we consider running times. On the input sizes for which ahonen
bkw is applicable, it runs in less than a second. ahonengrow was executed
with poolsize 100 and depth 5 for the grow subroutine, and took less than
a second for the small input sizes (5 to 15) and up to a half a minute for
the largest (50). Finally, shrink was executed with poolsize 10, for input
expressions of size 5 to 20, and poolsize 5 for bigger expressions, and took up
to a few seconds for small input expressions, and a minute on average for the
largest ones.
9.4.3.1 Discussion
As running times do not pose any restriction on the applicability of the pro
posed methods, the best option is to always try ahonengrow and shrink,
and ahonenbkw only for very small input expressions (up to size 5) and
subsequently pick the best expression.
9.5 SUPAC: Supportive UPA Checker
Based on the observations made in Section 9.4, we deﬁne our supportive UPA
checker supac as in Algorithm 9. We stress that the notion of ‘best’ ex
pression can depend on both the conciseness and the proximity of the result
178 Handling NonDeterministic Regular Expressions
input output
c
∗
cac +b c
+
ac +b
(a?bc +d)
+
d ((a?(bc)
+
)
∗
d
+
)
+
((cba +c)
∗
b)? ((c
+
ba?)
∗
b?)?
(c
+
cb +a +c)
∗
(a
+
+c
+
b?)
∗
Table 9.5: Sample output of supac.
ing regular expressions and is essentially left as a choice for the user. Note
that, when grow does not succeed, there is only a choice between a probably
lengthy equivalent expressions generated by bkwopt or a concise approxi
mation generated by shrink. In line 5, we could also make the distinction
between expressions r of small size (≈ 5) and larger size (> 5). For small
expressions, we then could also try ahonenbkw.
As an illustration, Table 9.5 lists a few small input expressions with the
corresponding expression constructed by supac. The ﬁrst two expressions
deﬁne deterministic languages, while the last two do not.
10
Conclusion
To conclude I would like to make a few observations.
First, let me point out the strong interaction between XML and formal
language theory. Clearly, XML technologies are strongly rooted in formal
languages. Indeed, XML schema languages are formalisms based on context
free grammars, make use of regular expressions, and thus deﬁne regular tree
languages. Even the study of counting and interleaving operators present in
these languages already goes back to the seventies. Therefore, XML raises
questions well worth studying by the formal language theory community, such
as the following:
• In Chapter 6, we discussed deterministic regular expression with count
ing, and showed that they can not deﬁne all regular languages. However,
a good understanding of the class of languages they deﬁne is still lacking.
In particular, an algorithm deciding whether a language can be deﬁned
by a weak deterministic expression with counting would be a major step
forward.
• Deterministic regular expressions have the beneﬁt of being deterministic,
but mostly have negative properties. Therefore, it would be interesting
for applications if another class of languages, deﬁnable by corresponding
“deterministic” regular expressions, could be found that does not have
these downsides. Ideally, this would include (1) an easy syntax, (2)
“suﬃcient” expressivity, (3) good closure properties, and (4) tractable
decision problems. Here, the ﬁrst point is probably the most important
180 Conclusion
as it is the lack of a syntax for deﬁning all (and only) deterministic
expressions, that makes them so diﬃcult to use in practice.
The above illustrates that there is a strong link from formal language
theory to XML. In this thesis, I hope to have illustrated that the opposite
also holds. For instance, when studying the complexity of translating pattern
based schemas to other schema formalisms it turned out that this reduces to
taking the complement and intersection of regular expressions. Studying the
latter showed that this can be surprisingly diﬃcult (double exponential), and
it furthermore followed that in a translation from ﬁnite automata to regular
expressions an exponential sizeincrease can not be avoided, even when the
alphabet is ﬁxed. These results clearly belong more to the ﬁeld of formal
language theory, and illustrate that XML research can add to our knowledge
about formal languages.
Finally, this dual connection between XML and formal language theory
can, or should, also be found in theory versus practice. The theoretical ques
tions asked here are mostly inspired by practical problems, but it is often
more diﬃcult for theoretical results to ﬁnd their way back to practice. For
instance, BruggemannKlein and Wood’s much cited paper concerning deter
ministic (oneunambiguous) regular languages [17] is already 10 years old, but
never found its way to practical algorithms. We implemented these algorithms
(see Chapter 9) and demonstrate that they can also be of practical interest.
We also show in that chapter that deciding whether a language is determin
istic, is pspacehard but yet can be done very eﬃciently on all instances of
interest. This is of course not a contradiction as the pspacehardness only
implies worstcase complexity bounds, but the term worstcase is easily for
gotten in discussing the diﬃculty of problems. Therefore, the main lesson I
learned in writing this thesis is that a fundamental study can be a very useful
ﬁrst step in solving a problem, but should never be its last.
11
Publications
The results presented in this thesis have been published in several papers. We
list these publications here. Apart from the publications listed below, I also
cooperated on a few other papers [8, 13, 33, 39].
Chapter Reference
3 [42]
4 [37]
5 [40]
6 [38]
7 [41]
8 [40]
9 [7]
Bibliography
[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web : From
Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis
of Computer Algorithms. AddisonWesley, 1974.
[3] H. Ahonen. Disambiguation of SGML content models. In Principles of
Document Processing (PODP 1996), pages 27–37, 1996.
[4] M. Almeida, N. Moreira, and R. Reis. Enumeration and generation
with a string automata representation. Theoretical Computer Science,
387(2):93–102, 2007.
[5] M. Benedikt, W. Fan, and F. Geerts. XPath satisﬁability in the presence
of DTDs. Journal of the ACM, 55(2), 2008.
[6] P. A. Bernstein. Applying Model Management to Classical Meta Data
Problems. In Innovative Data Systems Research (CIDR 2003), 2003.
[7] G. J. Bex, W. Gelade, W. Martens, and F. Neven. Simplifying XML
Schema: Eﬀortless handling of nondeterministic regular expressions. In
Management of Data (SIGMOD 2009), 2009.
[8] G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning de
terministic regular expressions for the inference of schemas from XML
data. In World Wide Web Conference (WWW 2008), pages 825–834,
2008.
[9] G. J. Bex, F. Neven, T. Schwentick, and K. Tuyls. Inference of concise
DTDs from XML data. In Very Large Data Base (VLDB 2006), pages
115–126, 2006.
[10] G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML
schema: A practical study. In The Web and Databases (WebDB 2004),
pages 79–84, 2004.
183
184 Bibliography
[11] G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML Schema
deﬁnitions from XML data. In Very Large Databases (VLDB 2007),
pages 998–1009, 2007.
[12] G. J. Bex, F. Neven, and S. Vansummeren. SchemaScope: a system for
inferring and cleaning XML schemas. In Management of Data (SIGMOD
2008), pages 1259–1262, 2008.
[13] H. Bj¨ orklund, W. Gelade, M. Marquardt, and W. Martens. Incremental
XPath evaluation. In Database Theory (ICDT 2009), pages 162–173,
2009.
[14] R. Book, S. Even, S. Greibach, and G. Ott. Ambiguity in graphs and
expressions. IEEE Transactions on Computers, 20:149–153, 1971.
[15] A. Br¨ uggemannKlein. Regular expressions into ﬁnite automata. Theo
retical Computer Science, 120(2):197–213, 1993.
[16] A. Br¨ uggemannKlein, M. Murata, and D. Wood. Regular tree and
regular hedge languages over unranked alphabets. Technical report, The
Hongkong University of Science and Technologiy, April 3 2001.
[17] A. Br¨ uggemannKlein and D. Wood. Oneunambiguous regular lan
guages. Information and Computation, 142(2):182–206, 1998.
[18] J. Buchi. Weak secondorder logic and ﬁnite automata. S. Math. Logik
Grundlagen Math., 6:66–92, 1960.
[19] A. K. Chandra, D. Kozen, and L. J. Stockmeyer. Alternation. Journal
of the ACM, 28(1):114–133, 1981.
[20] C. Chitic and D. Rosu. On validation of XML streams using ﬁnite state
machines. In The Web and Databases (WebDB 2004), pages 85–90, 2004.
[21] N. Chomsky and G. A. Miller. Finite state languages. Information and
Control, 1(2):91–112, 1958.
[22] J. Clark and M. Murata. RELAX NG Speciﬁcation. OASIS, December
2001.
[23] R. S. Cohen. Ranknonincreasing transformations on transition graphs.
Information and Control, 20(2):93–113, 1972.
[24] D. Colazzo, G. Ghelli, and C Sartiani. Eﬃcient asymmetric inclusion
between regular expression types. In Database Theory (ICDT 2009),
pages 174–182, 2009.
Bibliography 185
[25] J. Cristau, C. L¨ oding, and W. Thomas. Deterministic automata on
unranked trees. In Fundamentals of Computation Theory (FCT 2005),
pages 68–79, 2005.
[26] S. DalZilio and D. Lugiez. XML schema, tree logic and sheaves au
tomata. In Rewriting Techniques and Applications (RTA 2003), pages
246–263, 2003.
[27] Z. R. Dang. On the complexity of a ﬁnite automaton corresponding to
a generalized regular expression. Dokl. Akad. Nauk SSSR, 1973.
[28] A. Deutsch, M. F. Fernandez, and D. Suciu. Storing Semistructured
Data with STORED. In Management of Data (SIGMOD 1999), pages
431–442, 1999.
[29] L. C. Eggan. Transition graphs and the star height of regular events.
Michigan Mathematical Journal, 10:385–397, 1963.
[30] A. Ehrenfeucht and H. Zeiger. Complexity measures for regular expres
sions. Journal of Computer and System Sciences, 12(2):134–146, 1976.
[31] K. Ellul, B. Krawetz, J. Shallit, and M. Wang. Regular expressions:
New results and open problems. Journal of Automata, Languages and
Combinatorics, 10(4):407–437, 2005.
[32] J. Esparza. Decidability and complexity of petri net problems  an
introduction. In Petri Nets, pages 374–428, 1996.
[33] W. Fan, F. Geerts, W. Gelade, F. Neven, and A. Poggi. Complexity
and composition of synthesized web services. In Principles of Database
Systems (PODS 2008), pages 231–240, 2008.
[34] J. Flum and M. Grohe. Parametrized Complexity Theory. Springer,
2006.
[35] M. F¨ urer. The complexity of the inequivalence problem for regular ex
pressions with intersection. In Automata, Languages and Programming
(ICALP 1980), pages 234–245, 1980.
[36] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide
to the Theory of NPCompleteness. Freeman, 1979.
[37] W. Gelade. Succinctness of regular expressions with interleaving, inter
section and counting. In Mathematical Foundations of Computer Science
(MFCS 2008), pages 363–374, 2008.
186 Bibliography
[38] W. Gelade, M. Gyssens, and W. Martens. Regular expressions with
counting: Weak versus strong determinism. In Mathematical Founda
tions of Computer Science (MFCS 2009), 2009.
[39] W. Gelade, M. Marquardt, and T. Schwentick. The dynamic complexity
of formal languages. In Theoretical Aspects of Computer Science (STACS
2009), pages 481–492, 2009.
[40] W. Gelade, W. Martens, and F. Neven. Optimizing schema languages
for XML: Numerical constraints and interleaving. SIAM Journal on
Computing, 38(5):2021–2043, 2009. An extended abstract appeared in
the International Conference on Database Theory (ICDT 2007).
[41] W. Gelade and F. Neven. Succinctness of patternbased schema lan
guages for XML. Journal of Computer and System Sciences. To Ap
pear. An Extended abstract appeared in the International Symposium
on Database Programming Languages (DBPL 2007).
[42] W. Gelade and F. Neven. Succinctness of the complement and inter
section of regular expressions. ACM Transactions on Computational
Logic. To appear. An extended abstract appeared in the International
Symposium on Theoretical Aspects of Computer Science (STACS 2008).
[43] G. Ghelli, D. Colazzo, and C. Sartiani. Eﬃcient inclusion for a class of
XML types with interleaving and counting. In Database Programming
Languages (DBPL 2007), volume 4797, pages 231–245. Springer, 2007.
[44] G. Ghelli, D. Colazzo, and C. Sartiani. Linear time membership in a class
of regular expressions with interleaving and counting. In Information
and Knowledge Management (CIKM 2008), pages 389–398, 2008.
[45] I. Glaister and J. Shallit. A lower bound technique for the size of nonde
terministic ﬁnite automata. Information Processing Letters, 59(2):75–77,
1996.
[46] N. Globerman and D. Harel. Complexity results for twoway and
multipebble automata and their logics. Theoretical Computer Science,
169(2):161–184, 1996.
[47] M. Grohe and N. Schweikardt. The succinctness of ﬁrstorder logic on
linear orders. Logical Methods in Computer Science, 1(1), 2005.
[48] H. Gruber and M. Holzer. Finite automata, digraph connectivity, and
regular expression size. In Automata, Languages and Programming
(ICALP 2008), pages 39–50, 2008.
Bibliography 187
[49] H. Gruber and M. Holzer. Provably shorter regular expressions from
deterministic ﬁnite automata. In Developments in Language Theory
(DLT 2008), pages 383–395, 2008.
[50] H. Gruber and M. Holzer. Language operations with regular expressions
of polynomial size. Theoretical Computer Science, 410(35):3281–3289,
2009.
[51] H. Gruber and M. Holzer. Tight bounds on the descriptional complex
ity of regular expressions. In Developments in Language Theory (DLT
2009), pages 276–287, 2009.
[52] H. Gruber, M. Holzer, and M. Tautschnig. Short regular expressions
from ﬁnite automata: Empirical results. In Implementation and Appli
cations of Automata (CIAA 2009), pages 188–197, 2009.
[53] H. Gruber and J. Johannsen. Optimal lower bounds on regular expres
sion size using communication complexity. In Foundations of Software
Science and Computational Structures (FOSSACS 2008), pages 273–286,
2008.
[54] L. Hemaspaandra and M. Ogihara. Complexity Theory Companion.
Springer, 2002.
[55] J. E. Hopcroft, R. Motwani, and J.D. Ullman and. Introduction to
Automata Theory, Languages, and Computation. AddisonWesley, third
edition, 2007.
[56] H. Hosoya and B. C. Pierce. Xduce: A statically typed XML processing
language. ACM Transactions on Internet Technologies, 3(2):117–148,
2003.
[57] J. Hromkovic, S. Seibert, and T. Wilke. Translating regular expres
sions into small epsilonfree nondeterministic ﬁnite automata. Journal
of Computer and System Sciences, 62(4):565–588, 2001.
[58] A. Hume. A tale of two greps. Software, Practice and Experience,
18(11):1063–1072, 1988.
[59] L. Ilie and S. Yu. Algorithms for computing small NFAs. In Mathematical
Foundations of Computer Science (MFCS 2002), pages 328–340, 2002.
[60] J. J¸ edrzejowicz and A. Szepietowski. Shuﬄe languages are in P. Theo
retical Computer Science, 250(12):31–53, 2001.
188 Bibliography
[61] T. Jiang and B. Ravikumar. A note on the space complexity of some
decision problems for ﬁnite automata. Information Processing Letters,
40(1):25–31, 1991.
[62] G. Kasneci and T. Schwentick. The complexity of reasoning about
patternbased XML schemas. In Principles of Database Systems (PODS
2007), pages 155–163, 2007.
[63] P. Kilpel¨ ainen. Inclusion of unambiguous #res is NPhard, May 2004.
Unpublished.
[64] P. Kilpel¨ ainen and R. Tuhkanen. Regular expressions with numerical
occurrence indicators — preliminary results. In Symposium on Pro
gramming Languages and Software Tools (SPLST 2003), pages 163–173,
2003.
[65] P. Kilpel¨ ainen and R. Tuhkanen. Towards eﬃcient implementation of
XML schema content models. In Document Engineering (DOCENG
2004), pages 239–241, 2004.
[66] P. Kilpel¨ ainen and R. Tuhkanen. Oneunambiguity of regular expres
sions with numeric occurrence indicators. Information and Computation,
205(6):890–916, 2007.
[67] S. C. Kleene. Representation of events in nerve nets and ﬁnite au
tomata. In Automata Studies, pages 3–41. Princeton University Press,
1956.
[68] C. Koch and S. Scherzinger. Attribute grammars for scalable query
processing on XML streams. VLDB Journal, 16(3):317–342, 2007.
[69] C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema
based scheduling of event processors and buﬀer minimization for queries
on structured data streams. In Very Large Databases (VLDB 2004),
pages 228–239, 2004.
[70] D. Kozen. Lower bounds for natural proof systems. In Foundations of
Computer Science (FOCS 1977), pages 254–266. IEEE, 1977.
[71] O. Kupferman and S. Zuhovitzky. An improved algorithm for the mem
bership problem for extended regular expressions. In Mathematical
Foundations of Computer Science (MFCS 2002), pages 446–458, 2002.
[72] L. Libkin. Logics for unranked trees: An overview. In Automata, Lan
guages and Programming (ICALP 2005), pages 35–50, 2005.
Bibliography 189
[73] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML Queries
on Heterogeneous Data Sources. In Very Large Databases (VLDB 2001),
pages 241–250, 2001.
[74] W. Martens and F. Neven. Frontiers of tractability for typechecking
simple XML transformations. Journal of Computer and System Sciences,
73(3):362–390, 2007.
[75] W. Martens, F. Neven, and T. Schwentick. Complexity of decision prob
lems for simple regular expressions. In Mathematical Foundations of
Computer Science (MFCS 2004), pages 889–900, 2004.
[76] W. Martens, F. Neven, and T. Schwentick. Simple oﬀ the shelf abstrac
tions for XML schema. SIGMOD Record, 36(3):15–22, 2007.
[77] W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and
complexity of XML schema. ACM Transactions on Database Systems,
31(3):770–813, 2006.
[78] W. Martens and J. Niehren. On the minimization of XML schemas and
tree automata for unranked trees. Journal of Computer and System
Sciences, 73(4):550–583, 2007.
[79] A. J. Mayer and L. J. Stockmeyer. Word problemsthis time with inter
leaving. Information and Computation, 115(2):293–311, 1994.
[80] R. McNaughton. The loop complexity of puregroup events. Information
and Control, 11(1/2):167–176, 1967.
[81] R. McNaughton. The loop complexity of regular events. Information
Sciences, 1(3):305–328, 1969.
[82] R. McNaughton and H. Yamada. Regular expressions and state graphs
for automata. IEEE Transactions on Electronic Computers, 9(1):39–47,
1960.
[83] A. R. Meyer and L. J. Stockmeyer. The equivalence problem for regular
expressions with squaring requires exponential space. In Switching and
Automata Theory (FOCS 1972), pages 125–129, 1972.
[84] D. W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold
Spring Harbor Laboratory Press, September 2004.
[85] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML
schema languages using formal language theory. ACM Transactions on
Internet Technologies, 5(4):660–704, 2005.
190 Bibliography
[86] F. Neven. Automata, logic, and XML. In Computer Science Logic (CSL
2002), pages 2–26, 2002.
[87] F. Neven and T. Schwentick. On the complexity of XPath containment
in the presence of disjunction, DTDs, and variables. Logical Methods in
Computer Science, 2(3), 2006.
[88] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML
data. In Principles of Database Systems (PODS 2000), pages 35–46,
New York, 2000. ACM Press.
[89] H. Petersen. Decision problems for generalized regular expressions. In
Descriptional Complexity of Automata, Grammars and Related Struc
tures (DCAGRS 2000), pages 22–29, 2000.
[90] H. Petersen. The membership problem for regular expressions with in
tersection is complete in LOGCFL. In Theoretical Aspects of Computer
Science (STACS 2002), pages 513–522, 2002.
[91] G. Pighizzini and J. Shallit. Unary language operations, state complex
ity and Jacobsthal’s function. International Journal of Foundations of
Computer Science, 13(1):145–159, 2002.
[92] M. O. Rabin and D. Scott. Finite automata and their decision problems.
IBM Journal of Research, 3(2):115–125, 1959.
[93] F. Reuter. An enhanced W3C XML Schemabased language binding for
object oriented programming languages. Manuscript, 2006.
[94] J. M. Robson. The emptiness of complement problem for semi extended
regular expressions requires c
n
space. Information Processing Letters,
9(5):220–222, 1979.
[95] A. Salomaa. Two complete axiom systems for the algebra of regular
events. Journal of the ACM, 13(1):158–169, 1966.
[96] R. Schott and J. C. Spehner. Shuﬄe of words and araucaria trees.
Fundamenta Informatica, 74(4):579–601, 2006.
[97] M. P. Schutzenberger. On ﬁnite monoids having only trivial subgroups.
Information and Control, 8(2):190–194, 1965.
[98] T. Schwentick. Automata for XML  a survey. Journal of Computer and
System Sciences, 73(3):289–315, 2007.
Bibliography 191
[99] H. Seidl. Haskell overloading is DEXPTIMEcomplete. Information
Processing Letters, 52(2):57–60, 1994.
[100] C. M. SperbergMcQueen and H. Thompson. XML Schema.
http://www.w3.org/XML/Schema, 2005.
[101] C.M. SperbergMcQueen. Notes on ﬁnite state automata with counters.
http://www.w3.org/XML/2004/05/msmcfa.html, 2004.
[102] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential
time: Preliminary report. In Theory of Computing (STOC 1973), pages
1–9. ACM Press, 1973.
[103] E. van der Vlist. XML Schema. O’Reilly, 2002.
[104] P. van Emde Boas. The convenience of tilings. In Complexity, Logic and
Recursion Theory, volume 187 of Lecture Notes in Pure and Applied
Mathematics, pages 331–363. Marcel Dekker Inc., 1997.
[105] M. Y. Vardi. From monadic logic to PSL. In Pillars of Computer Science,
pages 656–681, 2008.
[106] V. Vianu. Logic as a query language: From frege to XML. In Theoretical
Aspects of Computer Science (STACS 2003), pages 1–12, 2003.
[107] V. Waizenegger. Uber die Eﬃzienz der Darstellung durch regul¨ are
Ausdr¨ ucke und endliche Automaten. Diplomarbeit, RWTH Aachen,
2000.
[108] L. Wall, T. Christiansen, and J. Orwant. Programming Perl. O’Reilly,
third edition, 2000.
[109] G. Wang, M. Liu, J. X. Yu, B. Sun, G. Yu, J. Lv, and H. Lu. Ef
fective schemabased XML query optimization techniques. In Inter
national Database Engineering and Applications Symposium (IDEAS
2003), pages 230–235, 2003.
[110] S. Yu. Regular languages. In G. Rozenberg and A. Salomaa, edi
tors, Handbook of formal languages, volume 1, chapter 2, pages 41–110.
Springer, 1997.
[111] S. Yu. State complexity of regular languages. Journal of Automata,
Languages and Combinatorics, 6(2):221–234, 2001.
[112] Djelloul Ziadi. Regular expression for a language without empty word.
Theoretical Computer Science, 163(1&2):309–315, 1996.
Samenvatting
De reguliere talen vormen een zeer robuuste verzameling van talen. Zij kunnen
worden voorgesteld door veel verschillende formalismen, waaronder eindige au
tomaten, reguliere uitdrukkingen, eindige monoides, monadische tweedeorde
logica, rechtslineare grammatica’s, kwantorvrije eersteorde updates, ... Re
guliere talen zijn gesloten onder een groot aantal operaties en, nog belangrijker,
bijna ieder interessant probleem met betrekking tot reguliere talen is beslis
baar. Allerhande aspecten van de reguliere talen zijn bestudeerd in de laatste
50 jaar. Vanuit een praktisch perspectief is de meest populaire manier om
reguliere talen te speciﬁ¨eren, gebruik te maken van reguliere uitdrukkingen.
Ze worden dan ook gebruikt in toepassingen in veel verschillende gebieden van
de informatica, waaronder bioinformatica, programmeertalen, automatische
veriﬁcatie, en XML schema talen.
XML is de lingua franca en de de facto standaard voor het uitwisselen van
data op het Web. Wanneer twee partijen data uitwisselen in XML, voldoen
de XML documenten meestal aan een bepaald formaat. XML schema talen
worden gebruikt om vast te leggen welke structuur een XML document mag
hebben. De meest gebruikte XML schema talen zijn DTD, XML Schema,
beide W3C standaarden, en Relax NG. Vanuit het oogpunt van de theorie
der formele talen is ieder van deze talen een grammaticagebaseerd formalisme
met reguliere uidrukkingen aan de rechterkant. Deze uitdrukkingen verschillen
echter van de standaard reguliere uitdrukkingen omdat zij zijn uitgebreid met
extra operatoren, maar ook beperkt door de vereiste dat zij deterministisch
moeten zijn. Hoewel deze zaken zijn opgenomen in W3C en ISO standaarden,
is het niet duidelijk wat hun impact is op de verschillende schema talen. Het
doel van deze thesis is daarom een studie van deze gevolgen. In het bijzonder
bestuderen we de complexiteit van het optimaliseren van XML schema’s in
de aanwezigheid van deze operatoren, we illustreren de moeilijkheden in het
migreren van een schemataal naar een andere, en we bestuderen de implicaties
van de eis tot determinisme (ook in de aanwezigheid van een teloperator), en
194 Samenvatting
proberen de vereiste toegankelijk te maken in de praktijk.
Hoewel de vragen die we ons stellen hoofdzakelijk geinspireerd zijn door
vragen over XML, geloven we dat de antwoorden ook interessant zijn voor de
algemene theoretische informatica gemeenschap. Daarom bestaat deze thesis
uit twee delen. Na de inleiding en deﬁnities in de eerste twee hoofdstuk
ken, bestuderen we in het eerste deel fundamentele aspecten van reguliere
uitdrukkingen. In het tweede deel passen we deze resultaten toe op vragen
met betrekking tot XML schema talen. Hoewel het grootste deel van het werk
van fundamentele aard is, ontwikkelden we ook software om deze theoretische
inzichten naar de praktijk over te brengen.
Fundamenten van Reguliere Uitdrukkingen
In Hoofdstuk 3 behandelen we de beknoptheid van reguliere uitdrukkingen.
In het bijzonder behandelen we de volgende vragen, waarbij, zoals gewoon
lijk, L(r) staat voor de taal gedeﬁnieerd door een reguliere uitdrukkingen r.
Gegeven reguliere uitdrukkingen r, r
1
, . . . , r
k
over een alfabet Σ,
1. wat is de complexiteit van het construeren van een reguliere uitdrukking
r
¬
die Σ
∗
\ L(r) deﬁnieert, i.e., het complement van r?
2. wat is de complexiteit van het construeren van een reguliere uitdrukking
r
∩
die L(r
1
) ∩ · · · ∩ L(r
k
) deﬁnieert?
In beide gevallen vereist het naieve algoritme dubbel exponenti¨ele tijd in
de grootte van de invoer. Inderdaad, voor het complement, vertaal r naar een
NFA en determiniseer deze (eerste exponenti¨ele stap), complementeer deze en
vertaal terug naar een reguliere uitdrukking (tweede exponenti¨ele stap). Voor
intersectie is er een gelijkaardige procedure met behulp van een vertaling naar
eindige automaten, het product van deze automaten te construeren, en terug
te vertalen naar een reguliere uitdrukking. Merk op dat beide algoritmen niet
enkel dubbel exponenti¨ele tijd vereisen, maar in het slechtste geval ook uit
drukkingen van dubbel exponenti¨ele grootte construeren. We tonen aan dat
deze naieve algoritmen niet substantieel kunnen worden verbeterd door een
klasse van reguliere uitdrukkingen te construeren waarvoor deze dubbel expo
nenti¨ele vergroting van de expressies niet kan worden vermeden. De grootste
technische contributie van dit hoofdstuk is een veralgemening van een resul
taat van Ehrenfeucht en Zeiger. We construeren een familie van talen over
een alfabet dat slechts twee symbolen bevat en tonen aan dat deze talen niet
kunnen worden gedeﬁnieerd door kleine reguliere uitdrukkingen. Hieruit volgt
bovendien dat, in het slechtste geval, een vertaling van DFAs
1
naar reguliere
1
We gebruiken in deze nederlandse samenvatting toch de engelstalige afkortingen, zoals
hier van Deterministic Finite Automaton.
Samenvatting 195
uitdrukkingen exponentieel is, zelfs wanneer het alfabet slechts uit 2 symbolen
bestaat.
Bovendien behandelen we dezelfde vragen voor twee deelverzamelingen:
enkelvoorkomende en deterministische reguliere uitdrukkingen. Een regu
liere uitdrukking is enkelvoorkomend wanneer ieder alfabetsymbool slechts
´e´enmaal voorkomt in de uitdrukking. Bijvoorbeeld, (a + b)
∗
is een enkel
voorkomende reguliere uitdrukking (SORE) terwijl a
∗
(a + b)
+
dat niet is.
Ondanks hun eenvoud omvatten SOREs toch de overgrote meerderheid van
XML schema’s op het Web. Determinisme vereist intu¨ıtief dat, wanneer een
woord van links naar rechts tegen een uitdrukking wordt gevalideerd, het al
tijd duidelijk is welk symbool in het woord overeenkomt met welk symbool in
de uitdrukking. Bijvoorbeeld, de uitdrukking (a+b)
∗
a is niet deterministisch,
maar de equivalente uitdrukking b
∗
a(b
∗
a)
∗
is het wel. De XML schema talen
DTD en XML Schema vereisen dat alle uitdrukkingen deterministisch zijn.
Zonder in detail te gaan, kunnen we stellen dat voor beide klassen het com
plementeren van uitdrukkingen makkelijker wordt, terwijl intersectie moeilijk
blijft.
Waar Hoofdstuk 3 de complexiteit van het toepassen van operaties op regu
liere uitdrukkingen onderzocht, bestuderen we in Hoofdstuk 4 de beknoptheid
van uitgebreide reguliere uitdrukkingen, i.e. uitdrukkingen uitgebreid met ex
tra operatoren. In het bijzonder bestuderen we de beknoptheid van uitdruk
kingen uitgebreid met operatoren om te tellen (RE(#)), doorsnede te nemen
(RE(∩)), en woorden te mengen (RE(&)). De teloperator laat expressies zoals
a
2,5
toe, die uitdrukt dat er minstens 2 en ten hoogste 5 a’s moeten voorko
men. Deze RE(#)s worden gebruikt in egrep en Perl uitdrukkingen en XML
Schema. De klasse RE(∩) is een goed bestudeerde extensie van de reguliere
uitdrukkingen en wordt ook vaak de halfuitgebreide reguliere uitdrukkingen
genoemd. De mengoperator laat uitdrukkingen zoals a&b &c toe, die speciﬁ
eert dat a, b, en c in eender welke volgorde mogen voorkomen. Deze operator
komt voor in de XML schema taal Relax NG [22] en, in een zeer beperkte
vorm, XML Schema. We geven een volledig overzicht van de beknoptheid
van deze klassen van uitgebreide reguliere uitdrukkingen met betrekking tot
standaard reguliere uitdrukkingen, NFAs, en DFAs.
De belangrijkste reden om de complexiteit van een vertaling van uitgebrei
de reguliere uitdrukkingen naar standaard reguliere uitdrukkingen te bestude
ren, is om meer inzicht te krijgen in de kracht van de verschillende operatoren.
Deze resultaten hebben echter ook belangrijke gevolgen met betrekking tot de
complexiteit van het vertalen van een schemataal naar een andere. Zo tonen
we bijvoorbeeld aan dat in een vertaling van RE(&) naar standaard RE een
dubbel exponenti¨ele vergroting in het algemeen niet vermeden kan worden.
Aangezien Relax NG de mengoperator toelaat, en XML Schema deze alleen
196 Samenvatting
in een zeer beperkte vorm toelaat, impliceert dit dat ook in een vertaling van
Relax NG naar XML Schema zo een dubbel exponenti¨ele vergroting niet te
vermijden is. Desondanks, aangezien XML Schema een wijdverspreide W3C
standaard is, en Relax NG een meer ﬂexibel alternatief is, zou zo een vertaling
meer dan wenselijk zijn. Ten tweede bestuderen we de beknoptheid van re
guliere uitdrukkingen ten opzichte van eindige automaten aangezien, wanneer
algoritmische problemen worden beschouwd, de gemakkelijkste oplossing vaak
via een vertaling naar automaten gaat. Onze negatieve resultaten tonen echter
aan dat het vaak een beter idee is om gespecialiseerde algoritmen, die zulke
vertalingen vermijden, voor deze uitgebreide uitdrukkingen te ontwikkelen.
In Hoofdstuk 5 bestuderen we de complexiteit van het equivalentie, in
clusie, en intersectie nietleegheid probleem voor verschillende klassen van re
guliere uitdrukkingen. Deze klassen zijn algemene reguliere uitdrukkingen,
uitgebreid met tel en mengoperatoren, en ketting reguliere uitdrukkingen
(CHAREs), uitgebreid met de teloperator. Deze CHAREs vormen een ande
re eenvoudige subklasse van de reguliere uitdrukkingen. De grootste motivatie
voor het bestuderen van deze klassen ligt in het feit dat ze gebruikt worden
in XML schema talen, en we in Hoofdstuk 8 de complexiteit van dezelfde
problemen voor deze schema talen zullen bestuderen, gebruik makend van de
resultaten in dit hoofdstuk.
In Hoofdstuk 6 bestuderen we deterministische reguliere uitdrukkingen uit
gebreid met de teloperator, aangezien dit essentieel de reguliere uitdrukkingen
zijn die voorkomen in XML Schema. De meest gebruikte notie van determinis
me is zoals we ze eerder deﬁnieerden: een expressie is deterministisch wanneer
het bij het valideren van een woord steeds duidelijk is welk symbool in het
woord overeenkomt met welk symbool in de uitdrukking. We kunnen echter
bijkomend vereisen dat het ook steeds duidelijk is hoe er van ´e´en positie naar
de volgende in de uitdrukking moet gegaan worden. De eerste notie zullen we
zwak determinisme noemen, en de tweede sterk determinisme. Bijvoorbeeld,
(a
∗
)
∗
is zwak deterministisch, maar niet sterk deterministisch, aangezien het
niet duidelijk is over welke ster we moeten itereren bij het gaan van ´e´en a naar
de volgende.
Voor standaard reguliere uitdrukkingen wordt het verschil tussen deze twee
noties zelden opgemerkt, aangezien ze voor deze uitdrukkingen bijna overeen
komen. Dit kunnen we zeggen omdat iedere zwak deterministische uitdrukking
in lineaire tijd naar een sterk deterministische uitdrukking kan worden ver
taald. Deze situatie verandert echter volledig wanneer de teloperator wordt
toegevoegd. Ten eerste, het algoritme om te beslissen of een uitdrukking al dan
niet deterministisch is, is allesbehalve eenvoudig. Bijvoorbeeld, (a
2,3
+b)
2,2
b is
zwak determintisch terwijl (a
2,3
+b)
3,3
b dat niet is. Ten tweede, zoals we zullen
aantonen, zijn zwak deterministische uitdrukkingen met teloperator strikt ex
Samenvatting 197
pressiever dan sterk deterministische uitdrukkingen met teloperator. Daarom
is het doel van dit hoofdstuk een studie van de noties van zwak en sterk deter
minisme in het bijzijn van de teloperator met betrekking tot expressiviteit,
beknoptheid, en complexiteit.
Toepassingen voor XML Schema Talen
In Hoofdstuk 7 bestuderen we patroongebaseerde schema talen. Dit zijn
schema talen equivalent in expressieve kracht met enkeltype EDTDs, de vaak
gebruikte abstractie van XML Schema. Een voordeel van deze taal is dat ze
de expressieve kracht van XSDs duidelijker maakt: het inhoud model van een
element hangt enkel af van reguliere woordeigenschappen gevormd door de
voorouders van dat element. Patroongebaseerde schema’s kunnen daarom
gebruikt worden als een typevrije frontend voor XML Schema. Aangezien ze
zowel in een existentiele als in een universele semantiek kunnen ge¨ınterpreteerd
worden, bestuderen we in dit hoofdstuk de complexiteit van vertalingen tussen
de twee semantieken en vertalingen naar de formalismen DTDs, EDTDs, en
enkeltype EDTDs, de vaak gebruikte abstracties voor respectievelijk DTD,
Relax NG en XML Schema.
Hiervoor maken we gebruik van de resultaten in Hoofdstuk 3 en tonen aan
dat in het algemeen vertalingen van patroongebaseerde schema talen naar
de andere formalismen, exponenti¨ele of zelfs dubbel exponenti¨ele tijd vragen.
Bijgevolg is er weinig hoop om patroongebaseerde schema’s, in hun meest
algemene vorm, als een nuttige frontend te gebruiken voor XML Schema.
Daarom bestuderen we ook meer beperkte klassen van schema’s: lineaire en
sterk lineaire patroongebaseerde schema’s. De meest gerestricteerde klasse,
sterk lineaire patroongebaseerde schema’s, laten eﬃciente vertalingen naar
andere talen en eﬃciente algoritmes voor beslissingsproblemen toe, en zijn bo
vendien expressief genoeg om de overgrote meerderheid van XSDs voorkomend
op het Web uit te drukken. Vanuit een praktisch oogpunt is dit dus een zeer
interessante klasse van schema’s.
In Hoofdstuk 8 bestuderen we de impact van het toevoegen van tel en
mengoperator aan reguliere uitdrukkingen voor de verschillende schema ta
len. We beschouwen het equivalentie, inclusie, en intersectie nietleegheid
probleem aangezien deze de bouwstenen voor algoritmen voor het optimalise
ren van XML schema’s zijn. We beschouwen ook het simpliﬁcatie probleem:
Gegeven een EDTD, is deze equivalent met een enkeltype EDTD of een DTD?
Ten slotte, in Hoofdstuk 9, geven we algoritmen waarmee je kan omgaan
met de vereiste in DTD en XML dat alle reguliere uitdrukkingen determinis
tisch moeten zijn, i.e. de UPA vereiste in XML Schema. In de meeste boeken
over XML Schema wordt de UPA vereiste uitgelegd met behulp van een voor
198 Samenvatting
beeld, in plaats van een duidelijke syntactische deﬁnitie. Wanneer bijgevolg,
na het opstellen van het schema, ´e´en of meerdere expressies worden geweigerd
omdat ze nietdeterministisch zijn, is het zeer moeilijk voor gebruikers, die
geen expert zijn om de reden van de fout te vinden, en nog moeilijker om deze
te corrigeren. Het doel van dit hoofdstuk is om methodes te onderzoeken om
nietdeterministische uitdrukkingen te vertalen naar kleine en leesbare deter
ministische uitdrukkingen die ofwel dezelfde taal deﬁni¨eren, ofwel een goede
benadering zijn. We stellen het supac algoritme voor dat kan gebruikt worden
in een slimme XSD tester. Naast fouten te vinden in XSDs die niet conform
UPA zijn, kan het bovendien ook goede alternatieven suggereren. Op deze
manier is de gebruiker ontlast van de vervelende UPA vereiste, en kan hij zich
concentreren op het ontwerpen van een juist schema.
Preface
This is the ideal opportunity to thank the people who have contributed either directly or indirectly to this thesis. First and foremost, I would like to thank my advisor, Frank Neven, for his continuing guidance and support. Next to his excellent vision on scientiﬁc work, he also is a great person whose students always come ﬁrst. Needless to say, this dissertation would not have been possible without him. During the last years I met many very interesting persons, too many to thank all of them personally. But thank you Thomas, Bart, Wim, Henrik, Walied, Geert Jan, Tim, Marcel, Kurt, Jan, Volker, Natalia, Goele, Stijn, Marc, Wenfei, Floris, . . . for pleasant collaborations (scientiﬁc as well as teaching), interesting conversations, and good company. Among them, special thanks go out to Wim for many fun collaborations, and Thomas for inviting me to his group in Dortmund, which really felt like a second research group to me. On a more personal note, I would also like to thank my friends for their continuing support. Last, but deﬁnitely not least, I would like to thank my parents, brothers, and sister, for being, each in their own way, simply the best people I know. I can not imagine growing up in a better family as I did. Thank you all! Diepenbeek, September 2009
i
. . . . . . .4 Schema Languages for XML . . . . . . . . . . . . . . &) . . . . 4. . . .3 Succinctness w. NFAs . . 4. . .t. . . . .3 Mn : Succinctness of RE(&) . . . . . Regular Expressions . . . . . . . . . . 57 iii . . . . . . . . . . . . . . . . . . . . . . .1 Automata for RE(#. . . . . . . . .2 Ln : Binary Encoding of Kn . . . . . . . I Foundations of Regular Expressions Zeiger to a . .4. . . . . . 4. . . 2. .4. . . . DFAs . . . . . . . 4 Succinctness of Extended Regular Expressions 4. . .4. . . . . . .1 Regular Expressions . .1 A Generalization of a Theorem by Ehrenfeucht and Fixed Alphabet . . . . .t. . . . . . . 2. . . . . . . . . 2. . . .3 Finite Automata . . . . . . . . .r. . . . . . . .4 Succinctness w. . . . . . . . . . i 1 7 7 9 10 11 12 . . . . . . . . . . . . . . . . .2 Deterministic Regular Expressions 2. . . . . .r. . 4. . . . . . . . .5 Decision Problems . . . . . . . . . . . .t. . . . . . . .r. . .1 Deﬁnitions and Basic Results . . . . . . . . . . . . . . . . . 3. . . . . . . . . . .1 Kn : The Basic Language . . . . . . . . . . . . . . .Contents Preface 1 Introduction 2 Deﬁnitions and preliminaries 2. . . . . 4. . . . . . . . . . . . . . .3 Intersecting Regular Expressions . . . . . . . . 3. . . . . . . .2 Succinctness w. . . . 4. 15 17 19 27 32 37 40 43 44 47 48 49 52 3 Succinctness of Regular Expressions 3. . . . . . . . . 5 Complexity of Extended Regular Expressions 57 5. . . . .2 Complementing Regular Expressions . . . . . . . .
. . . 9. . . . . . . . . . .2 CONTENTS Complexity of Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Constructing Deterministic Expressions . . . . . . . . . . . . . . . . . .3 Linear PatternBased Schemas . . . . . . . . . .2 Characterizations of (Extended) DTDs . . . . . . . . 6. . 6. .3 Simpliﬁcation . . . . . . . . . . . . . II Applications to XML Schema Languages . . . . . 7. . . . . . . . . . . . . . .2 Regular PatternBased Schemas . . . . . . .3. . . . . . . . 8 Optimizing XML Schema Languages 8. . . . .1. 7. . . . . .1. 9. .3. . . 6. . . .1 Preliminaries . . . . 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Optimal Approximations . . . . . . . . . . . . . .2 Enumerating Automata . . . . . . . . . . . . 7. . . . . . . . . . . . . . . . . . .1 Deciding Determinism . . . . . . . . . . . . . . . . . . . . .2. . . . . . . . . . . . .2 Quality of the approximation . . . . . . . . . . . . . . .4 Ahonen’s Algorithm Followed by Grow . . . . . . . . . . . . . 6. . . . . . . . . 9. . . . . . . . . . .8 Discussion . 9. 7. 6.1. . . 9. . 9 Handling NonDeterministic Regular Expressions 9. . 8. . . . . . 9. . . . .3 Approximating Deterministic Regular Expressions . . . 9. . . . . . . .3. .3 Ahonen’s Algorithm . . . . . . . 66 75 78 80 89 92 97 104 109 110 6 Deterministic Regular Expressions with 6. . . . . . . . . . . . . . . . . .3 Theorems . 9. . . . . .4 Succinctness .1 Complexity of DTDs and SingleType EDTDs . . . . . . . . . . . . .1. . . . . . . . .7 Decision Problems for RE(#) . . . . .4 Counter Automata . . . . 6. .3. . . . . . . . . . . . . . . . . . . 9. . . . . . . . . . .2 Expressive Power . . . . . . . . . . . . . . . . . . . . . . . . . Counting . . . . . . . . .5 Shrink . .1 PatternBased Schema Languages .5 From RE(#) to CNFA . . . . . . . .2. . . . . . . . . .2. . . . . . 7. . . . . . . . .3 Optimizing the BKWAlgorithm . . . . . . . . . . .3 Succinctness . . . . . . . . . . . . .4 Strongly Linear PatternBased Schemas . . . . 8. . . . . . . . . . . . . . . . .1 Preliminaries and Basic Results . . . . . . . . . . . . . . 6. . .6 Testing Strong Determinism . . . . .1 Growing automata . 111 113 115 116 117 119 120 120 126 133 141 143 144 147 151 155 157 158 158 161 165 165 166 167 171 171 7 Succinctness of PatternBased Schema Languages 7. .iv 5. . . . .2 Complexity of Extended DTDs . . .3. . . . . . . . . . . . . . . . . . . . . 7. . . . . . . . . . . . . . . . . . . . . . . . . 9. .
. . .2 Constructing Deterministic Regular Expressions . .1 Deciding Determinism . . . . . . . . .3 Approximating Deterministic Regular Expressions SUPAC: Supportive UPA Checker . .9. . . . . . .4. . . . . . . . . . . . . . . 9. . . 173 173 174 175 177 179 181 183 193 10 Conclusion 11 Publications Bibliography Samenvatting . . . . . . . . . 9.4 9. . . . . . . . . . . . .4. . . . .5 Experiments . . . . .4. . . . 9. .
.
These expression. All kinds of aspects of the regular languages have been studied over the past 50 years. Thereto. monadic secondorder logic [18]. diﬀer from the standard regular expressions in that they are extended by additional operators but are also restricted by the requirement to be deterministic. The most common XML schema languages are DTD.1 Introduction The regular languages constitute a highly robust class of languages. programming languages [108]... XML Schema [100]. almost any interesting problem concerning regular languages is decidable. it is not clear what their impact on the diﬀerent . both W3C standards. From a practical perspective. They can be represented by many diﬀerent formalisms such as ﬁnite automata [92]. quantiﬁerfree ﬁrstorder updates [39]. When two parties exchange data in XML. model checking [105]. and XML schema language [100]. rightlinear grammars [21]. XML is the lingua franca and the de facto standard for data exchange on the Web. From a formal language theory point of view each of these is a grammarbased formalism with regular expressions at their righthand sides. Regular languages are closed under a wide range of operations and. regular expressions [67]. Although these requirements are recorded in W3C and ISO standards. the most widespread way to specify regular languages is by using regular expressions (REs). even more importantly. XML schema languages are used to describe the structure an XML document can have. and Relax NG [22]. including bioinformatics [84]. They are used in applications in many diﬀerent areas of computer science. . the XML documents usually adhere to a certain format. ﬁnite semigroups [97]. however.
. As usual L(r) denotes the language deﬁned by regular expression r. we also developed software to bring these theoretical insights to practice (cf. we construct a family of languages over a twoletter alphabet which can not be deﬁned by small regular expressions. we apply the former results to applications concerning XML schema languages. transform r to an NFA and determinize it (ﬁrst exponential step). and study the implications of the determinism constraint (also in the presence of additional operators) and try to make it accessible in practice. Indeed. Therefore. In particular. Although most of the work is of a foundational nature. The latter . 1. in particular. taking the crossproduct and a retranslation to a regular expression. . complement it and translate back to a regular expression (second exponential step). . we believe that the answers are also interesting for the general theoretical computer science community as they answer fundamental questions concerning regular expressions. The goal of this thesis therefore is a study of these consequences. we study the complexity of optimization in the presence of additional operators. The main technical contribution of this chapter is a generalization of a result by Ehrenfeucht and Zeiger [30]. illustrate the diﬃculties of migrating from one schema language to another. For the intersection there is a similar algorithm through a translation to NFAs. We show that these naive constructions can not be substantially improved by exhibiting classes of regular expressions for which this double exponential size increase can not be avoided. In the second. r1 . what is the complexity of constructing a regular expression r∩ deﬁning L(r1 ) ∩ · · · ∩ L(rk )? In both cases. this thesis consists of two parts. we consider the following questions. rk over an alphabet Σ. we study fundamental aspects of regular expression. Foundations of Regular Expressions In Chapter 3. In the ﬁrst part. what is the complexity of constructing a regular expression r¬ deﬁning Σ∗ \ L(r). Note that both algorithms do not only take double exponential time but also result in a regular expression of double exponential size. the complement of r? 2. . In particular. Chapter 9). we address succinctness of regular expressions. the naive algorithm takes time double exponential in the size of the input.2 Introduction schema languages is. Although the questions we ask are mainly inspired by questions about XML. that is. Given regular expressions r. for the complement.
3 also allows to show that in a translation from automata to regular expression an exponential sizeincrease can in general not be avoided. this implies that also in a translation from Relax NG to XML Schema a double exponential size increase can not be avoided. when matching a string from left to right against an expression. While Chapter 3 investigates the complexity of applying language operations to regular expressions. However. and interleaving (RE(&)) operators. In particular. (a + b)∗ c is a singleoccurrence regular expression (SORE) while a∗ (a + b)∗ is not. even over a binary alphabet. Chapter 4 investigates the succinctness of extended regular expressions. but the equivalent expression b∗ a(b∗ a)∗ is. The class RE(∩) is a well studied extension of the regular expressions.5 . For instance. these results also have important consequences concerning the complexity of translating from one schema language to another. As Relax NG allows interleaving. The main reason to study the complexity of a translation from extended regular expressions to standard expressions is to gain more insight in the power of the diﬀerent operators. intersection (RE(∩)). We give a complete overview of the succinctness of these classes of extended regular expressions with respect to regular expressions. The counting operator allows for expressions such as a2. as XML Schema is a widespread W3C standard. we consider the same questions for two strict subclasses: singleoccurrence and deterministic regular expressions. The XML schema languages DTD and XML Schema [100] require expressions to be deterministic. it is always clear against which position in the expression the next symbol must be matched. expressions extended with additional operators. the expression (a + b)∗ a is not deterministic. and XML Schema only allows a very restricted form of interleaving. in XML Schema [100]. for both classes complementation becomes easier. specifying that there must occur at least two and at most ﬁve a’s. Nevertheless. These RE(#)s are used in egrep [58] and Perl [108] patterns and XML Schema [100]. and is often referred to as the semiextended regular expressions. and Relax NG is a more ﬂex . A regular expression is singleoccurrence when every alphabet symbol occurs at most once. For instance. specifying that a. Despite their apparent simplicity. and is used in the XML schema language Relax NG [22] and. and c may occur in any order. i. SOREs nonetheless capture the majority of XML schemas on the Web [9]. In short. we show that in a translation from RE(&) to standard RE a double exponential size increase can in general not be avoided. NFAs. in a very restricted form. Determinism (also called oneunambiguity [17]) intuitively requires that. In addition. while intersection remains diﬃcult. For example.e. b. The interleaving operator allows for expressions such as a & b & c. we study the succinctness of regular expressions extended with counting (RE(#)). and DFAs.
when considering algorithmic problems. the aim of this chapter is an indepth study of the notions of weak and strong determinism in the presence of counting with respect to expressiveness. we study deterministic regular expressions with counting. Second. So. as these are essentially the regular expressions allowed in XML Schema. and intersection nonemptiness problem for various classes of regular expressions. These classes are general regular expressions extended with counting and interleaving operators. Firstly. when studying the complexity of the same problems for XML schema languages. inclusion.3 + b)2. The main motivation for studying these classes lies in their application to the complexity of the diﬀerent XML schema languages and. the easiest solution is often through a translation to ﬁnite automata. but not strongly deterministic since it is not clear over which star one should iterate when going from one a to the next. However. That is. (a2.2 b is weakly deterministic. In Chapter 5. consequently. succinctness.3 b is not. The latter is another simple subclass of the regular expressions [75]. For instance. Second. it is always clear against which position in the expression the next symbol must be matched. the algorithm for deciding whether an expression is weakly deterministic is nontrivial [66]. 1 Br¨ggemannKlein [15] did not study strong determinism explicitly. the amount of nondeterminism introduced depends on the concrete values of the counters. but the very similar (a2. However. we study the succinctness of extended regular expressions to ﬁnite automata as. For standard regular expressions this distinction is usually not made as the two notions almost coincide for them. and complexity. such a translation would be more than desirable. a weak deterministic expression can be translated in linear time into an equivalent strong deterministic one [15]. as we show. (a∗ )∗ is weakly deterministic. one can also consider a stronger notion of determinism in which it additionally must also always be clear how to go from one position to the next. our negative results indicate that it is often more desirable to develop dedicated algorithms for extended regular expressions which avoid such translations. weakly deterministic expressions with counting are strictly more expressive than strongly deterministic ones. and CHAin Regular Expressions (CHAREs) extended with counting. In Chapter 6.4 Introduction ible alternative. For example. these results will further be used in Chapter 8. she does u give a procedure to transform expressions into star normal form which rewrites weakly determinisistic expressions into equivalent strongly deterministic ones in linear time. However. We refer to the former notion as weak and the latter as strong determinism.1 This situation changes completely when counting is involved. The commonly accepted notion of determinism is as given before: an expression is deterministic when matching a string against it.3 + b)3. we investigate the complexity of the equivalence. Therefore. .
we study the impact of adding counting and interleaving to regular expressions for the diﬀerent schema languages. we study patternbased schema languages. In most books (c. We consider the equivalence. thereby reducing much of the hope of using patternbased schemas in its most general form as a useful frontend for XML Schema.f. An advantage of this language is that it makes the expressiveness of XML Schema more apparent: the content model of an element can only depend on regular string properties of the string formed by the ancestors of that element. In Chapter 8. [103]). and XML Schema. in Chapter 9. So. Therefore. we make extensive use of the results in Chapter 3 and show that in general translations from patternbased schemas to the other schema formalisms requires exponential or even double exponential time. the common abstractions of DTD. the most restricted class. and singletype EDTDs. Interestingly. inclusion. Relax NG. eﬃcient algorithms for basic decision problems. and intersection nonemptiness problem as these constitute the basic building blocks in algorithms for optimizing XML schemas. one or several regular expressions are rejected by the schema checker on account of being nondeterministic. it is very diﬃcult for nonexpert2 users to grasp the source of the error and almost impossible to rewrite the expression into an admissible one. we provide algorithms for dealing with the requirement in DTD and XML Schema that all regular expressions are deterministic.e. We also consider the simpliﬁcation problem: Given an EDTD. . Patternbased schemas can therefore be used as a typefree frontend for XML Schema. the commonly used abstraction of XML Schema. EDTDs. allow for efﬁcient translations to other languages. the Unique Particle Attribution constraint in XML Schema.5 Applications to XML Schema Languages In Chapter 7. This is a schema language introduced by Martens et al. i. strongly linear schemas. and yet are expressive enough to capture the far majority of realworld XML Schema Deﬁnitions (XSDs). we also study more restricted classes of schemas: linear and strongly linear patternbased schemas. when after the schema design process. respectively Here. this is hence a very useful class of schemas. From a practical point of view. [77] and equivalent in expressive power to singletype EDTDs. is it equivalent to a singletype EDTD or a DTD? Finally. we study in this chapter the complexity of translating between the two semantics and into the formalisms of DTDs. As they can be interpreted both in an existential and universal way. The purpose of the present chapter is to investigate methods for transforming nondeterministic expres2 In formal language theory. the UPA constraint is usually explained in terms of a simple example rather than by means of a clear syntactical deﬁnition.
the task of designing an XSD is relieved from the burden of the UPA restriction and the user can focus on designing an accurate schema. . We propose the algorithm supac (Supportive UPA Checker) which can be incorporated in a responsive XSD tester which in addition to rejecting XSDs violating UPA also suggests plausible alternatives. Consequently.6 Introduction sions into concise and readable deterministic ones deﬁning either the same language or constituting good approximations.
That is. and counting operators. and every Σsymbol is a regular expression. and r1 . ε. for w ∈ Σ∗ . w′ ∈ L′ }. n} and the symbol of w at position i is ai . REs extended with interleaving. denoted by RE. ¬.2 Deﬁnitions and preliminaries In this section. that is. The set of regular expressions over Σ. intersection. For the rest of this thesis. we deﬁne their concatenation L · L′ to be the set {ww′  w ∈ L. complementation. ﬁnite automata and XML schema languages. For two string languages L. . We deﬁne the length of w. By w1 · w2 we denote the concatenation of two strings w1 and w2 . ∩. The set of all strings is denoted by Σ∗ . w & ε = ε & w = {w}. . By RE(&. . A string language is a subset of Σ∗ . As usual. Σ always denotes a ﬁnite alphabet. we denote the concatenation of w1 and w2 by w1 w2 . and aw1 & bw2 = ({a}(w1 & bw2 )) ∪ ({b}(aw1 & w2 )). for readability. We denote the empty string by ε.1 Regular Expressions By N we denote the natural numbers without zero. denoted by w. A Σstring (or simply string) is a ﬁnite sequence w = a1 · · · an of Σsymbols. L′ ⊆ Σ∗ . So. then so are r1 ·r2 . is deﬁned in the usual way: ∅. and when r1 and ∗ r2 are regular expressions. . #) we denote the class of extended regular expressions. when . The set of positions of w is {1. 2. The operator & is then extended to languages in the canonical way. to be n. r1 +r2 . We abbreviate L · L · · · L (i times) by Li . By w1 & w2 we denote the set of strings that is obtained by interleaving w1 and w2 in every possible way. we deﬁne the necessary deﬁnitions and notions concerning regular expressions.
For any subset S of {&. All notions deﬁned for extended regular expressions in the remainder of this section also apply to standard regular expressions. in the absence of complementation. • L(r1 + r2 ) = L(r1 ) ∪ L(r2 ). k < ∞ for any k ∈ N ∪ {0}. Note also that. is inductively deﬁned as follows: • L(∅) = ∅. only intersection and complementation. • L(ε) = {ε}. we denote by RE(S) the class of extended regular expressions using only operators in S. • L(a) = {a}.8 Deﬁnitions and preliminaries k. and not in any other expression not containing complementation. this is an abbreviation for rk (r + ε)l−k . When rk. we often refer to the normal regular expressions as standard regular expressions. we assume in the remainder of this thesis that the operator ∅ is only used in the expression r = ∅. we sometimes omit ∗ as a basic operator when counting is allowed. ¬. . ∩. RE(∩. r?. .ℓ ) = rr∗ . r1 ∩r2 .ℓ r1 and r2 are RE(&. we abbreviate by S the regular expression a1 + · · · + an . r + ε. . Therefore. ¬) denotes the set of all expressions using. in the canonical manner. respectively. ¬r1 and r1 for k ∈ N ∪ {0}. Here. The language deﬁned by an extended regular expression r. By r+ . we abbreviate the expressions i=1 r1 + · · · + rk . the operator ∅ is only necessary to deﬁne the language ∅ and can be removed in linear time from any extended regular expression deﬁning a language diﬀerent from ∅. unless mentioned otherwise. #) expressions then so are r1 &r2 . To distinguish from extended regular expressions. • L(r1 r2 ) = L(r1 ) · L(r2 ). For a set S = {a1 . Note also that in the context of RE(#). and rk .∞ . ¬. Therefore. #}. . • L(r1 & r2 ) = L(r1 ) & L(r2 ) • L(r1 ∩ r2 ) = L(r1 ) ∩ L(r2 ). k ri . ℓ ∈ N ∪ {∞} and k ≤ ℓ. with k ∈ N. For instance. denoted by L(r). . • L(r∗ ) = {ε} ∪ ∞ i i=1 L(r) .l is used in a standard regular expression. r∗ is an abbreviation of r0. next to the standard operators. and • L(rk. • L(¬r1 ) = Σ∗ \ L(r1 ). ∩. and rr · · · r (ktimes). an } ⊆ Σ. ℓ i=k L(r)i .
and εsymbols. However. and parentheses [2. Formally. k.2 Deterministic Regular Expressions As mentioned in the introduction. r1 r2  = r1 ∩ r2  = r1 + r2  = r1 & r2  = r1 ∩ r2  = r1  + r2  + 1. We deﬁne the size of an extended regular expression r over Σ. 2. r ∈ MRE if Char(r) ⊂ Σ × N and. A marked regular expression Σ is a regular expression over Σ × N in which every (Σ × N)symbol occurs at most once. • If r = r1 · r2 : – If ε ∈ L(r2 ).ℓ • If r = r1 : ﬁrst(r) = ﬁrst(r1 ) and last(r) = last(r1 ). A marked string is a . We introduce this notion in this section. • If r = r1 + r2 : ﬁrst(r) = ﬁrst(r1 ) ∪ ﬁrst(r2 ) and last(r) = last(r1 ) ∪ last(r2 ). for a ∈ Σ. Other possibilities considered in the literature for deﬁning the size of a regular expression are: (1) counting all symbols. or. the set ﬁrst(r) (respectively. as the number of Σsymbols and operators occurring in r plus the sizes of the binary representations of the integers. provided they are preprocessed by syntactically eliminating superﬂuous ∅. We denote the set of all these expressions by MRE. we deﬁne the size of an expression as the length of its reverse Polish form. for every subexpression ss′ or s + s′ of r.ℓ  = r + ⌈log k⌉ + ⌈log ℓ⌉. since for instance the expression (¬ε)(¬ε)(¬ε) does not contain any Σsymbols. Therefore. and nested stars. denoted by r. it is known (see. (2) counting only the Σsymbols. ﬁrst(a) = last(a) = {a}. Deterministic Regular Expressions 9 For an extended regular expression r. last(r)) consists of all symbols which are the ﬁrst (respectively. These sets are inductively deﬁned as follows: • ﬁrst(ε) = last(ε) = ∅ and ∀a ∈ Char(r). several XML schema languages restrict regular expressions occurring in rules to be deterministic (also sometimes called oneunambiguous [17]). Char(s) ∩ Char(s′ ) = ∅.2. the three length measures are identical up to a constant multiplicative factor. counting only the Σsymbols is not suﬃcient. ﬁrst(r) = ﬁrst(r1 ) ∪ ﬁrst(r2 ). 59]. r∗  = ¬r = r + 1. else last(r) = last(r2 ). last) symbols in some string deﬁned by r. Formally. For an RE(#) expression r. else ﬁrst(r) = ﬁrst(r1 ). for instance [31]) that for standard regular expressions. For extended regular expressions. – If ε ∈ L(r1 ). operators. we denote by Char(r) the set of Σsymbols occurring in r. This is equivalent to the length of its (parenthesisfree) reverse Polish form [112]. last(r) = last(r1 ) ∪ last(r2 ). ∅ = ε = a = 1.2. and rk.
Formally. The markings and demarkings of strings are deﬁned analogously. q0 is the initial state. symbols. q2 ) and (q2 . An expression r ∈ RE is deterministic (also called oneunambiguous) if. The size of an NFA is δ. w′ ∈ Σ∗ such that (q0 . q ′ ) ∈ δ ∗ . So. q3 ). Likewise w will denote a marking of a string w. For a transition (q. without looking ahead in the string. An NFA is trimmed if it only contains δ f f . q ′ ) ∈ δ ∗ for some q ′ ∈ F . i. when matching a string against the expression from left to right. p) ∈ δ. dm(r + s) := dm(r) + dm(s). a. δ ∗ is the smallest relation such that for all q ∈ Q. it holds that {(q. b ∈ Char(r). an expression is deterministic if. 2. The demarking of a marked expression is obtained by deleting these integers. By δ ∗ we denote the reﬂexivetransitive closure of δ. by r the demarking of r. for all strings u. w ∈ Char(r)∗ and all symbols a.10 Deﬁnitions and preliminaries string over Σ×N (in which (Σ×N)symbols can occur more than once). dm(rs) := dm(r)dm(s). w. q) ∈ ∗ . v. i) in marked regular expressions. we usually leave the actual marking and demarking functions m and dm implicit and denote by r a marking of the expression r and. q3 ) are in δ ∗ then so is (q1 . i. F ) where Q is the set of states. while (b∗ a)(b∗ a)∗ is. (a + b)∗ a is not deterministic. p) ∈ δ. When r is a marked regular expression. where dm : MRE → RE is deﬁned as dm(ε) := ε. L(r) is therefore a set of marked strings. for some q ∈ F . and (q.ℓ . An NFA is deterministic (or a DFA) if for all a ∈ Σ.w q ′ when (q. ε. the conditions uav. Intuitively. a) = p when (q. w. a marking of (a + b)a + bc is (a1 + b1 )a2 + b2 c1 . uv. we say that q is its source state and p its target state. δ. we always know against which symbol in the expression we must match the next symbol. (q. Any function m : RE → MRE such that for every r ∈ RE it holds that dm(m(r)) = r is a valid marking function. dm((a. q0 .3 Finite Automata A nondeterministic ﬁnite automaton (NFA) A is a 4tuple (Q. q ′ ) ∈ δ  q ′ ∈ Q} ≤ 1. A state q ∈ Q is useful if there exist strings w. Deﬁnition 1. For the rest of the thesis. w. w is accepted by A if (q0 . We also sometimes write q ⇒A. and dm(rk. δ ⊂ δ ∗ and when (q1 . The set of strings accepted by A is denoted by L(A). q ) ∈ δ ∗ . a. and strings. ubw ∈ L(r) and a = b imply that a = b. For conciseness and readability. u.e. we will from now on write ai instead of (a. F is the set of ﬁnal states and δ ⊆ Q × Σ × Q is the transition relation. δ(q. q ∈ Q.ℓ ) := dm(r)k. w ′ . the demarking of r is dm(r). q) ∈ δ ∗ . When A is a DFA we sometimes write δ and δ ∗ as a function instead of a relation. For instance. v. For instance.e. conversely. We always use overlined letters to denote marked expressions. i)) := a. a.
A is statelabeled if for any q ∈ Q. A regular expression r. a. i. Schema Languages for XML 11 useful states. can be constructed in time O(2n ) [110].. i. denoted by Dom(t). . Theorem 2. let symbols(q) = {a  ∃p ∈ Q.. a(w) is in TΣ . we denote the label of u by labt (u). q) ∈ δ}. denoted by TΣ . 1. For a node u ∈ Dom(t). 2. Notice that there is no a priori bound on the number of children of a node in a Σtree. such that L(B) = L(r). i. The set of unranked Σtrees. .4. we also denote this symbol by symbol(q).2. q0 has no incoming transitions. Then. . We denote by ancstrt (u) the sequence of labels on the path from the root to u including both the root and u itself. with L(r) = L(A). In the sequel. A DFA B with 2m states. such that L(B) = Σ∗ \L(A). For every t ∈ TΣ . . whenever we say tree. We will make use of the following wellknown results concerning ﬁnite automata and regular expressions. For q ∈ Q. and (ii) if t = a(t1 · · · tn ). We write a rather than a(). Further. 3. all transitions to a single state are labeled with the same symbol. A is nonreturning if symbols(q0 ) = ∅. the subtrees t1 . can be constructed in time O(2n ) [110]. 4.e. Let r ∈ RE. . a tree is either ε (empty) or is of the form a(t1 · · · tn ) where each ti is a tree.e. (p. then Dom(t) = {ε} ∪ n {iu  i=1 u ∈ Dom(ti )}. labt (u1) · · · labt (un). In this case. A tree language is a set of trees. can be constructed in time 2O(n) [82. such trees are therefore unranked. the set of nodes of t. An NFA B with r + 1 states..4 Schema Languages for XML We use unranked trees as an abstraction for XML documents. can be constructed in time O(r · k) [15]. of size O(mk4m ). we always mean Σtree. is the set deﬁned as follows: (i) if t = ε. and chstrt (u) denotes the string formed by the labels of the children of u. such that L(B) = L(A). then Dom(t) = ∅. and let A = n and Σ = k. is the smallest set of strings over Σ and the parenthesis symbols “(” and “)” containing ε such that.e. tn are attached to the root labeled a. In the tree a(t1 · · · tn ). A DFA B with 2m states. 31]. Denote by t1 [u ← t2 ] the tree obtained from a tree t1 by replacing the subtree rooted at node u of t1 by t2 . By subtreet (u) we denote the subtree of t rooted at u. 2. where each ti ∈ TΣ . So. symbols(q) ≤ 1. for a ∈ Σ and w ∈ (TΣ )∗ . Let A be an NFA over Σ with m states.
We denote by EDTD (respectively. Again. A tree t then satisﬁes an extended DTD if t = µ(t′ ) for some t′ ∈ L(d). respectively. s) is a DTD(R) over Σ′ . EDTDst ) the class EDTD(RE) (respectively. Similarly. #} we denote by EDTD(S) and EDTDst (S) the classes EDTD(RE(S)) and EDTDst (RE(S)). A tree t satisﬁes d if (i) labt (ε) = sd and.5 Decision Problems The following decision problems will be studied for various classes of regular expressions. By L(d) we denote the set of trees satisfying d. 2. labt (u1) · · · labt (un) ∈ L(d(labt (u))). s. Let R be a class of representations of regular string languages over Σ. i ∈ N} for some natural numbers ka . Furthermore. a ∈ Σ. Let M be a class of representations of regular string or tree languages: . µ) with the property that for every a ∈ Σ′ . respectively. and µ is a mapping from Σ′ to Σ. we sometimes denote (Σ. we always take Σ′ = {ai  1 ≤ i ≤ ka . sd ) where d is a function that maps Σsymbols to elements of R and sd ∈ Σ is the start symbol. d. and we set µ(ai ) = a. while singletype EDTDs form a strict subset thereof [77]. Σ′ .12 Deﬁnitions and preliminaries We make use of the following deﬁnitions to abstract from the commonly used schema languages [77]: Deﬁnition 3. For notational convenience. µ). (Σ′ . where Σ′ is an alphabet of types. sd ) by d and leave the start symbol sd and alphabet Σ implicit. d. EDTDst (RE)). (ii) for every u ∈ Dom(t) with n children. and XML schema languages. d. automata. EDTDs and singletype EDTDs correspond to Relax NG and XML Schema. EDTDs correspond to the unranked regular languages [16]. 1. Deﬁnition 4. we denote by L(D) the set of trees satisfying D. A singletype EDTD (EDTDst (R)) over Σ is an EDTD(R) D = (Σ. For ease of exposition. d. 85]. 2. Here we abuse notation and let µ also denote its extension to deﬁne a homomorphism on trees. As explained in [77. ¬. 3. s. for a subset S of {&. d. An extended DTD (EDTD(R)) over Σ is a 5tuple D = (Σ. in the regular expression d(a) no two types bi and bj with i = j occur. Σ′ . A DTD(R) over Σ is a tuple (Σ. ∩.
. Decision Problems • inclusion for M: Given two elements e. does there exist a nonempty string or tree f such that f ∈ L(e)? • simplification for M: Given an element e ∈ M. is L(e) ⊆ L(e′ )? 13 • equivalence for M: Given two elements e. e′ ∈ M. is L(e) = L(e′ )? • intersection for M: Given an arbitrary number of elements e1 . is n L(ei ) = ∅? i=1 • membership for M: Given an element e ∈ M and a string or a tree f .5. e′ ∈ M. en ∈ M. is f ∈ L(e)? • satisfiability for M: Given an element e ∈ M. . does there exist a DTD D such that L(e) = L(D)? . . .2.
.
Part I
Foundations of Regular Expressions
15
3
Succinctness of Regular Expressions
The two central questions addressed in this chapter are the following. Given regular expressions (REs) r, r1 , . . . , rk over an alphabet Σ, 1. what is the complexity of constructing a regular expression r¬ deﬁning Σ∗ \ L(r), that is, the complement of r? 2. what is the complexity of constructing a regular expression r∩ deﬁning L(r1 ) ∩ · · · ∩ L(rk )? In both cases, the naive algorithm takes time double exponential in the size of the input. Indeed, for the complement, transform r to an NFA and determinize it (ﬁrst exponential step), complement it and translate back to a regular expression (second exponential step). For the intersection there is a similar algorithm through a translation to NFAs, taking the crossproduct and a retranslation to a regular expression. Note that both algorithms do not only take double exponential time but also result in a regular expression of double exponential size. In this chapter, we exhibit classes of regular expressions for which this double exponential size increase cannot be avoided. Furthermore, when the number k of regular expressions is ﬁxed, r∩ can be constructed in exponential time and we prove a matching lower bound for the size increase. In addition, we consider the fragments of deterministic and singleoccurrence regular expressions relevant to XML schema languages [9, 11, 43, 77]. Our main results are summarized in Table 3.1.
18
Succinctness of Regular Expressions complement regular expression deterministic singleoccurrence 2exp poly poly intersection (ﬁxed) exp exp exp intersection (arbitrary) 2exp 2exp exp
Table 3.1: Overview of the size increase for the various operators and subclasses. The main technical part of the chapter is centered around the generalization of a result in [30]. They exhibit a class of languages (Kn )n∈N each of which can be accepted by a DFA of size O(n2 ) but cannot be deﬁned by a regular expression of size smaller than 2n−1 . The most direct way to deﬁne Kn is by the DFA that accepts it: it consists of n states, labeled 0 to n − 1, where 0 is the initial and n − 1 the ﬁnal state, and which is fully connected with the edge between state i and j carrying the label ai,j . Note that the alphabet over which Kn is deﬁned grows quadratically with n. We generalize their result to a ﬁxed alphabet. In particular, we deﬁne Hn as the binary encoding of Kn using a suitable encoding for ai,j and prove that every regular expression deﬁning Hn should be at least of size 2n . As integers are encoded in binary the complement and intersection of regular expressions can now be used to separately encode H2n (and slight variations thereof) leading to the desired results. Although the succinctness of various automata models have been investigated in depth [46] and more recently those of logics over (unary alphabet) strings [47], the succinctness of regular expressions had, up to recently, hardly been addressed. For the complement of a regular expression an exponential lower bound is given in [31]. For the intersection of an arbitrary number of regular expressions Petersen gave an exponential lower bound [90], while [31] mentions a quadratic lower bound for the intersection of two regular expressions. In fact, in [31], it is explicitly asked what the maximum achievable size increase is for the complement of one and the intersection of two regular expressions (Open Problems 4 and 5), and whether an exponential size increase in the translation from DFA to RE is also unavoidable when the alphabet is ﬁxed (Open Problem 3). More recently, there have been a number of papers concerning succinctness of regular expressions and related matters [53, 48, 49, 50, 51, 52]. Most related is [48], where, independently, a number of problems similar to the problems in this chapter are studied. They also show that in constructing a regular expression for the intersection of two expressions, an exponential sizeincrease √ can not be avoided. However, we only give a 2Ω( n) lower bound whereas
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to a Fixed Alphabet 19 they obtain 2Ω(n) , which in [49] is shown to be almost optimal. For the complementation of an RE we both obtain a double exponential lower bound. √ Ω(n) Ω( n log n) whereas we prove a (tight) 22 lower Here, however, they obtain 22 2Ω(n) lower bound. Finally, as bound. However, in [51] they also prove a 2 a corollary of our results we obtain that in a translation from a DFA to an RE an exponential size increase can not be avoided, also when the alphabet √ is ﬁxed. This yields a lower bound of 2Ω( n/ log n) which in [48] is improved to the (tight) bound of 2Ω(n) . All results mentioned here involve ﬁxed binary alphabets, and, together, these results settle open problems 3, 4, and 5 of [31]. As already mentioned in the introduction, the main motivation for this research stems from its application in the emerging area of XMLtheory [72, 86, 98, 106]. The lower bounds presented here are utilized in Chapter 7 to prove, among other things, lower bounds on the succinctness of existential and universal patternbased schemas on the one hand, and singletype EDTDs (a formalization of XSDs) and DTDs, on the other hand. As the DTD and XML Schema speciﬁcation require regular expressions occurring in rules to be deterministic, formalized by Br¨ggemannKlein and Wood in terms of oneu unambiguous regular expressions [17], we also investigate the complement and intersection of those. In particular, we show that a deterministic regular expressions can be complemented in polynomial time, whereas the lower bounds concerning intersection carry over from unrestricted regular expressions. A study in [9] reveals that most of the deterministic regular expression used in practice take a very simple form: every alphabet symbol occurs at most once. We refer to those as singleoccurrence regular expressions (SOREs) and show a tight exponential lower bound for intersection. Outline. In Section 3.1 we extend the results of Ehrenfeucht and Zeiger to languages over a ﬁxed alphabet. Then, in Sections 3.2 and 3.3 we investigate the succinctness of the complement and intersection of regular expressions, respectively.
3.1
A Generalization of a Theorem by Ehrenfeucht and Zeiger to a Fixed Alphabet
We ﬁrst introduce the family (Kn )n∈N of string languages deﬁned by Ehrenfeucht and Zeiger [30] over an alphabet whose size grows quadratically with the parameter n: Deﬁnition 5. Let n ∈ N and Σn = {ai,j  0 ≤ i, j ≤ n − 1}. Then, Kn contains exactly all strings of the form a0,i1 ai1 ,i2 · · · aik−1 ,n−1 where k ∈ N.
20
Succinctness of Regular Expressions a0,1 q0 a1,0 a0,2
q1 a2,1 a1,2 q2
a1,1
a0,0 a2,0
a2,2
Figure 3.1: The DFA AK , accepting K3 . 3 A way to interpret Kn is to consider the DFA with states {q0 , . . . , qn−1 } which is fully connected, whose start state is state q0 and ﬁnal state is qn−1 and where the edge between state qi and qj is labeled with ai,j . Formally, let AK = (Q, q0 , δ, F ) be deﬁned as Q = {q0 , . . . , qn−1 }, F = {qn−1 }, and for all n i, j ∈ [0, n − 1], (qi , ai,j , qj ) ∈ δ. Figure 3.1 shows AK . 3 Ehrenfeucht and Zeiger obtained the succinctness of DFAs with respect to regular expressions through the following theorem: Theorem 6 ([30]). For n ∈ N, any regular expression deﬁning Kn must be of size at least 2n−1 . Furthermore, there is a DFA of size O(n2 ) accepting Kn . We will now deﬁne the language Hn as a binary encoding of Kn over a fourletter alphabet, and generalize Theorem 6 to Hn . Later, we will deﬁne 2 Hn as an encoding of Hn over a binary alphabet. The language Hn will be deﬁned as the straightforward binary encoding of Kn that additionally swaps the pair of indices in every symbol ai,j . Thereto, for ai,j ∈ Σn , deﬁne the function ρn as ρn (ai,j ) = enc(j)$enc(i)#, where enc(i) and enc(j) denote the ⌈log(n)⌉bit binary encodings of i and j, respectively. Note that since i, j < n, i and j can be encoded using only ⌈log(n)⌉bits. We extend the deﬁnition of ρn to strings in the usual way: ρn (ai0 ,i1 · · · aik−1 ,ik ) = ρn (ai0 ,i1 ) · · · ρn (aik−1 ,ik ). We are now ready to deﬁne Hn . Deﬁnition 7. Let ΣH = {0, 1, $, #}. For n ∈ N, let Hn = {ρn (w)  w ∈ Kn }. For instance, for n = 5, w = a0,2 a2,1 a1,4 a4,4 ∈ K5 and thus ρn (w) = 010$000#001$010#100$001#100$100# ∈ H5 . Remark 8. Although it might seem a bit artiﬁcial to swap the indices in the binary encoding of Kn , it is deﬁnitely necessary. Indeed, consider the
3.1. A Generalization of a Theorem by Ehrenfeucht and Zeiger to a Fixed Alphabet 21
$ 0 1 0 q2 1 0 1 $ 1 0 f2,3 $ 1 0 f2,2 $ 1 0 f2,1 1 0 f2,0
Figure 3.2: The automaton B2 , for n = 4.
′ language Hn obtained from Kn like Hn but without swapping the indices of the symbols in Kn . This language can be deﬁned by a regular expressions of size O(n log(n)):
enc(0)$
i<n
enc(i)#enc(i)$ enc(n − 1)#.
∗
′ Therefore, Hn is not suited to show exponential or double exponential lower bounds on the size of regular expressions.
We now show that, by using the encoding that swaps the indices, it is possible to generalize Theorem 6 to a fourletter alphabet as follows: Theorem 9. For any n ∈ N, with n ≥ 2, 1. there is a DFA An of size O(n2 log n) deﬁning Hn ; and, 2. any regular expression deﬁning Hn is of size at least 2n . We ﬁrst show (1). Let n ∈ N and n ≥ 2. We compose An from a number of subautomata which each will be able to read one block of a string. That is, strings deﬁned by the regular expression (0 + 1)⌈log n⌉ $(0 + 1)⌈log n⌉ . For any i < n, let Bi = (Qi , qi , δi , {fi,0 , . . . , fi,n−1 }), with L(Bi ) = {w$w′  w, w′ ∈ (0 + 1)⌈log n⌉ ∧ enc(w′ ) = i}, such that for any j < n, q ⇒Bi ,w$w′ fi,j whenever enc(w) = j. That is, Bi reads strings of the form w$w′ , checks whether w′ encodes i and remembers the value of w in its accepting state. The construction of these automata is straightforward. We illustrate the construction of B2 , for n = 4, in Figure 3.2. Now, let An = (Q, q0 , δ, qn−1 ) where Q = i<n Qi and δ contains i<n δi plus for every i, j < n, (fi,j , #, qj ) ∈ δ. So, An works as follows: ﬁrst A0
A regular expression s is a starred subexpression of a regular expression r when s is a subexpression of r and is of the form t∗ . We say that i0 is the startpoint of w and ik is its endpoint. For instance. the linear automata following this part are each of size ⌈log n⌉ and there are n of these. That is. and 3 but its startand endpoint are undeﬁned as it is not the image of a string covered by Kn . v contains any integer encoded by ⌈log n⌉ consecutive bits in v. A regular expression r covers w when L(r) covers w.2. That is. for the size of An . start. and so on. treelike. .and endpoint of a string w are also extended to Hn . a0. but the notion is a bit more general.1 has startpoint 0. we say that w contains i or i occurs in w if i occurs as an index of some symbol in w. For instance.i1 ai1 . Then it passes control to Bj which reads the next block in which it remembers the ﬁrst integer. and the notions of startand endpoints is only extended to images of strings covered by Kn . the string 0$01#11$10 contains 1. For w covered by Kn .22 Succinctness of Regular Expressions reads the ﬁrst block of the string. Finally. Clearly. say k. ai. and contains 0. Whenever An reaches the initial state of Bn−1 . So. It follows the structure of the proof of Ehrenfeucht and Zeiger but is technically more involved as it deals with binary encodings of integers. is extended to every string v which is a covered by Hn . a valid string has been read and An can accept.and endpoints of ρn (w) are the start and endpoints of w. We say that a regular expression r is proper if every starred subexpression of r has a sidekick. The notion of containing an integer.i2 · · · aik−1 . on the other hand. 2. the startpoint of ρn (w) is the integer occurring between the ﬁrst $ and # signs. Further. say j. Furthermore. the start.i occurs in w for some j. which is bounded by 2n.ik be a string covered by Kn . We start by introducing some terminology. checks whether the second integer is 0 and remembers the ﬁrst integer of the block. An is of size O(n2 log n). the total size of any Bi is O(n log n). for any w covered by Kn . For a regular expression r.2 a2.2 a2. Let w = ai0 . endpoint 1. Since An consists of a linear number of Bi subautomata. Hence. The notions of contains. for n = 4. We now prove Theorem 9(2). and checks whether the second integer is j. We say that a language L covers a string w if there exist strings u. occurs. If so. 1 and 2. u′ ∈ Σ∗ such that uwu′ ∈ L. ρn (w) contains exactly the integers contained in w. we say that i is a sidekick of r when it occurs in every nonempty string deﬁned by r. part consists of at most n + n/2 + n/4 + · · · + 1 nodes and transitions. it passes control to Bk . This concludes the proof of Theorem 9(1).j or aj. The ﬁrst. consider the subautomata Bi as illustrated in Figure 3.
there are two $symbols without an intermediate #symbol and therefore uw2 w1 u′ ∈ Hn . Therefore. because of the concatenation H of w2 and w1 . That is. uw contains at least one $ and #symbol. A Generalization of a Theorem by Ehrenfeucht and Zeiger to a Fixed Alphabet 23 Lemma 10. In a similar way. α1 = (0 + 1)∗ $(0 + 1)⌈log n⌉ #Σ∗ H 2. Towards a contradiction. Then w must contain at least one $symbol. (ii ) w contains an equal number of $ and #symbols. Let u. we refer to this integer as the start block. Assume that all strings deﬁned by s are of type1. w1 and w2 are nonempty strings in L(s) such that w1 and w2 have diﬀerent start blocks. α2 = Σ∗ #(0 + 1)⌈log n⌉ $(0 + 1)∗ . suppose it does not. suppose that w. and (iv ) between any consecutive $ and #symbol there is a string of length ⌈log(n)⌉ containing only 0’s and 1’s. (iii ) the $ and #symbols must alternate. We prove this lemma by a detailed examination of the structure of strings deﬁned by s. uw′ u′ ∈ L(r). which yields the desired contradiction. If w contains a # then uwwu′ ∈ L(r) but uwwu′ ∈ Hn which leads to the desired contradiction. Then. u′ be such that uww1 u′ ∈ L(r) and therefore uww1 u′ ∈ Hn . Towards a contradiction assume there is a type1 string w1 and a type2 string w2 in L(s). Proof. respectively. by deﬁnition of Hn . If w contains no # and therefore only consists of 0’s and 1’s. uwj u′ ∈ L(r). For any string w ∈ L(s). that is. H We refer to the strings deﬁned by α1 and α2 as strings of type1 and type2. Now also uww2 u′ ∈ L(r) as s is a starred subexpression. u′ ∈ Σ∗ such that uw2 w1 u′ ∈ L(r). That is.3. the integer i. there exist strings u. For a type1 string. u′ ∈ Σ∗ such that uwu′ ∈ L(r). w can be pumped in uwu′ and still form a string deﬁned by r. Towards a contradiction. then uwn u′ ∈ L(r) but uwn u′ ∈ Hn which again is a contradiction. one can show that (i ) w contains at least one #symbol. Now. it must be equal to the integer preceding the last $symbol in uw. Let w be a nonempty string in L(s) and let u. Any starred subexpression s of a regular expression r deﬁning Hn is proper. In addition. However. H Furthermore. From the above it follows that w matches one of the following expressions: 1. w2 w1 ∈ L(s) and thus there exist u. u′ be such that uwu′ ∈ L(r). . between the ﬁrst $ and #symbol is the same for every w ∈ L(s) which gives us our sidekick. the value of the start block of w1 is uniquely determined by uw. We next show that all strings deﬁned by s are either all of type1 or all of type2. for every j ∈ N. but uww2 u′ ∈ Hn as w2 has a diﬀerent start block.1. for every other w′ ∈ L(s). We start by making the following observation. We next argue that the substring of length ⌈log n⌉.
By Lemma 11. Further. However. Let n ≥ 2. For k = 1. Lemma 11 ([30]). The proof is by induction on k. . 1 For any regular expression r and string w. in [30] the length of an expression is deﬁned as the number of Σsymbols occurring in it. consider all subexpressions of r which are wℓ inﬁnite. For the inductive step. Fix such a regular expression r. Then. If there is a greatest integer m for which r covers wm . . a subexpression of the form a or ε can never be wℓ inﬁnite and a subexpression 1 In fact. If r is wℓ ﬁnite for some ℓ ≤ k then. .and endpoint i and only containing integers in C. by construction of w. if r is wﬁnite. jk } and i ∈ C. The index of a regular expression can be used to give a lower bound on its size according to the following lemma. Here and in the following. r ≥ 2k and we are done. Indeed. n − 1} of cardinality k and i ∈ C. we can state the most important property of Hn . we call m the index of w in r and denote it by Iw (r). this lemma still holds in our setting. . we say that an expression is minimal with respect to a set simply when no other expression in the set is a subexpression. Lemma 12.24 Succinctness of Regular Expressions The same kind of reasoning can be used to show that s has a sidekick when all deﬁned strings are of type2. For every ℓ ≤ k. such that any proper regular expression r which covers w is of size at least 2k . In this case we say that r is wﬁnite. w is covered by Hn . For any C ⊆ {0. deﬁne w = enc(i)$enc(i)#. deﬁne m = 2k+1 and set w equal to the string m m m enc(j1 )$enc(i)#w1 enc(j2 )$enc(j1 )#w2 enc(j3 )$enc(j2 )# · · · wk enc(i)$enc(jk )#. let C = {j1 . we say that r is winﬁnite. It only remains to show that any expression r which is proper and covers w is of size at least 2k . there exists a string w covered by Hn with start. . . Then. Note that jℓ ∈ Cℓ . has i as start and endpoint and only contains integers in C. We will see that all minimal elements in this set of subexpressions must be starred subexpressions. which satisﬁes all conditions and any expression covering w must deﬁnitely have size at least 2. C = {i}. Now. since our length measure also contains these Σsymbols. Iwℓ (rk ) ≥ m = 2k+1 . . Deﬁne Cℓ = C \ {j(ℓ mod k)+1 } and let wℓ be the string given by the induction hypothesis with respect to Cℓ (of size k − 1) and jℓ . Otherwise. assume that r is wℓ inﬁnite for every ℓ ≤ k. . then Iw (r) < 2r. . Proof. Therefore.
Since by Lemma 10 any expression deﬁning Hn is proper. Or.1. . a full iteration of sℓ can never contribute to the matching of any number of repetitions of wp . k = n. It then follows that sp still covers wp . they are both not a subexpression of one another. As r is proper and sℓ is a starred subexpression of r. wk . Theorem 9(2) directly follows from Lemma 12 by choosing i = 0. then sp is still wp inﬁnite. it follows that j ∈ C = {j1 . sp can only lose its wp inﬁnity by this replacement if sℓ contains a subexpression which is itself wp inﬁnite. by the induction hypothesis the size of each sℓ is at least 2k−1 . . there is an integer j such that every nonempty string in L(sℓ ) contains j. In particular. there are three possibilities: • sℓ and sp are completely disjoint subexpressions of r. Let k ′ be such that j = jk′ . . . A Generalization of a Theorem by Ehrenfeucht and Zeiger to a Fixed Alphabet 25 of the form r1 r2 or r1 + r2 can only be wℓ inﬁnite if r1 and/or r2 are wℓ inﬁnite and is thus not minimal with respect to wℓ inﬁnity. sℓ is wℓ inﬁnite. wp does not contain j. • sℓ is a strict subexpression of sp . Let E = {s1 . Among these minimal starred subexpressions for wℓ . . a contradiction. Now. . and thus sp without sℓ is of size at least 2k−1 . Now. as j is a sidekick of sℓ . This concludes the proof of Theorem 9(2). However. . By induction they must both be of size 2k−1 and thus r ≥ 2k−1 + 2k−1 = 2k . more speciﬁcally. we have that for p = k ′ − 1 (if k ′ ≥ 2. So. sk }. . and minimal in this respect). We show that if we replace sℓ by ε in sp . We can only conclude that sp without sℓ is still wp inﬁnite. all its subexpressions are also proper. recall that any nonempty string deﬁned by sℓ contains j and j does not occur in wp . wℓ only contains integers in C. As sℓ  ≥ 2k−1 as well it follows that r ≥ 2k . . for any i. . . By deﬁnition of the inductively obtained strings w1 . This is not possible since sℓ is chosen to be a minimal element from E. and p = k. otherwise). • sp is a strict subexpression of sℓ . . As.3. and sℓ is chosen to be minimal with respect to wℓ inﬁnity. Therefore. jk } must hold. choose from E some expression sℓ such that sℓ is minimal with respect to the other elements in E. To see that sp without sℓ is still wp inﬁnite. As in addition each sℓ covers wℓ . That is. Denote by sp the starred subexpression from E which is wp inﬁnite. no nonempty word w ∈ L(sℓ ) i can ever be a substring of a word wp . sℓ and sp cannot be the same subexpression of r. . this then also is a subexpression of sp and sp is chosen to be minimal with respect to wp inﬁnity. Note that since r is proper. choose one and denote it by sℓ . by the induction hypothesis. and sp deﬁnes strings not containing j (recall that it is wp inﬁnite.
we can encode Kn using log(n2 ) bits. ρ2 (w) = ρ2 (a1 ) · · · ρ2 (an ). He takes an approach similar to ours and deﬁnes a binary encoding of Kn as follows. 2 2. For (2). We conclude the section by noting that Waizenegger [107] already claimed a result similar to Theorem 9 using a diﬀerent binary encoding of Kn . for instance. For n ∈ N. if n = 2. and ρ2 (#) = bb. Thereto. The latter happens. and q0 to q ′ . and hence the proof in [107] to be incorrect. reading a 0. and. even when the alphabet is ﬁxed and binary. any regular expression deﬁning Hn is of size at least 2n . we have to insert a new state. there is a DFA An of size O(n2 log n) deﬁning Hn . thus showing that in a translation from a DFA to an RE an exponential blowup can not be avoided. let Σ2 = {a. For instance. Theorem 14. the proof of Theorem 9(2) also carries over almost literally. b}. Then. as ρ2 (0) = aa and ρ2 (1) = ab. then . it can 2 be proved that any regular expression deﬁning Hn is of size at least 2n . we can simply merge the two states to which the transitions point. For instance. as we discuss in some detail next. in some unspeciﬁed order. with transitions from q to q0 . 1. and associate a natural number from 0 to n2 − 1 to each of them. with n ≥ 2. Σn consists of n2 symbols. the automaton An can be constructed very similarly as the DFA An in Theorem 9.26 Succinctness of Regular Expressions We now extend the results on Hn . Then. For n ∈ N. 2 1. let Hn = {ρ2 (w)  w ∈ Hn }. Further. if there is a transition from state q to q ′ . 2 We can now easily extend Theorem 9 to Hn . we can deﬁne H 2 Hn as follows. Proof. when a state had an outgoing 0 and 1 transition. Hence. for w = a1 · · · an ∈ Σ∗ . then sp is still wp inﬁnite. ρ2 ($) = ba. #}. The only diﬀerence is that for every transition. to a binary alphabet. The proof of Theorem 9 carries over almost literally to the current setting. say q0 . For any n ∈ N. In 2 particular. If we list these. we believe that a more sophisticated encoding as presented here is necessary. For (1). 2 Deﬁnition 13. The obtained DFA 2 accepts Hn and is of size O(n2 log n). all starred subexpressions of an expression deﬁning Hn must still have a sidekick. as usual. if after applying this transformation. And. if we replace the starred subexpression sℓ by ε in sp . Unfortunately. we have to add an additional state to the automaton. ρ2 (1) = ab. we obtain a state which has two outgoing transitions with the same label. and deﬁne ρ2 as ρ2 (0) = aa. $. in the last and crucial step of the proof. which is deﬁned over the fourletter alphabet ΣH = {0. each reading an a.
0 . each exponent in the tower requires nesting of an additional complement. over the fourletter alphabet {0. To see the importance. Waizenegger’s claim that the speciﬁc encoding of the n2 symbols in Σ does not matter in the proof for Theorem 9 is incorrect.1 . 3. if r2 is an expression deﬁning W2 .0 . sion s with L(s) = Σ∗ \ L(r) can be constructed in time 22 2. 102].3. 2.0 a0. respectively.1 } and we can encode these by the numbers 0. This statement is incorrect. Intuitively. we can reuse our proof for Hn and show that ′ any regular expression deﬁning Hn must be of at least exponential size which is clearly false. a0. In contrast. a1. a string in Kn is encoded by replacing every symbol by its binary encoding and additionally adding an a and b to the start and end of each symbol. say Wn . if we assume that every string deﬁned by a starred subexpression starts with a #. Then. . Now. However. when the expression is deterministic its complement can be computed in polynomial time. there is a regular expression rn of size O(n) such that any regular expression r deﬁning Σ∗ \ L(rn ) is n of size at least 22 .2 Complementing Regular Expressions It is known that extended regular expressions are nonelementary more succinct than classical ones [27. For every n ∈ N. in this proof it is claimed that if rn is an expression deﬁning Wn . Theorem 15. For instance.1 becomes a00ba01b. a0. and we still use the same encoding as above. Complementing Regular Expressions 27 Σn = {a0. 1. a double exponential size increase cannot be avoided in general. However. but every regular expression deﬁning it must be of size at least exponential in n. then ((a0(0ba0)∗ 0b) + ε)r2 still deﬁnes W2 but does not have this property. a. For every regular expression r over Σ. 3. the mistake has serious consequences. It follows from this claim that every starred subexpression has a sidekick (using our terminology). For instance. we show that in deﬁning the complement of a single regular expression. Albeit small. Hn can be deﬁned by a regular expression of size O(n log(n)). 1. The latter is done by using the same proof techniques as in [30]. then every string deﬁned by a starred subexpression of rn must start with an a. Let Σ be a binary alphabet. a regular expresO(n) . Waizenegger shows that Wn can be described by an NFA of size O(k 2 · log k 2 ). b}. Our encoding used in deﬁning Hn allows to prove the sidekick lemma (Lemma 10) in a correct way. To summarize. and the rest of the proof builds ′ upon this fact. which gives a language. In this section.2. As ′ illustrated in Remark 8. 1. a1. consider the language Hn obtained from Kn like Hn but without swapping the indices of the symbols in Kn .
2n+1 • all strings where a $ is not followed by a string in (0 + 1)n #: Σ∗ $ Σ0. Afterwards. by Theorem 2(1). $. . 1. $. (1) Let r ∈ RE. For instance. the last expression above is only properly deﬁned over words of the form ((0 + 1)n $(0 + 1)n #)∗ . we only need to consider words which are not yet deﬁned by one of the foregoing expressions. Then. we show that it also holds for alphabets of size two.n−1 (# + $) + Σn (0 + 1 + #) Σ∗ • all strings where the corresponding bits of corresponding blocks are different: ((0 + 1)∗ + Σ∗ #(0 + 1)∗ )0Σ3n+2 1Σ∗ + ((0 + 1)∗ + Σ∗ #(0 + 1)∗ )1Σ3n+2 0Σ∗ . (2) Take Σ as ΣH . By Theorem 9. that is. as all other strings are already deﬁned by the previous expressions.n−1 (# + $) + Σn (0 + 1 + $) Σ∗ • all strings where a nonﬁnal # is not followed by a string in (0 + 1)n $: Σ∗ # Σ0. the total algorithm O(n) is in time 22 . So. #} is matched by none of the above expressions if and only if it ′ belongs to H2n . {0. We deﬁne an expression ′ ′ rn of size O(n). Then.2n (1 + $ + #)Σ∗ • all strings that do not end with a suﬃx in 1n $(0 + 1)n #: Σ0. #}. We ﬁrst construct a DFA A.n−1 ($ + #)Σ∗ + Σn (0 + 1 + #)Σ∗ + Σn+1. the complement of rn deﬁnes exactly H2n . Notice here that in constructing each of the above regular expressions. of size 22 . with L(A) = Σ∗ \ L(r) and then construct the regular expression s equivalent to A.2n+1 + Σ∗ (0 + 1 + $) + Σ∗ ($ + #)Σ1. Let n ∈ N. 1. any regular n expression deﬁning H2n is of size exponential in 2n . that is. such that Σ∗ \ L(rn ) = H2n . it should be clear that a string over {0.28 Succinctness of Regular Expressions Proof. ′ The expression rn is then deﬁned as the disjunction of the following expressions: • all strings that do not start with a preﬁx in (0 + 1)n $0n : Σ0.n + Σ∗ (0 + 1 + #)Σn+1 + Σ∗ (0 + $ + #)Σn+2.2n + Σ0. This hence already proves that the theorem holds for languages over an alphabet of size four. According to Theorem 2(3) and (4) A contains at most 2r+1 states and can be constructed in time exponential in the size of r.
b}. For any deterministic regular expression r over an alphabet Σ. x) = Σ \ {dm(y)  y ∈ Char(r) ∧ ∃w. Theorem 16. Complementing Regular Expressions 29 We now extend this result to a binary alphabet Σ = {a. which by Theorem 14 the desired expression as its complement is exactly H2n n 2 must be of size at least 22 . notﬁrst(r) = Σ \ ﬁrst(r). 2 • Recall that last(r) contains all marked symbols which are the last symbol of some word deﬁned by r. • For any symbol x ∈ Char(r). Thereto. there is no better algorithm than translating to a DFA. where n is the size of r. as in the context of deterministic expressions the name Glushkov construction is the most common. w′ ∈ Char(r)∗ . The proof of the next theorem shows that the complement of the Glushkov automaton of a deterministic expression can directly be deﬁned by a regular expression of polynomial size. which we refer to as Glushkov construction. It should be noted that this construction. computing the complement and translating back to a regular expression which includes two exponential steps. Proof. let ′ ρ2 (rn ) be the expression obtained by replacing every symbol σ occurring in ′ ′ rn by ρ2 (σ). a regular expression s deﬁning Σ∗ \ L(r) can be constructed in time O(n3 ). 2 Recall that dm(y) denotes the demarking of y. Hence. However. when the given regular expression is deterministic. That is. that is. notfollow(r. We introduce some notation. . the expression rn = ((a + b)(a + b))∗ (a + b) + ρ2 (rn ) is 2 . To see that the complement of rn is indeed H2n . was introduced by [14]. x) contains all Σsymbols of which no marked version can follow x in any word deﬁned by r. ρ2 (rn ) captures H ∗ exactly all other strings which do represent a string w ∈ ΣH .3. Let r be a deterministic expression over Σ and ﬁx a marking r of r. However. the complement of r is exactly H2 . we will consistently use this naming. • The set notﬁrst(r) contains all Σsymbols which are not the ﬁrst symbol in any word deﬁned by r.2. H2n n 2n The previous theorem essentially shows that in complementing a regular expression. a corresponding DFA can be computed in quadratic time through the Glushkov construction [17] eliminating already one exponential step. Then. the set notfollow(r. but are not in 2 . Then. ∗ (a + b) deﬁnes exactly all strings notice that the expression ((a + b)(a + b)) ′ which are not the encoding of some string w ∈ Σ∗ . and often goes by the name position automata [57]. wxyw′ ∈ L(r)}.
F ). Clearly. we show that L(s) = L(Gr ). a. let rx be dm(rx ). r – a transition (q0 . . The complement automaton Gc = (Qc . δ c . w ∈ L(s) if and only if w ∈ L(Gc ). from which the lemma follows. We conclude by showing that s can be constructed in time cubic in the size of r and that s deﬁnes the complement of r. dm(x). a. It is known that L(r) = L(Gr ) and that Gr is deterministic whenever r is deterministic [17]. q ′ ) ∈ δ. Finally. We prove that L(s) = Σ∗ \ L(r) by highlighting the correspondence between s and the complement of the Glushkov automaton Gr of r. First. That is. F c = {q } ∪ (Q \ F ). We are now ready to deﬁne s: init(r) + x∈last(r) / rx (ε + notfollow(r. r Thereto. all preﬁxes of strings in r ending in x. ε ∈ L(s) if and only if ε ∈ L(r) if and only if ε ∈ L(Gc ). x)Σ∗ . q0 . We prove that for any nonempty / r word w. for every a ∈ Σ. by deﬁnition of s. q ) for every state q ∈ Q and symbol a ∈ Σ for which there is no q ′ ∈ Q with (q. there is – a transition (qx . we show that the nonempty words deﬁned by the diﬀerent disjuncts of s correspond exactly to subsets of the language Gc . if y follows x in some word deﬁned by r. qx ) ∈ δ. wxu ∈ L(r)}. and ε + notﬁrst(r)Σ∗ if ε ∈ L(r). L(Gc ) = Σ∗ \ L(Gr ). q0 . where • Q = {q0 } ∪ {qx  x ∈ Char(r)}. x)Σ∗ ) + x∈last(r) rx notfollow(r. a.30 Succinctness of Regular Expressions We deﬁne the following regular expressions: • init(r) = notﬁrst(r)Σ∗ if ε ∈ L(r). dm(y). / • For every x ∈ Char(r). and (q. • F = {qx  x ∈ last(r)} (plus q0 if ε ∈ L(r)). F c ) is obtained from Gr by r making it complete and interchanging ﬁnal and nonﬁnal states. and. r c Now. The Glushkov automaton Gr is the DFA (Q. Formally. Qc = Q ∪ {q } (with q ∈ Q) and δ c contains δ plus the triples (q . δ. q ). y ∈ Char(r). rx will denote an expression deﬁning {wx  w ∈ Char(r)∗ ∧ ∃u ∈ Char(r)∗ . and • for x. qy ) ∈ δ. if x ∈ ﬁrst(r). Then.
if x occurs in α αx αβx otherwise. . all sets can be computed in time cubic in the size of r. and q only has transitions to itself. the expression rx can be constructed in time quadratic in the size of r. it follows that s can be constructed in cubic time. r We now show that s can be computed in time cubic in the size of r. or (2) arrives in qx after reading a part of w. The expression init(r) can be constructed in linear time.2. rx = • For r = α∗ . notfollow(r.3. Complementing Regular Expressions 31 • init(r) deﬁnes exactly the nonempty words for which Gc immediately r goes from q0 to q and reads the rest of the word while in q . goes to q and reads the rest of w there. So. rx (notfollow(r. • For r = y ∈ Char(r). αx if x occurs in α βx otherwise. Note that because there are no incoming transitions in q0 . r Therefore. rx = ∅. • For any x ∈ last(r). By a result of Br¨ggemannKlein [15] the Glushkov automaton Gr corresponding to u r as deﬁned above can be computed in time quadratic in the size of r. x)Σ∗ ) deﬁnes exactly all strings w for which Gc arrives in qx after reading a part of w. goes to q and reads r the rest of w there. rx = • For r = αβ. Using Gr . x) and last(r) can be computed in time quadratic in the size of r. rx (ε+notfollow(r. x)Σ∗ ) deﬁnes exactly all strings w / for which Gc either (1) arrives in qx after reading w and accepts because r qx is an accepting state. The expression rx is inductively deﬁned as follows: • For r = ε or r = ∅. rx = α∗ αx x if y = x ∅ otherwise. As rx = dm(rx ). any nonempty string w ∈ L(s) if and only if w ∈ Gc . the sets notﬁrst(r). We next show that for any x ∈ Char(r). rx = • For r = α + β. we have described exactly all accepting runs of Gc . • For any x ∈ last(r).
In particular. 3 3 ∗ • rc4 = a1 (a2 b3 c4 )∗ a2 b∗ c4 . Actually.3 Intersecting Regular Expressions In this section. and hence α is doubled.32 Succinctness of Regular Expressions The correctness is easily proved by induction on the structure of r. For the time complexity. Note that there are no ambiguities in the inductive deﬁnition of concatenation and disjunction since the expressions are marked and therefore every marked symbol occurs only once in the expression. we show that the intersection of two (or any ﬁxed number) and an arbitrary number of regular expressions are exponentially and double exponentially more succinct than regular expressions. Then the complement of r is deﬁned by ε + (b + c)Σ∗ + a(ab∗ c)∗ a(ε + aΣ∗ ) + a(ab∗ c)∗ ab∗ b(ε + aΣ∗ ) + a((b + c)Σ∗ ) + a(ab∗ c)∗ ab∗ c((b + c)Σ∗ ). 3. for r = α∗ . Let r = a(ab∗ c)∗ . respectively. 3 . Further. notice that all steps in the above inductive deﬁnition. Then. as the inductive construction continues on only one of the two operands it follows that the complete construction is at most quadratic. However. rx = α∗ αx . • ra1 = a1 . • ra2 = a1 (a2 b∗ c4 )∗ a2 . we study the succinctness of intersection. 3 • rb3 = a1 (a2 b∗ c4 )∗ a2 b∗ b3 . whereas the double exponential bound for an arbitrary Recall that an expression is singleoccurrence if every alphabet symbol occurs at most once in it. and r = a1 (a2 b3 c4 )∗ . We conclude this section by remarking that deterministic regular expressions are not closed under complement and that the constructed expression s is therefore not necessarily deterministic. except for the case r = α∗ . the exponential bound for a ﬁxed number of expressions already holds for singleoccurrence regular expressions 3 . and 3 • init(r) = ε + (b + c)Σ∗ . are linear. We illustrate the construction in the previous proof by means of an exam∗ ple.
.0 which is covered by K5 is mapped to the string £2 a2. 1. . we require that Kn covers w. with i < n. For singleoccurrence expressions this can again be done in exponential time.3◦ a3◦ . Then.ik−1 aik−1 . let Σ◦ = {ai◦ . △n−2 }. Dn = {ρn (w)△  Kn covers w ∧ w is even}. . 1. we see that r must be of size at least 2n . the string a2.4◦ a4◦ . Thereto.i1 ai1 .3. £n−1 . Proof.0 △0 ∈ E5 . Note that words in En are those in Kn where every odd position is promoted to a circled one (◦ ).ik−1 aik−1 .3 a3. We make use of the following property: Lemma 19. to allow also for strings with a diﬀerent start. (1) Analogous to Lemma 10. 2. j < n} . rK  ≤ rE  and rK deﬁnes exactly all strings of even length in Kn plus ε. In this respect.i2 · · · aik−2 . This will prove technically convenient later on. 2 (2) Let n ∈ N and rE be a regular expression deﬁning En .i2 ) · · · ρ(aik−2 . instead of requiring w ∈ Kn . Intersecting Regular Expressions 33 number of expressions only carries over to deterministic expressions. Let ΣD = {0. Then. ρ(ai.j ◦ aj ◦ .ik ) = ρ(ai0 . Any regular expression deﬁning Dn is of size at least 2n . For instance. Any regular expression deﬁning En is of size at least 2n−1 .k and n ˆ ρ(ai0 . $. j < n. We also deﬁne a variant of Kn which only slightly alters the ai. by 2 2 ε and any ai◦ . The next theorem shows the succinctness of the intersection operator. we introduce a slightly altered version of Hn .ik ).4 a4. Let rK be the regular expression obtained from rE by replacing £i and △i .j aj. △}. Deﬁnition 18. and thus also covers every string in Kn . with 0 ≤ i. Here.and endpoint than 0 and n − 1. any regular expression r deﬁning Dn must also be proper and must furthermore cover any string w ∈ Hn .j or ai.j symbols in ˆ Kn . Deﬁnition 17. by ai. ai.3 a3.3.j . #. By choosing i = 0 and k = n in Lemma 12.k ) = £i ai.3 £3 a3. Let n ∈ N.j ◦ .j ◦  0 ≤ i. 2 it immediately follows that rK and thus rE must be of size at least 2n−1 . where ˆ ˆ k is even. Since the proof in [30] also constructs a string w covered by Kn such that any proper expression covering w must be of size at least 2n−1 .i1 ai1 . respectively. For all n ∈ N. but does not change the structure of the language.j . △0 . . n E En = {ˆ(w)△i  Kn covers w ∧ w is even ∧ i is the endpoint of w} ∪ {△i  ρ 0 ≤ i < n}. Let n ∈ N and Σn = Σ◦ ∪ {£0 . . and triangles labeled with the noncircled positions are added.
any regular expression deﬁning Mn is of size at least 2n−1 . for i. . . sn checks whether the noncircled indices are equal and whether the triangles have the correct indices. ending with a △i . such that L(A) = i≤k L(Ai ). a reguk lar expression deﬁning i≤k L(rk ) can be constructed in time 2O((m+1) ) . For every n ∈ N. j. . there are deterministic regular expressions r1 . where m = i≤n ri . rn checks that every string consists of a sequence of blocks of the form £i aj. If m = max {ri   1 ≤ i ≤ k}. of size O(n) such that any regular expression deﬁning 2n i≤m L(ri ) is of size at least 2 . Outi◦ = {ai◦ . an NFA A with (m + 1)k states. for i < n. .34 Succinctness of Regular Expressions Theorem 20. 1. such that L(rn ) ∩ L(sn ) = En . . We deﬁne SOREs rn and sn of size quadratic in n. where m = max {ri   1 ≤ i ≤ k}. Ak such that L(Ai ) = L(ri ). construct NFAs A1 . Then. .k◦ ak◦ . . Ini◦ Outi◦ i◦ ∗ (△0 + · · · + △n−1 ) ∗ (Ini + ε)(£i + △i )(Outi + ε) . .j ◦  0 ≤ j < n}. . (2) Let n ∈ N. . . 4. with m = 2n + 1. Σn = E Further. and.i◦  0 ≤ j < n}. k. . Since the alphabet of En is of size O(n2 ). Ini◦ = {aj. We start by partitioning Σn in two diﬀerent ways. Let r1 . 3. Now. . also rn and sn are of size O(n2 ). △i }. Then. rn be SOREs. For any k ∈ N and regular expressions r1 . rk . can be constructed in time O((m + 1)k ) by means of a product construction. △i } = i◦ Ini◦ ∪ Outi◦ ∪ {£i . rm . For every n ∈ N. To E this end. By Lemma 19(2).i  0 ≤ j < n}.j  0 ≤ j < n}. i≤n L(rn ) can Proof.ℓ . . . A regular expression deﬁning be constructed in time 2O(m) . then by Theorem 2(4) any Ai has at most m + 1 states and can be constructed in time O(m · Σ). 2. . for any i ≤ k. deﬁne rn = (£0 + · · · + £n−1 ) and sn = i i Ini ∪ Outi ∪ {£i . Ini = {aj ◦ . . there are SOREs rn and sn of size O(n2 ) such that any regular expression deﬁning L(rn ) ∩ L(sn ) is of size at least 2n−1 . deﬁne Outi = {ai. It thus sets the format of the strings and checks whether the circled indices are equal. ℓ < n. for every i < n. a k regular expression deﬁning L(A) can then be constructed in time 2O((m+1) ) . (1) First. Further. By Theorem 2(1).
. . We deﬁne m = 2n + 1 deterministic regular expressions of size O(n). An . . rn . for any i ≤ n. . . every expression is of size O(n) and is deterministic as the Glushkov construction translates them into a DFA [17]. i i i Now. rn are SOREs. Clearly.3. if a symbol a does not qa i occur in an expression ri . . we can construct an NFA A with Σ + 1 states deﬁning i≤n L(ri ) in time cubic in the sizes of r1 . Since r1 . . such that their intersection deﬁnes D2n . This construction creates an automaton which has an initial state. (4) We show that given SOREs r1 . q0 . and additionally one state for each symbol in the regular expressions. δi . . . . So. . the (i + 1)th bit of the two numbers surrounding an even # should be equal: Σ2n+2 Σi (0Σ2n−i+1 (△ + Σn+i+1 0Σn−i−1 #)+ (1Σ2n−i+1 (△ + Σn+i+1 1Σn−i−1 #))) ∗ ∗ ∗ . . . . n − 1}. We ﬁrst construct NFAs A1 . the (i + 1)th bit of the two numbers surrounding an odd # should be equal: Σi (0Σ3n+2 0 + 1Σ3n+2 1)Σn−i−1 # △. Furthermore. . q0 . The expressions are as follows. Now. δ. . . It then follows from Theorem 2(1) that an expression deﬁning i≤n L(ri ) can be constructed in time 2O(m) . A state is accepting if all its . . and that all incoming transitions carry the label of that state. Furthermore. Fi ) then we say that Qi = {q0 } ∪ {qa  a ∈ Σ}. For all i ∈ {0. rn . For ease of exposition. we add a state qa to Qi which does not have any incoming or outgoing transitions. F ) deﬁning the intersection of A1 . Q has again an initial state and one state for each symbol: Q = {q0 } ∪ {qa  a ∈ Σ}. with only outgoing transitions. any regular n expression deﬁning D2n is of size at least 22 and the theorem follows. where i is the state labeled with a. For ease of readability. we are ready to construct the NFA A = (Q. . we could also say that each state is labeled with a symbol.3. First. . . where m = i≤n ri  since m ≥ Σ. let Ai = (Qi . . n − 1}. An such that L(Ai ) = L(ri ). by using the Glushkov construction [15]. . . There should be an even length sequence of blocks: (0 + 1)n $(0 + 1)n #(0 + 1)n $(0 + 1)n # △. Intersecting Regular Expressions 35 (3) Let n ∈ N. . for every symbol there exists at most one state labeled with that symbol in any Ai . . all incoming transitions for that state are labeled with that symbol. . we denote ΣD simply by Σ. the intersection of the above expressions deﬁnes D2n . By Lemma 19(1). . For all i ∈ {0.
Since the Glushkov construction b takes quadratic time [15]. We can thus obtain the following corollary. and hence only use a binary alphabet. Notice that for the results in Theorem 20(2) it is impossible to avoid this. such that any regular expression deﬁning L(rn ) ∩ L(sn ) is of size at least 2n−1 . of size O(n). L(A) = n. such that any regular expression deﬁning 2n i≤m L(ri ) is of size at least 2 . However. qa ∈ Fi }. in every Ai : δ = {(qa . Let Σ be an alphabet with Σ = 2: 1. b. with m = 2n + 1. Finally. qb )  ∀i ≤ i . Now. . the total construction can be done in cubic time. a can denote 0 or an alphabet symbol. Here. as the number of SOREs over an alphabet of ﬁxed size is ﬁnite. . (qa i≤n L(Ai ). if we weaken the statements and allow to use unrestricted regular expressions in Theorem 20(2) and (3) (instead of SOREs and deterministic expressions) it is easy to adapt the proofs to use the language 2 Kn . . there are regular expressions rn and sn over Σ. rm over Σ.36 Succinctness of Regular Expressions i corresponding states are accepting: F = {qa  ∀i ≤ n. . 2. We note that the lower bounds in Theorem 20(2) and (3) do not make use of a ﬁxed size alphabet. For every n ∈ N. there are regular expressions r1 . For every n ∈ N. and we have to construct n automata. Corollary 21. of size O(n2 ). b. there is a transition between qa and i i qb if there is a transition between qa and qb . q i ) ∈ δ i }. .
specifying that a. specifying that there must occur at least two and at most ﬁve a’s.4 Succinctness of Extended Regular Expressions Next to XML schema languages.5 . In this chapter. in the XML schema language Relax NG [22]. The class RE(∩) is a well studied extension of the regular expressions. A problem we consider. but also allow the use of additional operators. For instance. and is often referred to as the semiextended regular expressions. b. These applications. and interleaving (RE(&)) operators. we study the succinctness of regular expressions extended with counting (RE(#)). thus making them harder to handle. usually do not restrict themselves to the standard regular expression using disjunction (+). intersection (RE(∩)). concatenation (·) and star (∗ ). and c may occur in any order. they can have a drastic impact on succinctness. The counting operator allows for expressions such as a2. The interleaving operator allows for expressions such as a & b & c. These RE(#)s are used in egrep [58] and Perl [108] patterns and in the XML schema language XML Schema [100]. for instance. is the translation of extended regular expressions . however. and is used. it is wellknown that expressions extended with the complement operator can describe certain languages nonelementary more succinct than standard regular expressions or ﬁnite automata [102]. Although these operators mostly do not increase the expressive power of the regular expressions. regular expressions are used in many applications such as text processors and programming languages [108].
This has been . which we here improve to 2Ω(n) . The latter problem occurs in the learning of XML schema languages [8. First. On the other hand. However. as XML Schema is the widespread W3C standard. 9. Apart from a pure mathematical interest. #) can be translated in exponential time into an NFA. the latter result has two important consequences. A second consequence concerns the automatic discovery of regular expressions describing a set of given strings. Over a constant size alphabet. However. a √ 2Ω( n) lower bound is reported in [90]. and Relax NG is a more ﬂexible alternative. it has already been shown by Kilpelainen and Tuhkanen [64] and Mayer and Stockmeyer [79] that such an exponential size increase can not be avoided for RE(#) and RE(&). For the standard regular expressions. For RE(#) the complexity of this translation is exponential [64] while it follows from the results in the previous chapter that for RE(∩) it is double exponential. At present these algorithms do not take into account the interleaving operator. the alphabet of the counterexamples grows linear with n. such a translation would be more than desirable. it is wellknown that such a translation can be done eﬃciently [15]. As the translation of extended regular expressions to NFAs already involves an exponential size increase. Therefore. They show that any regular expression deﬁning the language (a1 b1 )∗ & · · · & (an bn )∗ must be of size at least double exponential in n. Compared to √ Ω(n) Ω( n) ). ∩. this gives a tighter bound (22 instead of 22 and shows that the double exponential size increase already occurs for very simple expressions. whereas the alphabet size is constant for the languages in √ Ω( n) this chapter. It should be noted here that Gruber and Holzer [51] obtained similar results. it prohibits an eﬃcient translation from Relax NG (which allows interleaving) to XML Schema Deﬁnitions (which only allows a very limited form of interleaving).38 Succinctness of Extended Regular Expressions into (standard) regular expressions. For the translation from RE(∩) to NFAs. 11].1 We show that also in constructing an expression for the interleaving of a set of expressions (an hence also for an RE(&)) a double exponential size increase can not be avoided. when considering problems such as membership. 1 Ω( √ n) . the result in this chapter. such an approach is less fruitful. and inclusion testing for regular expressions the ﬁrst step is almost invariantly a translation to a ﬁnite automaton. respectively. but for Relax NG this would be wise as this would allow to learn signiﬁcantly smaller expressions. We also consider the translation of extended regular expressions to NFAs. We show that an RE(&. it is natural to ask what the size increase for DFAs The bound following from the results in the previous chapter is 22 Ω(n) improved in [51] to 22 . they improved our 22 bound Ω(n/ log n) to 22 . This is the main technical result of the chapter. For extended regular expressions. equivalence.
49]. 29) √ 2Ω( n) (Thm. 94] problems. Related work. thus giving a double exponential translation. 31) 2 O(n) 22 (Prop. Schott and Spehner give lower bounds for the translation of the interleaving of words to DFAs [96]. and RE(&). 28) (Thm. are the results on state complexity [111]. 39) 2 O(n) 22 (Prop. #) [64] (Prop. we show that this is not possible for the classes RE(#). For each class we show that in a translation to a DFA. we can conclude that given a set of regular expressions. An overview of all results is given in Figure 4. and Gruber and Johannsen [53]. it is not too hard to see that also a DFA of exponential size accepting their intersection can be constructed. but also the classes RE(#) [64. 26) Ω(n) [79] 2 O(n) (Prop. In particular. Succinctness of regular expressions has been studied by Ehrenfeucht and Zeiger [30] and. 50. but diﬀerent in nature. is. 89. Table (b) lists some related results obtained in the previous chapter and [48]. 83] and RE(&) [79] have received interest. we can translate any NFA into a DFA in exponential time. Also related. by Ellul et. RE(∩). from the results in the previous chapter.39 NFA RE(#) RE(∩) RE(&) RE(&. 61. and regular expressions. 32) θ(n) 22 2θ(n) RE RE ∩ RE RE RE & RE 2Ω(n) Ω( 22 √ n) [48] (Ch. more recently.1: Table (a) gives an overview of the results in this chapter concerning the complexity of translating extended regular expressions into NFAs. the RE(∩) and its membership [90.1(a). However. 71] and equivalence and emptiness [35. in which the impact of the application of diﬀerent operations on ﬁnite automata is studied. DFAs. Proposition and theorem numbers are given in brackets. Gruber and Holzer [48. 3) Ω(n) [48] 2 (b) Figure 4. The diﬀerent classes of regular expressions considered here have been well studied.1(b). Of course. 25) 2 2Ω(n) 2Ω(n) Ω(n) 22 Ω(n) 22 DFA (Prop. but can we do better? For instance. a double exponential size increase can not be avoided. In the present chapter. al [31]. . constructing an NFA for their intersection can not avoid an exponential size increase. ∩. 27) (a) RE [64] [51] √ 2Ω( n) (Thm. Some relevant results are listed in Figure 4.
denoted by sh(L). both u is reachable from v. v ∈ U ∧ (u. which will allow us to reduce our questions about the size of regular expressions to questions about the star height of regular languages. For a set of vertices U ⊆ V . where F = {(u. A (directed) graph G is a tuple (V. denoted by cr(G). is again deterministic. the star height of a regular expression r. 4.40 Succinctness of Extended Regular Expressions Outline. and sh(r∗ ) = sh(r)+1. where V is the set of vertices and E ⊆ V × V is the set of edges. E) is strongly connected if for every pair of vertices u. and v is reachable from u. . A may have at most one ﬁnal state and the automaton obtained by inverting every transition in A. A graph (U. such that the inverse of A is again deterministic. The latter two concepts are related through the following theorem due to Gruber and Holzer [48]. A set of edges V ′ ⊆ V is a strongly connected component (SCC) of G if G[V ′ ] is strongly connected and for every set V ′′ . We now introduce the cycle rank of a graph G = (V. A graph G = (V.3. G[V ′′ ] is not strongly connected. E). is the graph (U. denoted by G[U ].2. Then any regular expression 1 deﬁning L is of size at least 2 3 (sh(L)−1) − 1. E).4 we study the translation of extended regular expressions to NFAs. the subgraph of G induced by U . sh(r2 )}.1 Deﬁnitions and Basic Results In this section we present some additional deﬁnitions and basic results used in the remainder of this chapter. for a ∈ Σ. respectively.1 we give some additional necessary deﬁnitions and present some basic results. v ∈ V . In Sections 4. and exchanging the initial and ﬁnal state. Formally. It is inductively deﬁned as follows: (1) if G is acyclic or empty. In Section 4. Theorem 22 ([48]). is the minimal star height among all regular expressions deﬁning L. and regular expressions. F ). and otherwise (3) cr(G) = maxV ′ ∈ SCC(G) cr(G[V ′ ]). That is. then cr(G) = minv∈V cr(G[V \ {v}]) + 1. sh(r1 r2 ) = sh(r1 +r2 ) = max {sh(r1 ). then cr(G) = 0. which is a measure for the structural complexity of G. Intuitively. denoted by sh(r). A language L is bideterministic if there exists a DFA A. Let L be a regular language. v) ∈ E}. 4. otherwise (2) if G is strongly connected. and 4. The star height of a regular language L. sh(∅) = sh(ε) = sh(a) = 0. Let SCC(G) denote the set of strongly connected components of G. DFAs. v)  u. F ) is a subgraph of G if U ⊆ V and F ⊆ E. with V ′ V ′′ . accepting L. equals the number of nested stars in r.
Now. while McNaughton [80] essentially showed that the same holds when contracting edges. Theorem 24(3) is due to McNaughton [80]. q ′ )  ∃a ∈ Σ. It should be clear that B is nonreturning and L(B) = L(A). which inherits all incoming and outgoing edges of u and v. The underlying graph G of A is the graph obtained by removing the labels from the transition edges of A. 3. We show that. [80] Proof. For any regular language L. (q. and F B = F ∪ {qB }. There is a strong connection between the star height of a regular language. F B ). sh(L) = min {cr(A)  A is an NFA accepting L} [29. F ) be an NFA over Σ. Intuitively. contracting an edge between two vertices u and v. cr(B) = cr(A) because qB has no incoming . it has been shown by Cohen [23] that removing edges or nodes from a graph does not increase the cycle rank of a graph. then sh(L) = cr(A). then cr(H) ≤ cr(G). and contracting edges. a ∈ Σ. we construct a nonreturning NFA B = (QB . δ B = δ∪{(qB . with L(A) = L and cr(A) = sh(L). a. a. / B is A extended with a new initial state which only inherits the outgoing transitions of the old initial state. referring to the cycle rank of its underlying graph. as witnessed by the following theorem. a. Theorem 24. Deﬁnitions and Basic Results 41 We say that a graph H is a minor of a graph G if H can be obtained from G by removing edges. First. we construct B sl in two steps. q ′ ) ∈ δ}. E). and F B = F if q0 ∈ F .1. We only have to prove (2). we can construct a nonreturning statelabeled NFA B sl equivalent to A such that cr(A) · Σ ≥ cr(B sl ) from which the theorem follows. δ B . In the following. with L(B) = L(A). sh(L) · Σ ≥ min{cr(A)  A is a nonreturning statelabeled NFA such that L = L(A)}. 1. given A. q0 . with E = {(q. Lemma 23 ([23. means replacing u and v by a new vertex. F ) be an NFA. otherwise. or more formally G = (Q. Thereto. 23]. 2. we often abuse notation and for instance say the cycle rank of A.4. and the cycle rank of the NFAs accepting it. if L is bideterministic. Here. Let A = (Q. Furthermore. q)  q ∈ Q. where A is the minimal trimmed DFA accepting L. Let A = (Q. removing vertices. q) ∈ δ}. qB . δ. we know that there exists an NFA A. q0 . as follows: QB = Q⊎{qB }. δ. 80]). let L be a regular language over an alphabet Σ. (q0 . Theorem 24(1) is known as Eggan’s Theorem [29] and proved in its present form by Cohen [23]. By Eggan’s Theorem (Theorem 24(1)). If a graph H is a minor of a graph G.
From B. k]. Obviously. b. it is shown in [48] that removing a set of states P from a graph can never decrease its cycle rank by more than P . pb )  q. i cr(B[Qi ]) · Σ ≥ cr(B sl [Qsl ]) for all i ∈ [1. k]. B sl is a nonreturning statelabeled NFA with L(B) = L(B sl ). cr(B) = cr(B sl ) = 0. From the deﬁnitions it then follows that cr(B) = cr(A). F sl ) be deﬁned as • Qsl = {q a  q ∈ QB . b ∈ Σ. let P sl = {q a  q ∈ P. for some a ∈ Σ. . and hence cr(B) · Σ ≥ cr(B sl ). We conclude by showing that cr(B) · Σ ≥ cr(B sl ). for a set of states P . and as Qsl  = Σ. if B is strongly connected. each of which captures all incoming transitions of q for one alphabet symbol. Otherwise. we abuse notation and. If QB  = 1 then either the single state does not contain a loop. and hence cr(B) · Σ ≥ cr(B sl ). then cr(B) = cr(B[Q\{q}])+1. which by the induction hypothesis gives us cr(B)·Σ ≥ cr(B sl [Qsl \{q}sl ])+Σ. . deﬁned in the obvious way. We now show cr(B) · Σ ≥ cr(B sl ) by induction on the number of states of B. Therefore. for a set of states P of B. Now. if B consists of strongly connected components Q1 . and observe that (B[Q \ P ])sl = B sl [Qsl \ P sl ] always holds. • δ sl = {(q a . Qk . and hence it follows from Lemma 23 that i cr(B sl ) = maxV ∈SCC(B sl ) cr(B sl [V ]) ≤ maxi≤k cr(B sl [Qsl ]). by construction. For the induction step. such that cr(B) = cr(B sl ) = 0. from which it follows that i cr(B) · Σ ≥ cr(B sl ). δ sl . for some q ∈ QB . i Then. But now. then. in which case cr(B) = 1.42 Succinctness of Extended Regular Expressions transitions and it thus forms a separate strongly connected component in B whose cycle rank is 0. Now. B sl [V ] is a minor of B sl [Qsl ]. b. again. a ∈ Σ}. cr(B sl [Q \ {q}sl ]) + Σ ≥ cr(B sl ). Then. . (q. cr(B sl ) ≤ Σ holds. we now construct the nonreturning statelabeled NFA B sl such sl that cr(B sl ) ≤ cr(B) · Σ = cr(A) · Σ. then cr(B) = maxi≤k cr(B[Qi ]). Otherwise. q0 . by induction. as {q}sl  = Σ. if B is acyclic. cr(B) · Σ = cr(B[Q \ {q}]) · Σ + Σ. for some i ∈ [1. . or the single state contains a self loop. In the following. p) ∈ δ B }. a. sl a • q0 = qB . a ∈ Σ} That is. and • F sl = {q a  q ∈ F B . write B[P ] for the subautomaton of B induced by P . every strongly connected component V of B sl is contained in Qsl . . a ∈ Σ}. with k ≥ 2. p ∈ QB . Finally. Let B sl = (Qsl . B sl contains Σ copies of every state q of B.
n n For RE(#) this is witnessed by the expression a2 . Let r be a RE(&.r. (q1 . let A1 accept L(r1 ). a. Furthermore.r.2 and for RE(&) by the expression a1 & · · · & an . by constructing the NFA by induction on the structure of the expression.t. Succinctness w. Then. respectively. p2 ))  (q2 . (p1 . δ i . except for δ which now equals {((q1 . F i ). The present tighter statement. We give the full construction for the three special operators: i • If r = r1 ∩ r2 . For r = r1 ∩ r2 or r = r1 & r2 . q0 )  qi ∈ Qi ∧ ∃pi ∈ F i such that (qi . For RE(∩). and δ = i≤ℓ δ i ∪ {(qi . q0 . a. deﬁne A = (Q. We show that such a translation can be done in exponential time. however. ∗ r = r1 + r2 . the total construction can be done in time 2O(r) . #). then A is deﬁned exactly as for r1 ∩ r2 . pi ) ∈ δi }.4. a. and r1 this can easily be done using standard constructions. Then. F ) accepting r1 as follows: Q = i≤ℓ Qi . hence. a. Proposition 25. and the induction cases r = r1 r2 . We construct A by induction on the structure of the formula. F = F1 × F2 . and r = a. a. A contains at most ℓ · 2r1  = 2r1 +log ℓ ≤ 2r states. An NFA A with at most 2r states. such that L(r) = L(A). q0 ). let Bi = (Qi . We argue that A contains at most 2r states. a. q2 ). a. and δ = {((q1 . can be constructed in time 2O(r) . For the base cases. q0 . we study the complexity of translating extended regular expressions into NFAs. A contains at most 2r1  ·2r2  = 2r1 +r2  ≤ 2r states. For r = r1 . a. p2 ) ∈ δ2 }. q2 ))  (q1 . ℓ]. r = ε. q2 ). (p1 . k. who already used it for a translation from RE(&) to NFAs.ℓ and. as the intermediate automata never have more than 2r states and we have to do at most r such constructions. p2 ))  (q1 . Proof. δ i . 2]. q0 . This exponential size increase can not be avoided for any of the classes.t. For i ∈ [1. similarly. r = ∅. F i ) accept L(ri ). for i ∈ [1. let B1 to Bℓ be ℓ identical copies i of A1 with disjoint sets of states. 1 2 A = (Q. as already observed by Kilpelainen and Tuhkanen [64] √ and Mayer and Stockmeyer [79]. . i+1 F = k≤i≤ℓ F i .ℓ • If r = r1 . by induction A1 and A2 contain at most 2r1  and 2r2  states k.2. • If r = r1 & r2 . NFAs 43 4. let Ai = (Qi . p2 ) ∈ δ2 }. q0 = (q0 . a.ℓ 0 Now. δ. for a ∈ Σ. q2 ). p1 ) ∈ δ1 ∧ (q2 . a 2Ω( n) lower bound has already been reported in [90]. k. F ) is deﬁned as Q = Q1 × Q2 . NFAs In this section. δ. ∩.2 Succinctness w. q0 . We note that the construction for the interleaving operator is introduced by Mayer and Stockmeyer [79]. p1 ) ∈ δ1 }∪{((q1 .
Proposition 28. and w′ ai ∈ L(rn ) but / wai and w′ ai are either both accepted or both not accepted by A. Proof. wai ∈ L(rn ).r.2 . F ) accepting L(rn ) has at least 22 states. We show that n any DFA A = (Q. but that is identical). or r& contains at least 2n states. Proposition 27. there exists some i ∈ [1. this double exponential size increase can not be avoided.44 Succinctness of Extended Regular Expressions will follow from Theorem 29 and the fact that any NFA with n states can be translated into a DFA with 2n states (Theorem 2(2)). RE(∩). #). we can immediately conclude the following. any DFA accepting L(rn n n . But now. can be constructed in time 22 . Therefore. and the ith position of w′ contains a b (or the other way around. Towards a n contradiction. which gives us the desired contradiction. For any n ∈ N there exists an RE(#) rn of size O(n) such n that any DFA accepting L(rn ) contains at least 22 states. w′ and a state q of A such that both (q0 . there exist an RE(#) r# . from Proposition 25 and the fact that any NFA with n states can be translated into a DFA with 2n states in exponential time (Theorem 2(2)). A DFA A with at most 22 O(r) such that L(r) = L(A). We now move to regular expressions extended with the intersection operator. DFAs In this section. r states. q) ∈ δ ∗ . δ.3 Succinctness w. The succinctness of RE(∩) with respect to DFAs can be obtained along the same lines as the simulation of exponential space turing machines by RE(∩) by F¨rer [35]. q0 . rn is of size O(n) since the integers in the numerical predicate are stored in binary. u ∩ Theorem 29. each of size O(n). as w = w′ . w′ . ∩. there must be two diﬀerent such strings w. such that any NFA accepting r# . For any n ∈ N there exists an RE(∩) rn of size O(n) such that ∩ ) contains at least 22n states. Here. First. Proposition 26. For any n ∈ N. and an RE(&) r& . q) ∈ δ ∗ and (q0 .t. or RE(&). Let r be a RE(&. we study the complexity of translating extended regular expressions into DFAs. w. As there are exactly 22 such strings. for each of the classes RE(#). suppose that A has less than 22 states and consider all strings n of length 2n containing only a’s and b’s. 4. Let n ∈ N and deﬁne rn = (a + b)∗ a(a + b)2 . an RE(∩) r∩ . and hence L(rn ) = L(A). r∩ . 2n ] such that the ith position of w contains an a. We show that. n and A contains less than 22 states.
let encR (i) denote the reversal of enc(i). a binary number in which the rightmost 1 and all following 0’s are marked. j ∈ [0. DFAs 45 Proof. $. D = {$. and σ ∈ U . deﬁne enc(w) = encR (0)a0 enc(0)$encR (1)a1 enc(1)$ · · · encR (2n − 1)a2n −1 enc(2n − 1) and. . enc(w) = 00a00$10b01$01b10$11a11 and hence #00a00$10b01$01b10$11a11#00a00$10b01$01b10$11a11 ∈ G2 . Further. All strings which do not end with enc(2n − 1) = 1n−1 1: Σ∗ (Σ1 + Σ1 Σ1. . Now. The expressions are as follows.. Now. 0. We conclude ∩ by showing that we can construct a regular expression rn of size O(n) deﬁning Gn . We deﬁne N = {0. each capturing a possible mistake in a string.4. rn is simply the disjunction of these expressions. and w = abba. to deﬁne it succinctly by an RE(∩) expression. 2n − 1] are such that j = i + 1 (mod 2n ). 2n −1] let enc(i) denote the nbit marked number encoding i. ﬁnally. lets consider Gn . it is straightforward to n show that any DFA accepting Gn must contain at least 22 states. Thereto. Let n ∈ N. we ﬁrst deﬁne a marked number as a string over the alphabet ∗ ∗ {0. S = {a. for any i ∈ [0. All strings which do not start with #: ε + Σ# Σ∗ . similar to the in one in the proof of Proposition 28. then the bits of i and j which are diﬀerent are exactly the marked bits of j. b. We start by describing the language Gn which will be used to establish the lower bound. #} and for any set U . However. All strings in which two symbols at a distance n + 1 do not match: Σ∗ (N Σn (S + D) + SΣn (S + N ) + DΣn (D + N ))Σ∗ .n−1 ) . a2n −1 . 1. . 0. let Uσ = U \{σ}. Now. b}. Then. for n = 2. #}. 1} deﬁned by the regular expression (0 + 1)∗ 10 + 0 . 1. Succinctness w. 1. b}: {ww  w = 2n }. 1.3. 1}. for a string w = a0 a1 . Then. for n = 2.t.e. For instance. the complement of Gn . Note that Σ = {0. we need to add some additional information to it.r. This will be a variation of the following language over the alphabet {a. 0. a. It is wellknown that this language is hard to describe by a DFA. For instance. deﬁne Gn = {#enc(w)#enc(w)  w ∈ L((a + b)∗ ) ∧ w = 2n }. where the following is observed: if i. These marked numbers were introduced in [35]. Using a standard argument. we construct ∩ a set of expressions. enc(1) = 01 and enc(2) = 10 and they diﬀer in both bits as both bits of enc(2) are marked. i.
the following is the desired expression: Σ∗ (asn b + bsn a)Σ∗ . we obtain: ′ Σ∗ (rn ∩ ((0 + 0)Σ∗ (0 + 1) + (1 + 1)Σ∗ (1 + 0)))Σ∗ . a string is not in Gn if and only if it is accepted by at least one of ∩ the previous expressions. such that L(ri ) = {wσw′  σ ∈ S ∧ w. are of size O(n). n]. i. By induction. deﬁnes exactly Gn . for all i ∈ [0. the following is the desired expression: Σ∗ (rn ∩ σΣ∗ Σσ )Σ∗ . . Furthermore. inductively as follows: r0 = S and. # # All strings which contain a (nonreversed) binary number which is not correctly marked: Σ∗ SN ∗ ((0 + 1)(0 + $ + #) + (0 + 1)N0 )Σ∗ . s0 = Σ∗ #Σ∗ . and hence rn is also of size O(n). Then. for all j ∈ [1. w ∈ Σ∗ ∧ w = i}. Now. n]. in∩ cluding the inductively deﬁned ones. we ﬁrst deﬁne expressions ri . notice that all expressions. deﬁned as the disjunction of all these expressions. Exactly as above. All strings which contain more or less than 2 #symbols Σ∗ (# + ε)Σ∗ + Σ∗ #Σ∗ #Σ∗ #Σ∗ . rn . for all i ∈ [0. si = Σsi−1 Σ ∩ σ∈Σ σΣ∗ σ. we can ′ ′ inductively deﬁne rn such that L(rn ) = {wσw′  σ ∈ D ∧ w. and for all i ∈ [1. Then. v.e. there is a substring of the form enc(i)$encR (j) or enc(i)#encR (j) such that j = i + 1 (mod 2n ). Then. This concludes the proof. We now deﬁne expressions si . w′ ∈ Σ∗ ∧ w = w′  ∧ w ≤ n}.. σ∈N n All strings in which the binary encodings of two numbers surrounding a $ or # do not diﬀer by exactly one. All strings in which the binary encodings of two numbers surrounding an a or b are not each others reverse. rj = r0 + Σrj−1 Σ.n−1 Σ0 ))Σ∗ .46 Succinctness of Extended Regular Expressions All strings in which # occurs before any number other than encR (0) or where $ occurs before encR (0): Σ∗ ($0 + #(Σ0. All strings in which two a or b symbols which should be equal are not equal. n] such that L(si ) = {wu#vwR  u. n]. w′ ∈ Σ∗ ∧ w = w′  ∧ w ≤ i}. Thereto. Hence.
Then. but the set of strings n 0 of the form a2 c2 · · · a2 c2 deﬁned by rc corresponds exactly to the set of strings n 0 deﬁned by r. that any DFA accepting L(rn ∩ Proof. Then. for any i ∈ N. from Proposition 25 and the fact that any NFA with n states can be translated into a regular expression in time 2O(n) (Theorem 2(1)) it immediately follows that: . First. deﬁne pumpi (w) = ai ci ai ci · · · ai ci . In this sense. we can now prove the following theorem. Succinctness w. 2. 4. a string a0 · · · an ∈ L(r) = L(r1 ) ∩ L(r2 ) if c c 2 2 and only if a0 c · · · an c ∈ L(r1 ) ∩ L(r2 ) if and only if a0 c2 · · · an c2 ∈ L(rc ). w ∈ L(r ∩ ) if and only if pump (w) ∈ such that for some k ∈ N and any w ∈ Σ k n & & L(rn ). let rn be the regular expression simulating rn obtained from Lemma 30. and for i = 1. & ∩ Now. Let w = a0 · · · an be a string over an alphabet Σ. We do this by using a technique of Mayer and Stockmeyer [79] which allows. we study the translation of extended regular expressions to (standard) regular expressions. & Theorem 31. deﬁne ri as the expression obtained from ri by replacing any symbol a in ri by ac.r. for the class RE(#) it has already been shown by Kilpelainen and Tuhkanen [64] that this translation can be done in exponential time.r.t. The latter technique can be extended to general RE(∩) and RE(&).4 Succinctness w. For any n ∈ N there exists an RE(&) rn of size O(n2 ) such & ) contains at least 22n states. exactly as before.4. we need some notation. it is easily seen that a string a0 · · · an ∈ L(ri ) if and only if a0 c · · · an c ∈ L(ri ).4. not containing intersection operators. Furthermore. 0 1 k Lemma 30 ([79]). and that an exponential size increase can not be avoided. Let c c be a symbol not occurring in r. To formally deﬁne this. where r1 and r2 are standard regular expressions. Then. Regular Expressions 47 We can now extend the results for RE(∩) to RE(&). We illustrate their technique in a simple case. rc simulates r. Then. Then. rn is of size O(n2 ) and. for an expression r = r1 ∩ r2 .t. and let c be a symbol not in Σ. Regular Expressions In this section. Let n ∈ N and consider the expression rn of size O(n) constructed in n ∩ Theorem 29 such that any DFA accepting L(rn ) contains at least 22 states. not all strings deﬁned by rc are of the form a2 c2 · · · a2 c2 . Using this lemma. to simulate an RE(∩) expression by an RE(&) expression. Let r be an RE(∩) containing k ∩operators. Now. ∗ . in some sense. w ∈ L(r) if and only if pumpk (w) ∈ L(s). there exists an RE(&) s of size at most r2 such that for any w ∈ Σ∗ . consider c c the expression rc = r1 & r2 . it is straightforward n & to show that any DFA accepting L(rn ) contains at least 22 states. So.
1. .48 Succinctness of Extended Regular Expressions Proposition 32. The star height of languages will be our tool for proving lower bounds on the size of regular expressions deﬁning these languages.n−1 where k ∈ N ∪ {0}. we proceed in several steps and deﬁne several families of languages.4. 4. for any v ∈ Vn+1 . We start by observing that. the inverse of the DFA AK accepting Kn is again detern ministic as every transition is labeled with a diﬀerent symbol. n sh(Kn ) = cr(AK ).3. and establish its star height. Then. diﬀerent from the binary encoding introduced in Section 3. For n = 1. ∩.i2 · · · aik . for any n ∈ N.1 that the language Kn . on which all following languages will be based. for any n ∈ N.i1 ai1 . and show that these languages can be deﬁned as the intersection of small regular expressions. for any n ∈ N. Thereto. this then leads to the desired result: a double exponential lower bound on the translation of RE(&) to RE. Since Vn+1 consists of only one strongly connected component. sh(Kn ) = n. we deﬁne the family (Ln )n∈N which is a binary encoding of (Kn )n∈N . We now determine the star height of Kn . Lemma 33. We proceed by induction on n. Let r be a RE(&. by Theorem 24(3). we revisit the family of languages (Kn )n∈N . Finally. #). similar to the simulation of RE(∩) by RE(&) in Section 4. a double exponential size increase can not be avoided. cr(AK ) = n. However. which we denote by Kn . it only remains to show a double exponential lower bound on the translation from RE(&) to standard regular expressions. Kn+1 [Vn+1 \ {v}] is isomorphic to Kn . and hence by the induction hypothesis cr(Kn+1 ) = n + 1. For any n ∈ N. we deﬁne the family (Mn )n∈N which is obtained by simulating the intersection of the previously obtained regular expressions by the interleaving of related expressions. the language Kn is bideterministic. Bringing everything together. consists of all strings of the form a0. First. For the inductive step. We conclude by showing that. Hence. which is exactly what we will do in the rest of this section. AK is the minimal trimmed DFA accepting Kn .1 Kn : The Basic Language Recall from Section 3. the graph is a single node with a self loop and hence by deﬁnition cr(K1 ) = 1. suppose that cr(Kn ) = n and consider Kn+1 with node set Vn+1 . from Theorem 20 it follows that in a translation from RE(∩) to standard regular expressions. Furthermore. A regular expression s equivalent O(r) to r can be constructed in time 22 Furthermore. Hence. cr(Kn+1 ) = 1 + minv∈Vn+1 {cr(Kn+1 [Vn+1 \ {v}])}. ﬁrst observe that the graph underlying AK is the complete graph n (including selfloops) on n nodes. Indeed. n n Thereto. Proof.
and hence there exists a regular expression rK . 0.4.j ). Obviously. We now show that this encoding does not aﬀect the star height.4.n−1 ).r. For any n ∈ N. deﬁne the function ρn as ρn (ai. 1. $. n − 1]}. Ln = {ρn (w)  w ∈ Kn }. Unfortunately. Ideally.j ∈ Σn .j by ρn (ai. The proof is along the same lines as the proof of Lemma 33. the encoding starts by the encoding of the second index. AL will consist of n identical subautomata Bn to Bn n . Regular Expressions 49 4. Here. Let n ∈ N. By Lemma 33. for ai. we will introduce in this section the family of languages (Ln )n∈N which is a binary encoding of (Kn )n∈N over a ﬁxed alphabet. Lemma 35. we would like to use the languages of the family (Kn )n∈N for this as we have shown in the previous section that they have a large star height.i1 · · · aik−1 . We extend the deﬁnition of ρn to strings in the usual way: ρn (a0. So. sh(Kn ) = n. We now show that sh(Ln ) ≥ n. of double exponential size).2 ∈ K3 and hence ρ3 (a0. Therefore. Proof. sh(Ln ) = n.j ) = #enc(j)$enc(i)△enc(i + 1)△ · · · △enc(n − 1)△. let n ∈ N and recall that Kn is deﬁned over the alphabet Σn = {ai. We are now ready to deﬁne Ln .n−1 ) = ρn (a0. this is not possible as the alphabet of (Kn )n∈N grows quadratically with n.i1 ) · · · ρn (aik . a0. where enc(k).2 Ln : Binary Encoding of Kn In this section we want to construct a set of small regular expressions such that any expression deﬁning their intersection must be large (that is. we now construct the minimal DFA AL n 0 n−1 accepting Ln . For n ∈ N. Deﬁnition 34. is precisely the bideterminism property. △}. Thereto. diﬀerent from the one in Chapter 3. For instance. We ﬁrst show that Ln is bideterministic and can then determine its star height by looking at the minimal DFA accepting it. denotes the ⌈log(n)⌉bit marked number encoding k as deﬁned in the proof of Theorem 29. L(rL ) = Ln and sh(rL ) = n. and hence sh(Ln ) ≤ n. To show that Ln is bideterministic. followed by an ascending sequence of encodings of all numbers from the ﬁrst index to n − 1.2 ) = #01$00△01 △10△#10$01△10△ ∈ L3 .4. the reason we use a slightly involved encoding of Kn .1 a1. for k ∈ N. and thus by Theorem 22 can not be deﬁned by small expressions. We ﬁrst show that sh(Ln ) ≤ n. Now.1 a1. with L(rK ) = Kn and sh(rK ) = n. Succinctness w.t. In fact. 1. Let Σ = {0. #.j  i. Let rL be the regular expression obtained from rK by replacing every symbol ai. j ∈ [0. for n = 3.
n n . qn . j / j 0 2 As an example. afi ter passing through Bn which ends by reading a new substring #enc(j). #enc(j). observe that AK . the j j automaton moves to state qi of subautomaton Bn . Then. w. . is a minor of AL . j L. F i ) is the smallest i contains distinct states q i . pi ) ∈ δi . enc(j)△. it follows from Theorem 24(3) that sh(Ln ) = cr(AL ). Figure 4. (qj . n − 1].2: The automaton B3 . qn .2 shows B3 . AL is bideterministic and minimal. one labeled with $ and one labeled with △. This ensures that the subsequent ascending sequence of numbers starts with enc(i). . Hence. the qj each have exactly two incoming transitions. the automaton moves to subautomaton Bn . Indeed. Figure 4. Minimality follows immediately from the fact that AL is both trimmed and bideterministic. j ∈ [0. . Furthermore. i i ∗ i i • (qj .3 shows A3 L To see that An indeed accepts Ln . n − 1]. are equal. the minimal n n DFA accepting Kn . w. qi )  i. F ) is deﬁned n j n−1 as Q = i<n Qi . and / ∗ ∗ i ∗ i • (qn . we can easily contract edges and n 0 n−1 remove nodes from AK such that the only remaining nodes are qn to qn . Thereto. δi . δ. (qn . For any i ∈ [0. . for some i. F = {qn } and δ = i<n δi ∪ {(pi . qj+1 ) ∈ δi . n it suﬃces to show that cr(AL ) ≥ n. . q i and pi . n Now. and for all w = enc(j)△. i i deﬁned as follows. qj+1 ) ∈ δi . pi ) ∈ δi . since Ln is bideterministic and AL is the minimal trimmed DFA n accepting Ln .50 Succinctness of Extended Regular Expressions 0 0 p2 0 2 q0 0 0 △ 2 q1 0 1 △ 2 q2 1 0 △ 2 q3 # 0 1 1 p2 1 0 p2 2 2 Figure 4. since it is shown in Lemma 33 K that cr(An ) = n and by Lemma 23 and the fact that AK is a minor of AL . . the automaton checks correctly whether the numbers which should be equal. notice that all states except the qj only have one incoming transition. AL = (Q. n − 1]}. Now. . and for all w = #enc(j). Bn = (Qi . Therefore. . To see that it is bidetern i ministic. i Furthermore. notice that after reading a substring i #enc(i). $. Now. and n such that they form a complete graph. pi automaton for which Q n n−1 0 0 such that for any j ∈ [0.
This n n completes the proof. rm .4. Every number before a $ should be equal to the number following the next $. 1. . 0. let Σσ = Σ \ {σ}. We deﬁne two sets of regular expressions.t. n − 1].3: The DFA AL . 1. Proof. Lemma 36. . 3 L we know that cr(AL ) ≥ cr(AK ). Regular Expressions 51 $ 00△ 01△ 10△ 1 B3 1 q2 1 q3 #00 #01 p1 0 1 q0 1 q1 p1 1 $ 00△ $ 01△ 10△ #00 0 q2 0 q3 0 B3 p0 0 $ $ #10 p1 2 0 q0 0 q1 #01 p0 1 $ #10 p0 2 $ 2 q0 $ 00△ 2 q1 2 B3 #00 #01 p2 0 01△ 2 q2 10△ 2 q3 p2 1 $ #10 p2 2 Figure 4. each of size O(n). 1}. for all i ∈ [0. 0. and recall that L2n is deﬁned over the alphabet Σ = {0. Furthermore. . D = {$. #.4. The expressions are as follows. Let N = {0. accepting L3 . Every number should be properly marked: (D+ ((0 + 1)∗ 10 + 0 ))∗ . First. $. 1. Succinctness w. with m = 4n + 3. it follows that sh(Ln ) = cr(An ) ≥ n. The format of the string has to be correct: (#N n $N n △(N n △)+ )+ . △} and for any σ ∈ Σ. For every n ∈ N. . there are regular expressions r1 .r. it can be shown that Ln can be described as the intersection of a set of small regular expressions. the ∗ ∗ . #. Let n ∈ N. △}. such that i≤m L(ri ) = L2n .
This approach will partly yield the following family of languages. n − 1]. # # $ Every ascending sequence of numbers should end with enc(2n − 1) = 1n−1 1: (#Σ∗ 1n−1 1△)∗ . deﬁne Mn = {pump4⌈log n⌉+3 (w)  w ∈ Ln } . similar to the simulation of RE(∩) by RE(&) in Section 4. n − 1]. For any n ∈ N. For any n ∈ N. the (i + 1)th bit of the number before an odd $ should be equal to the (i + 1)th bit of the number after the next $. # # $ $ Every two (marked) numbers surrounding a △ should diﬀer by exactly one. # 4. we will ﬁnally show that RE(&) are double exponentially more succinct than standard regular expressions. the (i + 1)th bit of the number before an odd △ should properly match the (i + 1)th bit of the next number: (#Σ∗ $Σn (△Σi ((0 + 0)Σn (1 + 0) + (1 + 1)Σn (0 + 1))Σn−i−1 )∗ (△Σn + ε)△)∗ . we can easily extend the result on the star height of Ln (Lemma 35) to Mn : Lemma 37. n − 1]. for all i ∈ [0.4. Again. # # $ $ Second. As Mn is very similar to Ln . # # $ Second.52 Succinctness of Extended Regular Expressions (i + 1)th bit of the number before an even $ should be equal to the (i + 1)th bit of the number after the next $. for all i ∈ [0. . by the interleaving of related expressions. the (i + 1)th bit of the number before an even △ should properly match the (i + 1)th bit of the next number: (#Σ∗ ((△+$)Σi ((0+0)Σn (1+0)+(1+1)Σn (0+1))Σn−i−1 )∗ ((△+$)Σn +ε)△)∗ . we deﬁne two sets of regular expressions. #Σ∗ (#N i # σ∈N (σΣ∗ $Σ∗ $N i σ)Σ∗ )∗ (ε + #Σ∗ ) .3. (#N i σ∈N (σΣ∗ $Σ∗ $N i σ)Σ∗ )∗ (ε + #Σ∗ ) . We do this by simulating the intersection of the regular expressions obtained in the previous section. for all i ∈ [0. sh(Mn ) = n.3 Mn : Succinctness of RE(&) In this section. First.
since A is a nonreturning statelabeled NFA of minimal cycle rank. w′ ∈ Σ∗ . Now. as these have two incoming transitions: one labeled with $ and one labeled with △. as in the following lemma.r. such that index(L) ≤ k. by Theorem 24(1). Now. wai . we can can construct an NFA B such that L(B) = L ∩ Σ(k) and B is a minor of A. we deﬁne Σ(k) to be the language deﬁned by the expression ( σ∈Σ σ k )∗ . . and let B be as constructed above. the proof n proceeds exactly as the proof of Lemma 35. and thus sh(L) · Σ ≥ sh(L ∩ Σ(k) ). Lemma 38. Furthermore. First. Let A be any nonreturning statelabeled NFA of minimal cycle rank accepting L. cr(B) ≥ sh(L(B)) = sh(L ∩ Σ(k) ).4. F ) be a nonreturning statelabeled NFA. if we merge the last m states of these two sequences (the parts reading cm ). We conclude by giving the construction of B. for any q ∈ Q. Let L be a regular language. Succinctness w. Then.t. with m = 4⌈log n⌉ + 3.4. for which we ﬁrst introduce some notation. q) ∈ δ ∗ } . ina (q) represents the maximal number of a symbols which can be read when entering state q. We show how this implies the lemma. Mn is bideterministic as witnessed by the minimal DFA AM accepting Mn . we need an additional lemma. Notice that index(L) can be inﬁnite. Proof. Naively creating these i two sequences of states leading to qj . Again. The proof is along the same lines as the proof of Lemma 35. we can conclude from Theorem 24(2) that sh(L) · Σ ≥ cr(A). Intuitively. We show that. n In fact. we obtain the minimal trimmed DFA AM accepting Mn . However. and an alphabet Σ. However. Regular Expressions 53 Proof. For k ∈ N. AM is simply obtained from AL by replacing each transition over a n n symbol a by a sequence of states reading am cm . however. we will only be interested in languages for which it is ﬁnite. Further. a ∈ Σ such that wai w′ ∈ L}. since B is a minor of A it follows from Lemma 23 that cr(B) ≤ cr(A). deﬁne the function ina (q) as follows: ina (q) = max{i  i ∈ N ∪ {0} ∧ ∃w ∈ Σ∗ such that (q0 . given a nonreturning statelabeled NFA A accepting L. deﬁne index(L) = max {i  i ∈ N ∧ ∃w. Notice that as i can equal zero and index(L) is ﬁnite. Thereto. However. such that index(L) ≤ k. and k ∈ N. Some i care has to be taken. Then. δ. Let L be a regular language.e. and k ∈ N. q0 . for a language L. ina (q) is well deﬁned for any state which is not useless. and hence cr(A) ≥ sh(L ∩ Σ(k) ). all strings which consist of a sequence of blocks of identical symbols of length k. and a ∈ Σ. i. for the states qj . we obtain a nonminimal DFA. sh(L) · Σ ≥ sh(L ∩ Σ(k) ). the language we will eventually deﬁne will not be exactly Mn . let A = (Q. Therefore..
. For any q ∈ Q. and recompute ina (q) for all q ∈ Q. qf ) ∈ δ ∗ }. L(Ai ) ∩ Σ(k) ⊆ L(Ai+1 ) ∩ Σ(k) . as L(A) ∩ Σ(k) ⊆ L(B) ∩ Σ(k) ⊆ L(B) then easily follows. It suﬃces to show that for all i ∈ [1. such that insymbol(q) (q) < k and symbol(q) = a ⇒ remove (q. where each Ai is obtained from Ai−1 by applying exactly one rule and possibly removing useless states. qf ∈ F such that (q. we ﬁrst prove that L(A) ∩ Σ(k) ⊆ L(B). . q = q0 . we prove for every rule separately that if Ai+1 is obtained from Ai by applying this rule.. q ′ ) from δ. accepting L ∩ Σ(k) . is the desired automaton. Before proving L(Ai ) ∩ Σ(k) ⊆ L(Ai+1 ) ∩ Σ(k) . Apply one of the following rules. let A1 . with (q. That is. we introduce some notation. we deﬁne outa (q) = max{i  i ∈ N ∪ {0} ∧ ∃w ∈ Σ∗ . (4. To show that L(B) = L(A) ∩ Σ(k) . we prove that for any w ∈ L(Ai ) ∩ Σ(k) . remove q from F . such that ina (q ′ ) > ina (q) + 1 ⇒ remove (q. it also holds that index(Ai ) ≤ k. q ′ ) ∈ δ and q = q0 . (c) If exists q ∈ F . into B. until no more changes are made: 1. More precisely. such that insymbol(q) (q) < k ⇒ make q nonﬁnal. a ∈ Σ. n]. be the sequence of NFAs produced by the algorithm. q ′ ) ∈ δ. a. the automaton obtained when no more rules can be applied. Thereto. q ′ ) from δ. a ∈ Σ. with (q.1) . a.54 Succinctness of Extended Regular Expressions Now. An . outa (q) is similar to ina (q) and represents the maximal number of a’s which can be read when leaving q. then the assumption that the removed state or transition is used in an accepting run of Ai on w leads to a contradiction. i. Therefore. that B is a minor of A and that L(B) = L(A) ∩ Σ(k) . ina (q) + outa (q) ≤ k . for any Ai . Here. a. accepting L. q ′ ∈ Q. the following algorithm transforms A. Repeat the following two steps. Remove all useless states from A. any state q of Ai and any a ∈ Σ. q ′ ∈ Q. 2. any accepting run of Ai on w is still an accepting run of Ai+1 on w. It is immediate that B is a minor of A since B is obtained from A by only removing transitions or states. it holds that We are now ready to show that L(Ai ) ∩ Σ(k) ⊆ L(Ai+1 ) ∩ Σ(k) . Thereto. a ∈ Σ. with A1 = A and An = B. n − 1]. . a ∈ Σ. Since of course L(Ai ) ⊆ L(A) holds for all i ∈ [1. (b) If exists q. ai w. a. if possible: (a) If exists q. It remains to show that B.e. .
a. If u = ε. Let (q. we obtain ina (pj ) = k−(i−j). a ∈ Σ. v = bv ′ for b ∈ Σ. v. we describe the obtained contradictions less formal. i.4. q ′ ). Otherwise. As rule (1a) can not be applied to transition (q. b = a and v ′ ∈ Σ∗ and hence pi must have an outgoing blabeled transition.. Furthermore.t. and outa (q ′ ) ≥ k − j − 1. pj+1 ) ∈ δB for all j ∈ [0. again. It immediately follows that ina (q) ≥ j. symbol(pi ) = a. For rules (1b) and (1c). with (q. since ina (q ′ ) > ina (q) + 1. w = uav for u. B only accepts strings in Σ(k) . then p0 = q0 . v. Now. q ′ ) ∈ δB : ina (q) + 1 = ina (q ′ ) . by repeatedly applying equation (4. with b = a = symbol(pi ). By (4. and hence insymbol(q) (q) ≥ k. q ′ ∈ QB . a. i − 1].1). then q must be preceded by k symbol(q) symbols. q ′ ). q0 . . qf ) ∈ δB for some qf ∈ F . (4. u ∈ L(ε + Σ∗ (Σ \ {a})) and v ∈ L(ε + (Σ \ {a})Σ∗ ).4. ina (p0 ) = k−i > 0. To see why. / there must exist i ∈ [1. a. Succinctness w. q ′ ) is removed but (q. q B B does not contain useless states. suppose transition (q. contradiction. and (pj . qf ) ∈ δ ∗ . u. it is easy to see that ina (q ′ ) ≥ ina (q) + 1. it suﬃces to show that L(B) = L(B) ∩ Σ(k) . If v = ε. pi . This gives us the desired contradiction. . Contradiction. i. there ∗ ∗ also exist states p0 . (q0 . k − 1].2). F B ). such that w = uai v. p0 ) ∈ δB . n − 1] and u′ . Since ﬁrst make the following observation. v ′ ∈ Σ∗ such that u = u′ aj and v = ak−j−1 v ′ . a ∈ Σ. . q) ∈ δ ∗ and (q ′ . We consider two cases. u. Towards a contradiction. ina (pi ) ≥ k must hold. i] and. We ′ ) be a transition in δ . . Otherwise. a. we now obtain the following equality: For q. there exists a j ∈ [0. As B is statelabeled. a. then q is entered after reading k symbol(q) symbols. a. for all j ∈ [0. Then. if u = u′ b for u′ ∈ Σ∗ and b = a. B is also nonreturning and hence ina (q0 ) = 0 should hold.. this contradicts equation (4.r. in particular. Regular Expressions 55 For rule (1a).e. We ﬁrst argue that ina (pi ) = k. we have ina (q ′ ) + outa (q ′ ) > j + k − j − 1 + 1 = k. if the transition (q. let B = (QB . suppose that B accepts a string w ∈ Σ(k) . we distinguish two cases. v ∈ Σ∗ . δB . if q is the accepting state of some run on w. and hence insymbol(q) (q) ≥ k. Then. Since L(B) ⊆ L(A) deﬁnitely holds. then p0 has an incoming b transition. a. For rule (1c). B We now show L(B) ⊆ L(A) ∩ Σ(k) . Notice that. then pi ∈ F and hence by rule (1c) ina (pi ) ≥ k. However. q ′ ) occurs in an accepting run of Ai on w.e. is used in an accepting run on w. . as w ∈ L(B).2) We are now ready to show that L(B) ⊆ L(A) ∩ Σ(k) .1) it suﬃces to show that ina (pi ) ≥ k. such that (q0 . as i > 0. (pi . For rule (1b). with symbol(q) = a. As A was nonreturning and we did not introduce any new transitions. for some qf ∈ F . By rule (1b). Since w ∈ Σ(k) . Thereto.
then (2) index(N2n ) ≤ m. For every n ∈ N. it follows immediately from the construction in [79] that any string in N2n ∩ Σ(m) is of the form am cm am cm · · · am cm . we are ﬁnally ready to prove the desired theorem: Theorem 39. rm it is possible to construct regular expressions s1 . again. As furthermore. . it hence follows that M2n = N2n ∩Σ(m) . by Lemma 37. Since i≤m L(ri ) = L2n . . Now. sh(M2n ) = 2n n n and index(N2n ) ≤ m.. . we note that all lower bounds in this chapter make use of a constant size alphabet and can furthermore easily be extended to a binary alphabet. . . . every symbol ai by bi ck−i+1 . As a ﬁnal remark. Σ So. it follows from Lemma 38 that sh(N2n ) ≥ sh(M2 ) = 28 . Corollary 40. but any regular expression deﬁning N2n must. w ∈ i≤m L(ri ) if and only if pumpm (w) ∈ N2n .e. sm such that (1) for all i ∈ [1. and hence. This concludes the proof. ina (p0 ) = 0 should hold. For any language over an alphabet Σ = {a1 . it is shown in [79] that given r1 . . m]. be the regular expressions obtained in Lemma 36 such that i≤m L(ri ) = L2n . . . it is shown in [81] that this transformation does not aﬀect the star height. there exists an RE(&) rn of size O(n2 ) such 1 n that any regular expression deﬁning L(rn ) must be of size at least 2 24 (2 −8) −1. and if we deﬁne N2n = L(s1 ) & · · · & L(sm ). ak }. Let n ∈ N. for any i ∈ [1. Furthermore. and hence the lower bounds on the sizes of the regular expression also carry over. there are regular expressions s1 . . such that any regular expression deﬁning 1 n L(s1 ) & L(s2 ) & · · · & L(sm ) is of size at least 2 24 (2 −8) − 1. . . N2n can be described by the interleaving of the expressions s1 to sm . Proof. by Theorem 22. 2 1 l pumpm (w) for some w ∈ Σ∗ . . . . . . 1 n be of size at least 2 24 (2 −8) − 1. similar to Lemma 30. For any n ∈ N. rm . with m = 4n + 3. each of size O(n). and (3) for every w ∈ Σ∗ . and let r1 . Obviously. i. Furthermore.56 Succinctness of Extended Regular Expressions p0 does not have an incoming a transition. and the lower bounds on the number of states of DFAs trivially carry over. . This completes the proof. . with m = 4n + 3. the size of a regular expression for this new language is at most k + 1 times the size of the original expression. . we obtain a new language over the alphabet {b. k]. and M2n = {pumpm (w)  w ∈ L2n }. sm . . c} by replacing. This completes the chapter. each of size O(n). si  ≤ 2ri . Now.
An overview of the results is given in Tables 5. Therefore.5 Complexity of Extended Regular Expressions In this chapter. &) is an NFA with two additional features: (i) split and merge transitions to handle interleaving. To study these problems.1.1 Automata for RE(#. Outline. and intersection nonemptiness problem for regular expressions extended with counting and interleaving operators.2 we establish the complexity of the basic decision problems for diﬀerent classes of regular expressions. we introduce NFA(#. and. &). inclusion. we deﬁne NFA(#. an NFA(#. &) We introduce the automaton model NFA(#. as done in Chapter 8. we study the complexity of the equivalence. In Section 5. [75] relevant to XML schema languages. 5. we also investigate CHAin Regular Expressions (CHAREs) extended with a counting operator. The main reason for this study is pinpointing the complexity of these decision problems for XML schema languages. In brief. an extension of NFAs with counter and split/merge states for dealing with counting and interleaving operators. In Section 5. &). &)s.2. (ii) counting states and transitions to deal with numerical .1 and 5. al. CHAREs are a subclass of the regular expressions introduced by Martens et.
A conﬁguration γ is a pair (P. δ) where • Q is a ﬁnite set of states. respectively. The new results are printed in bold.1: Overview of some new and known complexity results. . &) is a tuple A = (Q. For readability. . . α) where P ⊆ Q is a set of states and α : Q → {0. such that min(q) ≤ max(q).58 Complexity of Extended Regular Expressions inclusion pspace ([102]) expspace ([79]) expspace ([83]) EXPSPACE EXPSPACE RE RE(&) RE(#) RE(#. &)s to obtain complexity upper bounds in Section 5. An NFA(#. &) equivalence pspace ([102]) expspace ([79]) expspace ([83]) EXPSPACE EXPSPACE intersection pspace ([70]) PSPACE PSPACE PSPACE PSPACE Table 5. f ∈ Q are the start and ﬁnal states. All results are completeness results. s. occurrence constraints. &) as follows. The idea of split and merge transitions stems from J¸drzejowicz and Szepietowski [60]. &) as they can express shuﬄeclosure which is not regular. . NFA(&).2 and Chapter 8. We then deﬁne an NFA(#. None of the cited papers consider the complexity of the basic decision problems of their model. we associate a lower bound min(q) ∈ N and an upper bound max(q) ∈ N ∪ {∞}. and. Their automata are more general than e NFA(#. max(A)} is the value function mapping states . we denote Σ ∪ {ε} by Σε . Counting states are also used in the counter automata of Kilpel¨inen and Tuhkanen [65]. Deﬁnition 41. We will use NFA(#. and NFA(#. Zilio and Lugiez [26] also proposed an automaton model that incorporates counting and interleaving by means of Presburger formulas. &) NFA(#). • s. To every q ∈ Q. • δ is the transition relation and is a subset of the union of the following sets: (1) (2) (3) (4) Q × Σε × Q Q × {store} × Q Q × {split} × Q × Q Q × Q × {merge} × Q ordinary transition (resets the counter) transition not resetting the counter split transition merge transition Let max(A) = max{{max(q)  q ∈ Q ∧ max(q) ∈ N} ∪ {min(q)  q ∈ Q ∧ max(q) = ∞}}. f. a and Reuter [93] although these counter automata operate quite diﬀerently from NFA(#)s.
5.1. Automata for RE(#, &)
59
to the value of their counter. For a state q ∈ Q, we denote by αq the value function mapping q to 1 and every other state to 0. The initial conﬁguration γs is ({s}, αs ). The ﬁnal conﬁguration γf is ({f }, αf ). When α is a value function then α[q = 0] denotes the function obtained from α by setting the value of q to 0 while leaving other values unchanged. Similarly, α[q ++ ] increments the value of q by 1, unless max(q) = ∞ and α(q) = min(q) in which case also the value of q remains unchanged. The rationale behind the latter is that for a state which is allowed an unlimited number of iterations, as max(q) = ∞, it is suﬃcient to know that α(q) is at least min(q). We now deﬁne the transition relation between conﬁgurations. Intuitively, the value of the state at which the automaton arrives is always incremented by one. When exiting a state, the state’s counter is always reset to zero, except when we exit through a counting transition, in which case the counter remains the same. In addition, exiting a state through a noncounting transition is only allowed when the value of the counter lies between the allowed minimum and maximum. The latter, hence, ensures that the occurrence constraints are satisﬁed. Split and merge transitions start and close a parallel composition. A conﬁguration γ ′ = (P ′ , α′ ) immediately follows a conﬁguration γ = (P, α) by reading σ ∈ Σε , denoted γ →A,σ γ ′ , when one of the following conditions hold: 1. (ordinary transition) there is a q ∈ P and (q, σ, q ′ ) ∈ δ such that min(q) ≤ α(q) ≤ max(q), P ′ = (P − {q}) ∪ {q ′ }, and α′ = α[q = 0][q ′++ ]. That is, A is in state q and moves to state q ′ by reading σ (note that σ can be ε). The latter is only allowed when the counter value of q is between the lower and upper bound. The state q is replaced in P by q ′ . The counter of q is reset to zero and the counter of q ′ is incremented by one. 2. (counting transition) there is a q ∈ P and (q, store, q ′ ) ∈ δ such that α(q) < max(q), P ′ = (P − {q}) ∪ {q ′ }, and α′ = α[q ′++ ]. That is, A is in state q and moves to state q ′ by reading ε when the counter of q has not reached its maximal value yet. The state q is replaced in P by q ′ . The counter of q is not reset but remains the same. The counter of q ′ is incremented by one. 3. (split transition) there is a q ∈ P and (q, split, q1 , q2 ) ∈ δ such that min(q) ≤ α(q) ≤ max(q), P ′ = (P − {q}) ∪ {q1 , q2 }, and α′ = α[q = 0][q1 ++ ][q2 ++ ]. That is, A is in state q and splits into states q1 and q2 by reading ε when the counter value of q is between the lower and upper bound. The state q in P is replaced by (split into) q1 and q2 . The counter of q is reset to zero, and the counters of q1 and q2 are incremented by
60 one.
Complexity of Extended Regular Expressions
4. (merge transition) there are q1 , q2 ∈ P and (q1 , q2 , merge, q) ∈ δ such that, for each j = 1, 2, min(qj ) ≤ α(qj ) ≤ max(qj ), P ′ = (P − {q1 , q2 }) ∪ {q}, and α′ = α[q1 = 0][q2 = 0][q ++ ]. That is, A is in states q1 and q2 and moves to state q by reading ε when the respective counter values of q1 and q2 are between the lower and upper bounds. The states q1 and q2 in P are replaced by (merged into) q, the counters of q1 and q2 are reset to zero, and the counter of q is incremented by one. For a string w and two conﬁgurations γ, γ ′ , we denote by γ ⇒A,w γ ′ when there is a sequence of conﬁgurations γ →A,σ1 · · · →A,σn γ ′ such that w = σ1 · · · σn . The latter sequence is called a run when γ is the initial conﬁguration γs . A string w is accepted by A if and only if γs ⇒A,w γf with γf the ﬁnal conﬁguration. We usually denote ⇒A,w simply by ⇒w when A is clear from the context. We denote by L(A) the set of strings accepted by A. The size of A, denoted by A, is Q + δ + Σq∈Q log(min(q)) + Σq∈Q log(max(q)). So, each min(q) and max(q) is represented in binary.
store
dvd
10,12 10,12
cd
store
Figure 5.1: An NFA(#, &) for the language dvd10,12 & cd10,12 . For readability, we only displayed the alphabet symbol on nonepsilon transitions and counters for states q where min(q) and max(q) are diﬀerent from one. The arrows from the initial state and to the ﬁnal state are split and merge transitions, respectively. The arrows labeled store represent counting transitions. An example of an NFA(#, &) deﬁning dvd10,12 & cd10,12 is shown in Figure 5.1. An NFA(#) is an NFA(#, &) without split and merge transitions. An NFA(&) is an NFA(#, &) without counting transitions. Clearly NFA(#, &) accept all regular languages. The next theorem shows the complexity of translating between RE(#, &) and NFA(#, &), and between NFA(#, &) and NFA. In its proof we will make use of the corridor tiling problem. Thereto, a tiling instance is a tuple T = (X, H, V, b, t, n) where X is a ﬁnite set of tiles, H, V ⊆ X × X are the horizontal and vertical constraints, n is an integer in
5.1. Automata for RE(#, &)
61
unary notation, and b, t are ntuples of tiles (b and t stand for bottom row and top row, respectively). A correct corridor tiling for T is a mapping λ : {1, . . . , m}×{1, . . . , n} → X for some m ∈ N such that the following constraints are satisﬁed: • the bottom row is b: b = (λ(1, 1), . . . , λ(1, n)); • the top row is t: t = (λ(m, 1), . . . , λ(m, n)); • all vertical constraints are satisﬁed: ∀i < m, ∀j ≤ n, (λ(i, j), λ(i+1, j)) ∈ V ; and, • all horizontal constraints are satisﬁed: ∀i ≤ m, ∀j < n, (λ(i, j), λ(i, j + 1)) ∈ H. The corridor tiling problem asks, given a tiling instance, whether there exists a correct corridor tiling. The latter problem is pspacecomplete [104]. We are now ready to prove Theorem 42. Theorem 42. (1) Given an RE(#, &) expression r, an equivalent NFA(#, &) can be constructed in time linear in the size of r. (2) Given an NFA(#, &) A, an equivalent NFA can be constructed in time exponential in the size of A. Proof. (1) We prove the theorem by induction on the structure of RE(#, &)expressions. For every r we deﬁne a corresponding NFA(#, &) A(r) = (Qr , sr , fr , δr ) such that L(r) = L(A(r)). ∗ For r of the form ε, a, r1 · r2 , r1 + r2 and r1 these are the usual RE to NFA constructions with εtransition as displayed in text books such as [55]. We perform the following steps for the numerical occurrence and interleaving operator which are graphically illustrated in Figure 5.2. The construction for the interleaving operator comes from [60]. (i) If r = (r1 )k,ℓ and A(r1 ) = (Q1 , s1 , f1 , δ1 ), then • Qr := Qr1 ⊎ {sr , fr , qr };
• min(sr ) = max(sr ) = min(fr ) = max(fr ) = 1, min(qr ) = k, and max(qr ) = ℓ; • if k = 0 then δr := δr1 ⊎{(sr , ε, sr1 ), (fr1 , ε, qr ), (qr , store, sr1 ), (qr , ε, fr )}; and, • if k = 0 then δr := δr1 ⊎{(sr , ε, sr1 ), (fr1 , ε, qr ), (qr , store, sr1 ), (qr , ε, fr ), (sr , ε, fr )}.
62
Complexity of Extended Regular Expressions
(ii) If r = r1 & r2 , A(r1 ) = (Qr1 , sr1 , fr1 , δr1 ) and A(r2 ) = (Qr2 , sr2 , fr2 , δr2 ), then • Qr := Qr1 ⊎ Qr2 ⊎ {sr , fr };
• δr := δr1 ⊎ δr2 ⊎ {(sr , split, sr1 , sr2 ), (fr1 , fr2 , merge, fr )}. Notice that in each step of the construction, a constant number of states are added to the automaton. Moreover, the constructed counters are linear in the size of r. It follows that the size of A(r) is linear in the size of r. The correctness of the construction can easily be proved by induction on the structure of r. (2) Let A = (QA , sA , fA , δA ) be an NFA(#, &). We deﬁne an NFA B = 0 (QB , qB , FB , δB ) such that L(A) = L(B). Formally, • QB = 2QA × ({1, . . . , max(A)}QA ); • sB = ({sA }, αsA ); • FB = {({fA }, αfA )} if ε ∈ L(A), and FB = {({fA }, αfA ), ({sA }, αsA )}, / otherwise; • δB = { (P1 , α1 ), σ, (P2 , α2 )  σ ∈ Σ and (P1 , α1 ) ⇒A,σ (P2 , α2 ) for conﬁgurations (P1 , α1 ) and (P2 , α2 ) of A}. Obviously, B can be constructed from A in exponential time. Notice that the size of QB is smaller than 2QA  · 2A·QA  . Furthermore, as B mimics A by storing the conﬁgurations of B in its states, it is immediate that L(A) = L(B). We next turn to the complexity of the basic decision problems for NFA(#, &). Theorem 43. (1) equivalence and inclusion for NFA(#, &) are expspacecomplete; (2) intersection for NFA(#, &) is pspacecomplete; and, (3) membership for NFA(#) is nphard, membership for NFA(&) and NFA(#, &) are pspacecomplete. Proof. (1) expspacehardness follows from Theorem 42(1) and the expspacehardness of equivalence for RE(&) [79]. Membership in expspace follows from Theorem 42(2) and the fact that inclusion for NFAs is in pspace [102].
• min(sr ) = max(sr ) = min(fr ) = max(fr ) = 1;
. f. Set γj = ({sj }. where wi = a1 · · · ai is the preﬁx of w guessed up to now. αj ) such that γsj ⇒Aj . we guess it symbol by symbol and keep for each Aj one current conﬁguration γj on the tape. &) to NFA(#. δ) has size at most Q + Q · log(max(A)). For j ∈ {1. the algorithm operates as follows. αj ) such that ′ . (3) The membership problem for NFA(#.ii) can be done using only polynomial space. &). s. . . We show that the membership problem for NFA(#)s is nphard by a reduction from a modiﬁcation of integer knapsack. αj ). αj ) ⇒Aj . (2) The lower bound follows from [70].1. Indeed. n}. &) and step (2.5. &) store 63 sr ε sr1 fr1 ε qr k. as a conﬁguration of an NFA(#. the tape contains for each Aj a conﬁguration γj = (Pj . n}. . Automata for RE(#.2: From RE(#. sj .a (Pj j (i) Guess an a ∈ Σ. 1. . . . While not every γj is a ﬁnal conﬁguration ′ (ii) Nondeterministically replace each γj by a (Pj′ . αsj ) for j ∈ {1. Instead of j=1 guessing w at once. We show that the problem is in pspace. Given a set of natural numbers W = {w1 . &)s is easily seen to be in pspace by an ontheﬂy implementation of the construction in Theorem 42(2). (Pj . let Aj = (Qj . &) A = (Q. We deﬁne this problem as follows. δj ) be an NFA(#. . at each time instant. . α′ ). . . wk } and two integers . ℓ ε fr ε if k = 0 sr1 sr sr2 fr1 fr fr2 Figure 5. . Formally. The algorithm accepts when each γj is a ﬁnal conﬁguration. The algorithm proceeds by guessing a Σstring w such that w ∈ n L(Aj ). the overall algorithm operates in pspace. More precisely. &). we can store a conﬁguration using only polynomial space. As the algorithm only uses space polynomial in the size of the NFA(#. 2.wi (Pj . fj .
3. . we show that membership for NFA(&)s is pspacehard. f. In each iteration. f ). Using numerical occurrence constraints. k}. • (q. • (qwi . the problem asks whether there exists a mapping τ : W → N such that m ≤ w∈W τ (w) × w ≤ n. ε q ε store ε m. k}. store. s). We construct an NFA(#) A = (Q. ε. q) for each i ∈ {1. q2 . n f Figure 5. and L(A) = ∅ otherwise. Finally. m. . an accepting computation exists if and only if there is a mapping τ such that m ≤ w∈W τ (w) × w ≤ n. n has a solution. a state qwi for each weight wi .64 Complexity of Extended Regular Expressions store q w1 s ε ε q wk w k . q2 ): States q1 and q2 are read. . The set of states Q consists of the start and ﬁnal states s and f . ε. max(q) = n and min(qwi ) = max(qwi ) = wi for each qwi . mergesplit. m and n. split into states q1 2 . This problem is known to be npcomplete [36]. we describe some nary merge and split transitions which can be rewritten in function of the regular binary split and merge transitions. . Formally. k}. . and a state q. and. . . ′ ′ 1.3: nphardness of membership for NFA(#). a successful computation of A loops at least m and at most n times through state q. wk w1. Before giving the proof. q) for each i ∈ {1. (q1 . Intuitively. . We set min(s) = max(s) = min(f ) = max(f ) = 1. The latter mapping is called a solution. . s. . q1 . we can ensure that a computation accepts if and only if it passes at least m and at most n times through q and a multiple of wi times through each qwi . δ) such that L(A) = {ε} if W. • (qwi . store. The automaton is graphically illustrated in Figure 5. min(q) = m. A also visits one of the states qwi . w1 store . the transitions of A are the following: • (s. . • (q. . ε. . and immediately ′ and q ′ . qwi ) for each i ∈ {1. all in binary notation. Hence.
where the state pi says that the tile at position i has to be updated.5. H. xj ) ∈ V . . . qn ): State q1 is read. . q2 ) is equal to the ′ . So. and im′ . and t = x2 x2 x1 . (q1 . qn . .1. and A can accept 1 3 1 when the state set is {x1 . and are merged into ′ state q1 . the transition (q1 . . q1 2 h state. q2 and q3 are read. . if b = x1 x3 x1 . say xi . x2 . xn . ′ 4. we reduce from corridor tiling. n). q2 . since i j xj can be the ﬁrst tile of the row on top of the current row. the state set always consists of one state pi . . where q is a new auxiliary transitions (q1 . for which (xi . Transitions of type 3 and 4 can be rewritten using (n − 1) regular transitions and (n − 1) new auxiliary ′ ′ states. . mergesplit. . for i every tile xj . . To show that membership for NFA(&)s is pspacehard. merge. Given a tiling instance T = (X. 2 2 1 It remains to describe how A can transform the current row (“state set”). then the initial state set is {x1 . &) 65 ′ ′ ′ 2. . Therefore. q1 . Now. into a state set which describes a valid row on top of the current row. For example. A must at any time reﬂect the current row in its state set. merge. . The automaton proceeds in this manner for the remainder of the row. A contains for every tile x a set of states x1 . split. and 1 (resp. This transformation proceeds on a tile by tile basis and begins with the ﬁrst tile. If A is in state xi . Automata for RE(#. and a number of states which represent the current and next row. . (q1 . 2) can be rewritten using 2 (resp. qn are read. . such that the vertical constraints between xk and xℓ are satisﬁed and such that the horizontal constraints between xℓ and the tile we just placed at the ﬁrst position of the ﬁrst row. . . . For this. and (qh . we construct an NFA(&) A over the empty alphabet (Σ = ∅) which accepts ε if and only if there exists a correct corridor tiling for T . Transitions of type 1 (resp. . q ′ ). 3) new auxiliary states. and is immediately split into states ′ ′ q1 . pn is created. The automaton constructs the tiling row by row. (q1 . q2 . q1 ): States q1 . q1 . b. x3 }. q3 ): States q1 . say xk . . mergesplit. x2 . t. For the second tile of the next row. For example. where n is the length of each row. mediately split into states q1 2 3 ′ ′ 3. say xℓ . . . xj . 4) regular transitions. q2 . q ′ and q ′ . we allow x1 to be replaced by x1 . (recall that an NFA(&) can be in more than one state at once) To do this. q3 . qn . Therefore. the automaton needs to know at any time at which position a tile must be updated. q2 . qh ). an extra set of states p1 . split. in the current row which is represented by x1 in the state set. . are satisﬁed as well. we have to replace the second tile of the current row. q1 . Here. . by a new tile. the states up to position i already represent the . x3 }. . V. this means that the ith tile of the current row is x.
xj . . 2 ≤ m ≤ n. . Mayer and Stockmeyer [79] and Meyer and Stockmeyer [83] already established the expspacecompleteness of inclusion and equivalence for RE(&) and RE(#). From Theorem 42(1) and Theorem 43(1) it then directly follows that allowing both operators does not increase the complexity. xj . the automaton only has to check the vertical constraints with the ﬁrst tile of the previous row. and the horizontal and vertical constraints. extended with counting and interleaving. – ∀xi . f } • Σ=∅ • δ is the union of the following transitions – (s. xm . V. – ∀xi . and the current row is the accepting row. if there exists a correct corridor tiling for T . (xk . . 1 ≤ j ≤ n} ∪ {pi  1 ≤ i ≤ n} ∪ {s. (xj . p1 . xj ∈ X. xi ) ∈ V. p2 . assures that when there is an accepting run of A on ε. It . – (p1 . x1 . f. . and CHAREs. t. b. H. t1 . bn ): From the initial state the automaton immediately goes to the states which represent the bottom row. b1 . n) as follows: • Q = {xj  x ∈ X. m ∈ N. We can now formally construct an NFA(&) A = (Q. split. merge. p(m mod n)+1 . xk ∈ X. mergesplit. s. (xj .66 Complexity of Extended Regular Expressions tiles of the next row. . Clearly. xi ) ∈ V : (p1 .2 Complexity of Regular Expressions We now investigate the complexity of regular expressions. . the automaton has to check the vertical constraint with the mth tile at the previous row. the construction of our automaton. there exists a run of A accepting ε. δ) which accepts ε if and only if there exists a correct corridor tiling for a tiling instance T = (X. xj ): When a tile at i k the mth (m = 1) position has to be updated. x1 ): When the i j ﬁrst tile has to be updated. . this run simulates a correct corridor tiling for T . . xm . and i is the next position to be updated. and the horizontal constraint with the (m − 1)th tile of the new row. in which the updates are always determined by the position pi . the states from position i still represent the current row. Conversely. mergesplit. 1 n 1 n 5. tn . respectively. all states are merged to the accepting state. xi ) ∈ H: m−1 m−1 (pm . f ): When the state set represents a full row (the automaton is in state p1 ).
intersection for RE(#. The lower bound is already known for ordinary regular expressions. (a1 + · · · + an )+ . a#) CHARE(a. For We disregard here the additional restriction used in [9] that every symbol can occur only once.1 We introduce some additional notation to deﬁne subclasses and extensions of CHAREs. where n ≥ 1 and every ai is an alphabet symbol. However. a?) CHARE(a. Bex et al. 1. &) is pspacecomplete.ℓ . 1 . &)s directly to NFAs. the expression a(b + c)∗ d+ (e + f )? is a CHARE. &) is in pspace. &)s. equivalence and inclusion for RE(#. further follows from Theorem 42(1) and Theorem 43(2) that intersection for RE(#.2: Overview of new and known complexity results concerning Chain Regular Expressions. a?. By CHARE(#) we denote the class of CHAREs where also factors of the form (a1 + · · · + an )k. (1) Follows directly from Theorem 42(1) and Theorem 43(1). with k. For instance. Complexity of Regular Expressions inclusion pspace [75] EXPSPACE 67 intersection pspace [75] PSPACE CHARE CHARE(#) CHARE(a. The new results are printed in bold. &) are in expspace. while (ab + c)∗ and (a∗ + b?)∗ are not. unless mentioned otherwise. Proof. &)s but by translating RE(#.2. All results are completeness results. We stress that the latter results could also have been obtained without making use of NFA(#. ℓ ∈ N ∪ {0} are allowed. The class of chain regular expressions (CHAREs) are those REs consisting of a sequence of factors f1 · · · fn where every factor is an expression of the form (a1 + · · · + an ). and 2. (a1 + · · · + an )∗ . a∗ ) CHARE(a. [10] established that the far majority of regular expressions occurring in practical DTDs and XSDs are of a very restricted form as deﬁned next. (2) The upper bound follows directly from Theorem 42(1) and Theorem 43(2). in the case of intersection such a construction should be done in an ontheﬂy fashion to not go beyond pspace. a#>0 ) conp [75] conp [75] coNP in PTIME equivalence in pspace [102] in EXPSPACE in ptime [75] in ptime [75] in PTIME in PTIME np [75] np [75] NP in PTIME Table 5.5. Theorem 44. we prefer the shorter and more elegant construction using NFA(#. or. Although such an approach certainly is possible. (a1 + · · · + an )?.
∀j ≤ 2n . Table 5. we assume that n ≥ 2.2 lists the new and the relevant known results. Proof. We construct two CHARE(#) expressions r1 and r2 such that L(r1 ) ⊆ L(r2 ) if and only if there exists no correct exponential corridor tiling for T. The expspace upper bound already follows from Theorem 44(1). H. • all vertical constraints are satisﬁed: ∀i < m. (λ(i. a?. m} × {1. and a#>0 denotes ak. a∗ denote the factors (a1 + · · · + an ). . x⊥ . H.ℓ . x⊤ . We ﬁrst show that adding numerical occurrence constraints to CHAREs increases the complexity of inclusion by one exponential. j). given a tiling instance. Here. . (λ(i. (a1 + · · · + an )?. The proof for the expspace lower bound is similar to the proof for pspacehardness of inclusion for CHAREs in [75]. Thereto. j). x⊤ . j)) ∈ V . . respectively. 1) = x⊤ . 1) = x⊥ . V.ℓ with k > 0. n) be a tiling instance. . . A correct exponential corridor tiling for T is a mapping λ : {1. . ∀j < 2n . The latter problem is easily shown to be expspacecomplete [104]. x⊤ ∈ X. n) where X is a ﬁnite set of tiles. V. Theorem 45. We proceed with the reduction from expcorridor tiling. H. • all horizontal constraints are satisﬁed: ∀i ≤ m. whether there exists a correct exponential corridor tiling. x⊥ . . . Without loss of generality.68 Complexity of Extended Regular Expressions the following fragments. and. λ(i + 1. . a. We reduce from expcorridor tiling. The expcorridor tiling problem asks. 2n } → X for some m ∈ N such that the following constraints are satisﬁed: • the ﬁrst tile of the ﬁrst row is x⊥ : λ(1. λ(i. x⊥ . • the ﬁrst tile of the mth row is x⊤ : λ(m. with n = 1. we list the admissible types of factors. A tiling instance is a tuple T = (X. The main diﬀerence is that the numerical occurrence operator allows to compare tiles over a distance exponential in the size of the tiling instance. V ⊆ X × X are the horizontal and vertical constraints. and n is a natural number in unary notation. let T = (X. The proof is a reduction from expcorridor tiling. inclusion for CHARE(#) is expspacecomplete. while a# denotes ak. j + 1)) ∈ H. and (a1 + · · · + an )∗ .
2 +1 X ∗ △X△ . This expression detects rows that are too short. Notice that k = O(X2 ). . by strings of the form △R0 △R1 △ · · · △Rm △. The following regular expressions detect strings of this form which do not encode a correct tiling for T : ∗ ∗ • X△ △X 0. contain more than 2n symbols. that is. .2 −1 △ $e$e$ · · · $e$ ∗ ∗ $X△. that is. Thereto. k}. ∗ ∗ • X△ △X 2 +1.$ . Set Σ = X ⊎ {$. (x1 . ek be an enumeration of the above expressions. . We encode candidates for a correct tiling by a string in which the rows are separated by the symbol △.2 −1 △X△ △x⊤ X 2 −1.$ $ Notice that both r1 and r2 are in CHARE(#) and can be constructed in polynomial time. Complexity of Regular Expressions 69 As expspace is closed under complement. This means that k + 1 consecutive $symbols of the remaining 2k $symbols in uwu′ must be matched onto the $symbols in $e1 $e2 $ · · · $ek $. uwu′ ∈ L(r2 ). We are now ready to deﬁne r1 and r2 : k times e k times e r1 r2 = = n n n n ∗ $e$e$ · · · $e$ △x⊥ X 2 −1. the ﬁrst and the last “$” of uwu′ is always matched onto the ﬁrst and last “$” of r2 . (x1 . △}. For ease of exposition. . n n n n n n Let e1 . that is.$ $e1 $e2 $ · · · $ek $X△. contain less than 2n symbols.5. Because of leading and trailing X△ expressions.2 x2 X△ . let L(r1 ) ⊆ L(r2 ). $} by X△. R0 is the bottom row and Rm is the top row. the expspacehardness of inclusion for CHARE(#) follows. L(e) ⊆ L(ei ). . that is. for every x1 . x2 ∈ X.2 −1 △X△ . By assumption. . These expressions detect all violations of horizontal constraints. It remains to show that L(r1 ) ⊆ L(r2 ) if and only if there is no correct tiling for T .2 −1 △X△ △x⊤ X 2 −1. u′ ∈ L($e$e$ · · · $e$) and n n n n ∗ w ∈ △x⊥ X 2 −1.2 −1 △. We ﬁrst show the implication from left to right. for every i ∈ {1.2. . belongs to X 2 . x2 ) ∈ H. . Notice that uwu′ contains 2k + 2 times the symbol “$”. Moreover. It is straightforward that a string w in (†) does not match k ei if i=1 and only if w encodes a correct tiling. ∗ ∗ • X△ x1 x2 X△ . (†) in which each Ri represents a row. These expressions detect all violations of vertical constraints. . ∗ Let e = e1 · · · ek . Moreover. x2 ∈ X. 2 ∗ ∗ • X△ x1 X△ . we denote X ∪ {△} by X△ and X ∪ {△. for every x1 . x2 ) ∈ V . Let uwu′ be an arbitrary string in L(r1 ) such that u. This expression detects rows that are too long.
a#) expression with factors f1 . in which these results ﬁrst appeared. Then. To show the implication from right to left. w does not encode a correct tiling. (2) conphardness is immediate since inclusion is already conpcomplete for CHARE(a. w is matched onto some ei . a?. In the extended abstract presented at ICDT’07. • a? by a[0. . where a is an alphabet symbol. 8]b[2.2 (3) intersection for CHARE(a. (1) equivalence for CHARE(a. w encodes a correct tiling. . where u. we replace successive subexpressions a[i1 .5 a?bb?b?b1.70 Complexity of Extended Regular Expressions Hence. the sequence normal form of aa?a2. let r = f1 · · · fn be a CHARE(a. Proof. . it follows that two CHARE(a. a?. 2 . j1 + j2 ] until no further replacements can be made anymore. and. we replace every factor of the form • a by a[1. j1 ] and a[i2 . Then w ∈ k i=1 L(ei ) and. a#) is in ptime.2 −1 △ of r1 deﬁnes all candidate tilings. 10]. a?. Obviously. fn . u′ ∈ L($e$e$ · · · $e$). As ak. Theorem 46. the above algorithm to compute the sequence normal form of CHARE(a. the complexity was wrongly attributed to lie between pspace and expspace. To this end. the system T has no solution. So. a?) expressions are equivalent if and only if they have the same sequence normal form (which is deﬁned below). a?.2 −1 △X△ △x⊤ X 2 −1. It can then be tested in linear time whether two sequence normal forms are the same. a?. Adding numerical occurrence constraints to the fragment CHARE(a.ℓ is equivalent to ak (a?)ℓ−k . . The sequence normal form is then obtained in the following way. a?) expressions [75]. • ak. a#) is conpcomplete. assume that there is a string uwu′ ∈ L(r1 ) that is not in r2 . (1) It is shown in [75] that two CHARE(a. a#) expressions are equivalent if and only if they have the same sequence normal form. 1]. a#) expressions can be implemented in polynomial time. intersection in np and inclusion in conp. n n n n ∗ As the subexpression △x⊥ X 2 −1. a#) is npcomplete. It remains to argue that the sequence normal form of CHARE(a. hence.7 is a[3. (2) inclusion for CHARE(a. 1]. We call a the base symbol of the factor a[i. a?) keeps equivalence in ptime. a?. a#) expressions can be computed in polynomial time. First. a?. j]. ℓ]. For instance.ℓ by a[k. j2 ] carrying the same base symbol a by a[i1 + i2 .
Formally. • C1 is the set of all bj [kj . We can assume that they are in sequence normal form. Let w = a11 · · · apn .. 1 We compute the Ci as follows. That is. aj with the same base symbol a. . we represent strings w by their sequence normal form as discussed above. Denote bi [ki . Complexity of Regular Expressions 71 We show that the problem remains in conp. • Then. – bh = ai : fh can match the ﬁrst symbol of api . Given a compressed string w and an expression r in sequence normal form. Let r1 and r2 be two CHARE(a. and ∀h < j. To show that L(r1 ) ⊆ L(r2 ). If this is the case we merge these elements to ai+j which gives a proper compressed string. and add ah to the compressed string w. Furthermore. for all i ∈ {2. Lemma 47. To this end.h−1}. It is however possible that in the compressed string there are two consecutive elements ai . .. In particular. We call such strings compressed. we guess a compressed string w of polynomial size for which w ∈ L(r1 ). . where we take each string w as the regular expression deﬁning w. We guess w ∈ L(r1 ) in the following / manner.be =ai−1 ke and max = e∈{j. . n}. ℓi ] n by fi .. Then min ≤ pi−1 ≤ max.5. the following conditions should hold: – j < h: fh occurs after fj in r.h−1}. For every position i of w (0 < i ≤ n). ℓm ]. but w ∈ L(r2 ). p Proof. For each factor a[k. ℓj ] such that a1 = bj . We iterate from left to right over the factors of r1 . . ℓj ] ∈ Ci−1 such that pi−1 ai−1 ∈ fj · · · fh−1 and ai = bh . ℓ] we guess an h such that k ≤ h ≤ ℓ. deciding whether w ∈ L(r) is in ptime. ℓ] of r. this algorithm is capable of guessing every possible string deﬁned by r1 . .. . we compute Ci from Ci−1 . kh = 0. ℓh ] ∈ Ci when there is a fj = bj [kj . .be =ai−1 ℓe . a#)s. .. if be = ai−1 then ke = 0: in between factors fj and fh it is possible to match only symbols ai−1 . and r = b1 [k1 . – Let min = e∈{j. These are all the factors of r which can match the ﬁrst symbol of w.. That is. fj ∈ Ci when ap1 · · · ai−1 ∈ L(f1 · · · fj−1 ) and ai = bj . we deﬁne Ci as a set of factors pi−1 b[k. The following lemma shows that testing w ∈ L(r2 ) can be done in polyno/ mial time... This algorithm gives a compressed string of polynomial size which is deﬁned by r1 . i – ∀e ∈ {j. h − 1}. pi−1 symbols ai−1 should be matched from fj to fh−1 .2. a?. fh = bh [kh . ℓ1 ] · · · bm [km .
if n = n i i i i then ak1 · · · akn ∈ L(r1 ) \ L(r2 ). rn be CHARE(a. a#) expression can never p match a strict superstring of ai i for i ∈ {1. m − 1}. The np algorithm then consists of guessing a compressed string w of polyn nomial size and verifying whether w ∈ i=1 L(ri ). If n L(ri ) = ∅. Finally. we have that m ≤ min{ri   i ∈ {1. and ℓi ≤ ℓi . . . . . n}}. n}. . . rn . . it is immediate that every string in n 1 . . . . Proof. . . . . . . . n}} and. We claim that L(r1 ) ⊆ L(r2 ) if and only if n = n′ and for ′ ′ every i ∈ {1. rn by their sequence normal form. ki ≥ ki . w ∈ L(r) if and only if there is an fj ∈ Cn such that apn ∈ n L(fj · · · fn ). ℓ′ ′ ]. . . . rn . then there exists a string w = ap1 · · · apm ∈ m 1 i=1 n L(ri ) such that m ≤ min{ri   i ∈ {1. . . . ℓ1 ] · · · an [kn . Furthermore. n}. rn . . the lemma follows. a?. . . As in the proof of Theorem 46(2) we represent a string w as a compressed string. ai = a′ . inclusion. Cn can be done in polynomial time. Conversely. and intersection for CHARE(a. (as deﬁned in the proof of Theorem 46) Let r1 = a1 [k1 . no ji can be larger than the largest integer occurring in r1 . .72 Complexity of Extended Regular Expressions Then. . If there exists an i such that ℓi > ℓ′ . Since w is matched by every expression r1 . then n i 1 aℓ1 · · · aℓn ∈ L(r1 ) \ L(r2 ). . this veriﬁcation step can be done in polynomial time by Lemma 47. we exhibit a tractable subclass with numerical occurrence constraints: Theorem 49. . . Lemma 48. and since a factor of a CHARE(a. . equivalence. a?. . ℓ′ ] · · · a′ ′ [kn′ . . . For inclusion. Let r1 . . (3) nphardness is immediate since intersection is already npcomplete for CHARE(a. . We show that the problem remains in np. for each i ∈ i=1 {1. . . rn . with m 1 i=1 ai = ai+1 for every i ∈ {1. Suppose that there exists a string w = ap1 · · · apm ∈ n L(ri ). n}. . Indeed. a#>0 )s in sequence normal form. a#) expressions. . As the latter test and the computation of C1 . since w is matched by every expression r1 . . ji is not larger than the largest integer occurring in r1 . a#>0 ) are in ptime. . . ℓn ] ′ ′ ′ and r2 = a′ [k1 . i ′ . If we represent r1 . . . Proof. let r1 and r2 be two CHARE(a. . The upper bound for equivalence is immediate from Theorem 46(2). . Notice that every number ki and kj is 1 1 n n greater than zero. . . a?) expressions [75]. or if there exists an i such that a = a′ or k < k ′ . . .
x . . . 1. .1 [ki. Finally.x = max{ki. For intersection. if the above conditions hold.x  1 ≤ i ≤ n}. m1 }. j. j ∈ {1. n} and x ∈ {1. ri = ai. and. for every i ∈ {1.mi [ki. Km .5. . ℓi. . . let. Indeed. If ai. n} and x ∈ {1. a#>0 ) in sequence normal form. . . (ii) for every i. . ℓi. if condition (iii) does not hold.x  1 ≤ i ≤ n} ≤ min{ℓi. n}. . .x = aj. then the intersection between ri and rj is empty.x  1 ≤ i ≤ n} for every x ∈ {1. testing conditions (i)–(iii) can be done in linear time. .x  1 ≤ i ≤ n} and ℓj. ai. . . m1 }. .j is greater than zero. We claim that n L(ri ) = ∅ if and i=1 only if (i) m1 = m2 = · · · = mn . j ∈ {1. and x such that ki.x  1 ≤ i ≤ n}. . .mi ] be a CHARE(a. If mi = mj for some i. Finally. . Then the intersection between ri and rj is empty.x = min{ℓi. max{ki. . . then we also have that the intersection between ri and rj is empty.2. n}. . m1 }. . . Notice that every number ki.mi . So assume that condition (i) holds. . take i. . . It is straightforward to test these conditions in linear time.1 i=1 1 where Kx = max{ki.1 ] · · · ai.x for some i. Complexity of Regular Expressions 73 L(r1 ) is also in L(r2 ).x = aj. . . we have that aK1 · · · a1. . . j ∈ {1. . . (iii) for every x ∈ {1.1 . m1 }.m1 is in n L(ri ).
.
XML Schema expands the vocabulary of regular expression by a counting operator. it requires additionally that it is also clear how to go from one position to the next. On the other hand. but not strongly deterministic since it is not clear over which star one should iterate when going from one a to the next.6 Deterministic Regular Expressions with Counting As mentioned before. For example. Strong determinism restricts regular expressions even further. when matching a string from left to right against an expression. (a∗ )∗ is weakly deterministic. intuitively requires that. without looking ahead. The reason is that. it is always clear against which position in the expression the next symbol in the string must be matched. Although the latter example illustrates the diﬀerence between the notions . because it is not clear in advance to which a in the expression the ﬁrst a of the string aaa must be matched. and all other a’s against the rightmost one (similarly for the b’s). the equivalent expression b∗ a(b∗ a)∗ is weakly deterministic: the ﬁrst a we encounter must be matched against the leftmost a in the expression. weak determinism (also called oneunambiguity [17]). the expression (a + b)∗ a is not weakly deterministic. but restricts them by requiring the expressions to be deterministic. For example. Although there is a notion of determinism which has become the standard in the context of standard regular expressions. The most common notion. there are in fact two slightly diﬀerent notions of determinism. Intuitively. we do not know whether the current a we read is the last symbol of the string or not.
except for unary languages. First. counter NFAs (CNFAs). they in fact almost coincide for standard regular expressions. This result prohibits an eﬃcient algorithm translating a weakly deterministic expression into an equivalent strongly deterministic one. Then. Br¨ggemannKlein [15] has shown that any weak u deterministic expression can be translated into an equivalent strongly deterministic one in linear time. Weakly deterministic expressions with counting. the amount of nondeterminism introduced depends on the concrete values of the counters. However. translating regular expressions into NFAs. BruggemannKlein [15] has shown that the Glushkov construction. if such an expression exists. We investigate the natural extension of the Glushkov construction to expressions with counters. are more expressive than strongly deterministic ones. Combining the results on CDFAs with the latter result then also allows to infer better upper bounds on the inclusion and equivalence problem 1 In fact. For instance. This contrasts with the situation of standard expressions where such a linear time algorithm exists [15]. this situation changes completely when counting is involved. not all regular languages are deﬁnable by weakly deterministic expressions with counting.76 Deterministic Regular Expressions with Counting of weak and strong determinism. . converting expressions to CNFAs.3 + b)2. So. the aim of this chapter is an indepth study of the notions of weak and strong determinism in the presence of counting with respect to expressiveness. yields a DFA if and only if the original expression is deterministic. and show that weakly deterministic expressions can be exponentially more succinct than strongly deterministic ones. Indeed. on which they coincide. she u gives a procedure to transform expressions into star normal form which rewrites weakly determinisistic expressions into equivalent strongly deterministic ones in linear time. Second.3 + b)3. as we will show. For instance. on the other hand. the deterministic counterpart of CNFAs. Therefore. However. and investigate the complexity of some related problems. weakly deterministic expressions with counting are strictly more expressive than strongly deterministic ones. but the very similar (a2. succinctness. We show that strongly deterministic expressions with counting are equally expressive as standard deterministic expressions. We ﬁrst give a complete overview of the expressive power of the diﬀerent classes of deterministic expressions with counting.1 However.3 b is not. We also present an automaton model extended with counters.2 b is weakly deterministic. Br¨ggemannKlein did not study strong determinism explicitly. (a2. the algorithm for deciding whether an expression is weakly deterministic is nontrivial [66]. and complexity. We show that the resulting automaton is deterministic if and only if the original expression is strongly deterministic. it is shown that boolean operations can be applied eﬃciently to CDFAs. we investigate the diﬀerence in succinctness between strongly and weakly deterministic expressions with counting.
general regular expressions with counting have also received quite some attention [40. we deﬁne CNFA. It should be said that XML Schema uses weakly deterministic expressions with counting.4 as none of these models allow to derive all results in Sections 6. However. SperbergMcQueen [101] and Koch and Scherzinger [68] introduce a (slightly diﬀerent) notion of strongly deterministic expression with and without counting. Related work.2 we study the expressive power of weak and strong deterministic expressions with counting. and in Section 6. and in Section 6. there are several automata based models for diﬀerent classes of expressions with counting with as main application XML Schema validation. respectively.5. 64. Kilpelainen [63] shows that inclusion for weakly deterministic expressions with counting is coNPhard. it is also noted by SperbergMcQueen [101]. as is the case for weak determinism [66].1. A detailed examination of strong and weak determinism of regular expressions with counting intends to ﬁll this gap. Outline. and Sartiani [24] have investigated the inclusion problem involving subclasses of deterministic expressions with counting. 83]. In Section 6. that “Given the complications which arise from [weakly deterministic expressions].5 we study the Glushkov . Finally. and a lack of a detailed analysis of their diﬀerences when counting is allowed. Apart from the work already mentioned. We introduce a new automata model in Section 6. We follow the semantical meaning of SperbergMcQueen’s deﬁnition. it is shown to be decidable whether a language is deﬁnable by a deterministic regular expression.3 their relative succinctness. Ghelli. Here.77 of strongly deterministic expressions with counting. SperbergMcQueen introduces the extension of the Glushkov construction which we study in Section 6. the seminal paper is by BruggemannKlein and Wood [17] where. Further. Zilio and Lugiez [26]. one of its developers.4 and 6. and SperbergMcQueen [101].4. In Section 6. in particular. it might be desirable to also require that they be strongly deterministic as well [in XML Schema]. by Kilpelainen and Tuhkanen [65]. 37. we provide some additional deﬁnitions.5. Concerning deterministic languages without counting. Conversely. we show that testing whether an expression with counting is strongly deterministic can be done in cubic time.” The design decision for weak determinism is probably inspired by the fact that it is the natural extension of the notion of determinism for standard expressions. and Colazzo. In Section 6. Further. while using the technical approach of Koch and Scherzinger.
1 Preliminaries We introduce some additional notation concerning (deterministic) RE(#) expressions. For an iterator sk. respectively. Therefore. We conclude in Section 6. DREW . We say that an RE(#) r is in normal form if for every nullable subexpression sk. A subexpression rk.ℓ is nullable if and only if k = 0.3 b1. In Sections 6. however.2. b1.6 and 6. 1 1 1 1 For two symbols x. is exactly the notion of determinism as deﬁned in Section 2.2 is a factor of s. lower(sk. let base(sk.ℓ ) := k. v. Let r and s be expressions such that s is a subexpression of r. An RE(#) expression r is weakly deterministic if. Sometimes we will use the following observation. ubw ∈ L(r) and a = b imply that a = b. we denote by lca(x. which follows directly from the deﬁnitions: Remark 50.ℓ ) := ℓ. otherwise it is unbounded. y) the smallest subexpression of r containing both x and y.ℓ . w ∈ Char(r)∗ and all symbols a. A regular language is weakly deterministic with counting if it is deﬁned by some weakly deterministic RE(#) expression. in the expression s = (a0.2 )3.l of r we have k = 0. this intuitively means that a number of iterations of s are suﬃcient to satisfy r. Further.ℓ ) := s.8. we deﬁne here the notion of weak determinism. For instance. Recall that for an expression r. W . and Char(r) denotes the symbols occurring in r. Any RE(#) can easily be normalized in linear time. For a regular expression r. without counting.ℓ is an iterator of r.ℓ is bounded when ℓ ∈ N.3 is not.78 Deterministic Regular Expressions with Counting construction translating RE(#) to CNFA. Deﬁnition 51.7 we consider the complexity of testing whether an expression with counting is strongly deterministic and decision problems for deterministic RE(#). Weak determinism. for all strings u. respectively.4 . and upper(sk. we say that a subexpression of r of the form sk. 6. y of a marked expression r. are denoted by DRE# . r denotes a marking of r. An RE(#) expression r is nullable if ε ∈ L(r). the conditions uav. we write s r when s is a subexpression of r and s ≺ r when s r and s = r. We say that sk. When r and s are iterators. The classes of all weakly deterministic languages with counting. This notion. b ∈ Char(r). we assume that all expressions used in this chapter are in normal form. whenever ﬁrst(s) ⊆ ﬁrst(r) and last(s) ⊆ last(r). Then we say that s is a factor of r. but a0. For completeness.
an expression such as ((a + ε)(b + ε))1. Intuitively. when matching a string from left to right.2 )3. ∞ b Figure 6. a∗ + b∗ is strongly deterministic in our deﬁnition. For a strongly deterministic expression. we can go from a to b by either going forward or by iterating over the counter.1.3 (b0.4 is strongly deterministic as it is always clear over which counter we must iterate. as ε can be matched by both a∗ and b∗ . 1 a · 1 2 d 0.2 )3. but not in theirs. (a2. . Strong determinism. also (a1.1 contains the parse tree of the expression (a0. when matching ab. we follow the semantic meaning of the deﬁnition by SperbergMcQueen [101]. For the deﬁnition of strong determinism.3 b are not weakly deterministic. 3 0.1 )0. (a + b)∗ a and (a2. Thereto. Preliminaries · + 3 79 0.∞ + c0.1: Parse tree of (a0. Intuitively. Figure 6. as we have a choice of counters over which to iterate when reading multiple a’s.1 )0. Counter nodes are numbered from 1 to 4.6. Conversely. although it is weakly deterministic.∞ )d. Therefore. For instance.2 going from a to b implies going forward. an expression is weakly deterministic if. without looking ahead in the string.4 is not strongly deterministic. A bracketing of a regular expression r is a labeling of the counter nodes of 2 The diﬀerence with Koch and Scherzinger is that we allow diﬀerent derivations of ε while they forbid this. while using the formal approach of Koch and Scherzinger [68] (who called the notion strong oneunambiguity)2 . we will additionally require that we always know how to go from one position to the next. we always know against which symbol in the expression we must match the next symbol. we always know where we are in the expression. ∞ c 4 0. when matching a string against the expression from left to right. whereas going from b to a iterates backward over the counter. By the same token. an expression is weakly deterministic if. We denote the parse tree of an RE(#) expression r by pt(r). we distinguish between going forward in an expression and backward by iterating over a counter.∞ )d. while b∗ a(b∗ a)∗ and (a2.3 (b0.∞ + c0. in the expression (ab)1.3 + b)2.2 b are.2 will not be strongly deterministic. For instance.3 + b)3. For instance. Indeed.
as shown in Figure 6. v. we do not allow a derivation of ε in the derivation tree. DREW u forms a strict subclass of the regular languages.80 Deterministic Regular Expressions with Counting pt(r) by distinct indices.3 (([3 b]3 )0. respectively.∞ + ([4 c]4 )0. and every expression in star normal form is strongly deterministic. Figure 6.3 (b0. ]i  i ∈ N}.ℓ of r with index i with ([i s]i )k. For example. Therefore. . and a symbol a ∈ Σ such that uαav and uβaw are both correctly bracketed and in L(r). counting. S DRES . all inclusions from left to right Klein [15].∞ )d.∞ is strongly deterministic with counting. [15] 54 55 DRES = DREW = DRE# = DRE# S W [15] 54 DRES = DREW = DRE# S 59 60 60 REG(Σ) (if Σ = 1) REG(Σ) (if Σ ≥ 2) DRE# W The equality DRES = DREW is already implicit in the work of Br¨ggemannu 3 By this result and by deﬁnition.∞ + c0. for any alphabet Σ. ([1 ([2 a]2 )0. depending on the alphabet size. denoted REG(Σ).∞ )d is a bracketing of (a0. respectively. the paper gives a transformation of weakly deterministic regular expressions into equivalent expressions in star normal form. 6. w over Σ ∪ Γ. Numbers refer to the theorems proving the (in)equalities.1. The complete picture of the relative expressive power of the diﬀerent classes depends on the size of Σ.1 )0. However. 3 Strong determinism was not explicitly considered in [15]. A regular expression r is strongly deterministic with counting if r is weakly deterministic and there do not exist strings u. We say that a string w in Σ ⊎ Γ is correctly bracketed if w has no substring of the form [i ]i . for which the parse tree is shown in Figure 6.ℓ . The class DRE# . denotes all languages deﬁnable by a strongly deterministic expressions with. without. Concretely.2 Expressive Power Br¨ggemannKlein and Wood [17] proved that. The bracketing r of r is then obtained by replacing each subexpression sk.1 ]1 )0. That is. Deﬁnition 52. where Γ := {[i . we simply number the nodes according to the depthﬁrst lefttoright ordering.2: An overview of the expressive power of diﬀerent classes of deterministic regular languages.2. A standard regular expression (without counting) is strongly deterministic if the expression obtained by replacing each subexpression of the form r∗ with r0. a bracketed regular expression is a regular expression over alphabet Σ⊎Γ. strings α = β over Γ.
Denote by r1 and r2 the 1 subexpressions of s that correspond to r1 and r2 .1 . while the intermediate lemmas are only used in proving the former theorems. L(r1 ) = {ε}. 55. we can hence assume the latter. Then.6. EC stands for “eliminate counter” and ECE for “eliminate counter and epsilon”: (a) for all a ∈ Σ. DRES = DRE# .ℓ .ℓ is weakly deterministic. These strings exist as L(r1 ) and L(r2 ) are not equal to {ε}.ℓ . We need a small lemma to prepare for the equality between DRES and DRE# . and 60.1 ) := ECE(r) + ε (f) EC(rk.ℓ by ε. s + ε or ε + s by s0. We recursively transform r into a strongly deterministic expression EC(r) without counting such that L(r) = L(EC(r)). Let r1 r2 be a factor of rk. Theorem 54. Let w = [1 uv]1 . EC(a) := a (b) EC(r1 + r2 ) := EC(r1 ) + EC(r2 ) (c) EC(r1 r2 ) := EC(r1 )EC(r2 ) (d) if r nullable then EC(r0. As in the former case we are done.ℓ is not strongly deterministic. Therefore. Expressive Power 81 already hold. iteratively replacing any subexpression of the form s1. 59. The rest of this section is devoted to proving the other equalities and inequilaties in Theorems 54. S Proof.ℓ denote the bracketed version of rk. Let r be an arbitrary strongly deterministic regular expression with counting. both [1 u]1 [1 v]1 w and [1 uv]1 w are in L(s) and thereby show that rk.ℓ is not strongly deterministic. Indeed. In these rules. Then. the lemma already holds.1 or ε. sε or εs by s. x x+1 Deﬁne x := max{0. Proof. Let rk. If rk.∞ ) := ECE(r) · · · ECE(r) · ECE(r)∗ . and ℓ ≥ 2. Let Σ be an arbitrary alphabet. r2 nullable. L(r2 ) = {ε}. S Lemma 53.ℓ with r1 . Let s := [1 r]k. using the rules below.1 ) := EC(r) (e) if r not nullable then EC(r0. v) be a nonempty string in L(r1 ) (respectively. Then rk. Let u (respectively. We can assume without loss of generality that no subexpressions of the form s1. assume that rk.1 and εk.ℓ be a marked version of rk.ℓ is not weakly deterministic. L(r2 )).2. k − 2}. yields an expression which is still strongly deterministic and which is either equal to ε or does not contain s1.1 or ε occur.
Case (b’) is also trivial.ℓ ) := ECE(r) ECE(r)(· · · ) + ε Similarly as above. The ﬁrst nontrivial case is (c’). The base cases of the induction are (a) EC(a) := a and (a’) ECE(a) := a. and L(a0. for all expressions r to which ECE(r) can be applied. we prove by simultaneous induction on the application of the rules (a)–(g) and (a’)–(g’) that L(EC(r)) = L(r). (a’) for all a ∈ Σ. This is why we need Lemma 53. Here. then ECE(rk. the recursion ECE(r)(· · · )+ε contains ℓ−1 occurrences of r.∞ ) := ECE(r)ECE(r)∗ (g’) if k = 0 and ℓ ∈ N \ {1}.g)). (f). / Towards a contradiction. ECE(a0.1 ) = L((a + ε)(b + ε)). or (ii) r1 r2 is a factor of a subexpression of the form sk. this subexpression is a factor.ℓ ) (e’) ECE(r0. ECE(a) := a (b’) ECE(r1 + r2 ) := ECE(r1 ) + ECE(r2 ) (c’) ECE(r1 r2 ) := EC(r1 )EC(r2 ) (d’) if k = 0. We prove that L(r) = L(EC(r)) and that EC(r) is a strongly deterministic regular expression.1 b0. whenever ECE is applied to a subexpression.1 ) = (a + ε)(b + ε). To show L(r) = L(EC(r)).82 Deterministic Regular Expressions with Counting (g) if ℓ ∈ N\{1} then EC(rk. we will see that ECE will never be applied to such a subexpression. and the factor relation is clearly transitive. If ε ∈ L(r1 r2 ) then (c’) is clearly correct. ECE(r) · · · ECE(r) denotes a kfold concatenation of ECE(r). as EC is always applied to a strongly deterministic expression. These cases are clear.1 b0. the cases (b)–(e) are trivial. For instance. for any strongly deterministic expression r.ℓ ) := EC(rk. and the recursion in ECE(r)(ECE(r)(· · · ) + ε) + ε contains ℓ − k occurrences of r.1 ) := ECE(r) (f’) ECE(r0. Correctness for cases (f) and (g) is immediate from Remark 50. This implies that r1 and r2 are nullable. The latter is important as ECE(r) does not equal L(r) \ {ε} for all r.ℓ with ℓ ≥ 2 (cases (f. The reason that there must always be such a superexpression to which case (e). For the induction.1 with s not nullable (case (e)).ℓ ) := ECE(r) · · · ECE(r)· ECE(r)(ECE(r)(· · · )+ ε) + ε In the last two rules. and that L(ECE(r)) = L(r) \ {ε}. . then ECE(rk. suppose that ε ∈ L(r1 r2 ). Notice that we only apply rule (c’) when rewriting r if either (i) r1 r2 is a factor of some s0. However. r1 r2 must in both cases be a factor since.
there exist marked strings u. with 1 ≤ i < j ≤ k + 1. f(b2 ) = b1 . Hence. k + 1].∞ . which contradicts the precondition of rule (e). we also let f denote its natural extension mapping strings to strings. In case (i). But then. Further. as both f(u)f(x)f(v) and f(u)f(x)f(w) are in L(rk.∞ ) to its corresponding symbol in Char(rk. Second. Indeed. Then. otherwise ECE(r) would not be weakly deterministic.∞ ).∞ and let f : Char(EC(rk. We next prove that EC(r) and ECE(r) are strongly deterministic regular expression. For instance. By abuse of notation. of rk. (d’) is also correct.∞ ) is not weakly deterministic. (f). and r1 + r2 is strongly deterministic. when r = (aba)1. (g). As L(r1 ) = L(EC(r1 )). Furtermore.∞ . consider the strings w1 = dm(uxv) and w2 = dm(uyw) and notice that dm(ux) = dm(uy). Cases (f).∞ . L(r2 ) = L(EC(r2 )). for any i ∈ [1.∞ be a marking of rk. x and y cannot occur in the same ECE(r)i .∞ ) = ECE(r)1 · · · ECE(r)k ECE(r)k+1 . whenever r is strongly deterministic. f(b1 ) = b1 . and f(a1 ) = a1 . Then. f(a2 ) = a2 .∞ ) and let EC(rk. let rk.2. This concludes the proof that L(EC(r)) = L(r). rules (b)–(e) and (b’)–(e’) are also immediate. that EC(rk. By Remark 50. which is also a contradiction. Let EC(rk.∞ ) is not strongly deterministic.∞ ) be the natural mapping associating each symbol in EC(rk. (f’). and thus also the strong determinism. and f(a4 ) = a2 . Expressive Power 83 or (g) is applied is that the recursive process starts by applying EC. We distinguish two cases. In case (ii).6. Therefore. For instance.∞ ) we immediately obtain a contradiction with the weak determinism. for rule (b) we know by induction that EC(r1 ) and EC(r2 ) are strongly deterministic. f(a3 ) = a1 . Cases (e’)–(g’) are also trivial.∞ )) → Char(rk. Now. We only investigate case (f) because these four cases are all very similar. w and marked symbols x and y such that uxv and uyw in L(EC(rk.∞ in the bracketed version of rk. towards a contradiction. EC(r) = (a1 b1 a2 )(a3 b2 a4 )∗ . We prove this by induction on the reversed order of application of the rewrite rules.∞ . First. Let [m be the opening bracket corresponding to the iterator rk.ℓ and therefore r is not strongly deterministic. which are immediate. it follows that also EC(r1 ) + EC(r2 ) is strongly deterministic. or (g) to some superexpression of which r1 r2 is a factor before applying ECE to r1 r2 . Then. assume f(x) = f(y).∞ )) with x = y and dm(x) = dm(y). and assume ﬁrst that this is due to the fact that EC(rk. then EC(r) = (aba)(aba)∗ . The induction base rules are (a) and (a’). contradicting the induction hypothesis. v. Lemma 53 claims that sk. r = (a1 b1 a2 )1.∞ ) denote a marked ∗ version of EC(rk. r1 and r2 nullable implies that r in rule (e) must be nullable (because r1 r2 is a factor of r). Hence we can assume x ∈ Char(ECE(r)i ) and y ∈ Char(ECE(r)j ). and (g’) are more involved. we can construct two . assume f(x) = f(y). we must have applied (e). assume.
∞ by adding brackets to w1 and w2 such that (i) in the preﬁx dm(ux) of w1 .∞ is not strongly deterministic. as this reason concerns the iterators. According to the above. Consider the expression r0.∞ ) is not strongly deterministic.∞ such that ]m2 [m2 is a substring of α2 but not of β2 . By deﬁnition. it must be the case that ECE(r)∗ is not strongly deterministic. . Since ECE(r) is strongly deterministic but ECE(r)∗ is not. We will do so by characterizing the weakly deterministic languages with counting over a unary alphabet in terms of their corresponding minimal DFAs. a contradiction. over a unary alphabet. as dm(ux) = dm(uy) this implies that rk. and • ]m1 [m1 is a substring of α1 but not of β1 . e. j [m brackets occur. this is due to the failure of the second reason in Deﬁnition 52. Let [m1 be the opening bracket corresponding to the iterator ECE(r)∗ . δ(qn−1 . • α1 . Hence. β1 ∈ Γ∗ . But. Shallit [91].∞ and therefore rk.. so that • u1 α1 av1 and u1 β1 aw1 are correctly bracketed and accepted by the brack∗ eted version of ECE(r) .g. and shows that EC(r) is indeed strongly deterministic. it suﬃces to show that. there also exist correctly bracketed strings u2 α2 av2 and u2 β2 aw2 accepted by the bracketed version of r0.∞ is not strongly deterministic. but we repeat them here for completeness. i [m brackets occur (indicating i iterations of the outermost iterator) and (ii) in the preﬁx dm(uy) of w2 . We say that a DFA over Σ = {a} is a chain if its start state is q0 and its transition function is of the form δ(q0 . . (Shallit used tail to refer to what we call a chain. It is easily seen that. This hence leads to the desired contradiction. The following notions come from.84 Deterministic Regular Expressions with Counting correctly bracketed words deﬁned by the bracketing of rk. every strongly deterministic expression is also weakly deterministic. we ﬁrst introduce some notation.) Deﬁnition 56. DRE# = DRE# W S Proof. Thereto. every weakly deterministic language with counting can also be deﬁned by a strongly deterministic expression with counting. . . Let Σ be an alphabet with Σ = 1. Then. if EC(rk. Hence.∞ (which has a lower bound 0 instead of k) and let [m2 be the opening bracket corresponding to its outermost iterator. a) = qn . where qi = qj when i = j. and all subexpressions are strongly deterministic due to the induction hypothesis. A DFA is a chain followed by a cycle if its transition . we can take strings u1 α1 av1 and u1 β1 aw1 . Theorem 55. a) = q1 . This proves that r0.
We say that L is ultimately xperiodic if L is (n0 . The crux of the proof then lies in Lemma 57. any ﬁnite language can clearly be deﬁned by a strongly deterministic expression. If L is ultimately xperiodic. a) = qn+1 . δ(qn . qn+m .ℓ is also ultimately xperiodic.g. k ∈ N ∪ {0}. and ℓ ∈ N ∪ {∞} with k ≤ ℓ. i. moreover. It is well known (see.. . . . However. any ultimately xperiodic language is also ultimately periodic. δ(qn+m . Hence. δ(qn−1 . after which we conclude with the proof of Lemma 57. i. Expressive Power 85 function is of the form δ(q0 . Indeed. L contains the string anx . while an inﬁnite but ultimately periodic language can be deﬁned by a strongly deterministic expression of the form an1 (an2 (· · · ank−1 (ank )∗ · · ·+ε)+ε). a) = qn . .. the string of a’s of length nx. . . The cycle states of the latter DFA are qn . x)periodic for some n0 ∈ N. then Lk.6. for which at most one of the cycle states is ﬁnal. [91]) and easy to see that the minimal DFA for a regular language over a unary alphabet is deﬁned either by a simple chain of states. We say that a unary regular language L is ultimately periodic if L is inﬁnite and its minimal DFA is a chain followed by a cycle. Lemma 58. they have length 0 (modulo x). the opposite does not always hold. can be diﬀerent from 0. Indeed. . . a) = q1 . . . we ﬁrst need a more reﬁned notion of ultimate periodicity and an additional lemma.e. Notice that these notions imply that L is inﬁnite. x)periodic if (i) L ⊆ L((ax )∗ ). a) = qn . . Lemma 57. In an ultimately periodic language only all suﬃciently long string must have the same length y (modulo x). then L ∈ DRE# if and only if W L is either ﬁnite or ultimately periodic. and (ii) for every n ∈ N such that nx ≥ n0 . Before we prove this lemma. where qi = qj when i = j. . . or a chain followed by a cycle.. for a ﬁxed y. x ∈ N. note that it implies Theorem 55.2. Let Σ = {a}. the following lemma adds that for languages deﬁned by weakly deterministic regular expressions only one node in this cycle can be ﬁnal. where ani denotes the ni fold concatenation of a. Thereto. δ(qn+m−1 .e. a) = qn+m . To this fact. Let L be a language. Clearly. We say that L over alphabet {a} is (n0 . in an ultimately xperiodic language all strings have lengths which are multiples of x. e. it only remains to prove Lemma 57. which. and L ∈ REG(Σ).
a⌊n/k0 ⌋x ∈ L and a(⌊n/k0 ⌋+r0 )x ∈ L. . Every string in Lk. that is. Hence. We will show that Lk. and If r has a disjunction r1 + r2 then either L(r1 ) = {ε} or L(r2 ) = {ε}.1 . Take n ∈ N such that nx ≥ k0 (n0 + x). pi ∈ {0. at most one of the nodes is ﬁnal. then it is ultimately periodic. We therefore only need to argue that.2) Indeed. (6. .ℓ ⊆ L((ax )∗ ).ℓ is (k0 (n0 + x). x)periodic.1 · · · )p3 . anx = a⌊n/k0 ⌋x · · · a⌊n/k0 ⌋x ar0 x . Since r is over a unary alphabet. take n0 ∈ N such that L is (n0 . such that L(r) is inﬁnite.ℓ . we want to write r as r = (r1 (r2 (· · · (rn )pn . (6. Our ﬁrst goal is to bring r into a normal form. for every inﬁnite unary language.ℓ is a concatenation of (possibly empty) strings in L. it only has a single occurrence of a. x)periodic. We are now ﬁnally ready to prove Lemma 57 and thus conclude the proof of Theorem 55 Proof.e. it follows that the length of every string in Lk. . W if an inﬁnite language is in DRE# . . we can make the following observations: There are no subexpressions of the form r1 r2 in r in which L(r1 ) > 1. We can assume without loss of generality that r does not contain concatenations with ε. Hence. 1}.1 )p2 . we only need to show that. Let Σ = {a} and let r be a weakly deterministic regular expression with counting over {a}. i. Since L is (n0 . Let k0 := max{k. (c) rn is a tower of counters. Hence. Furthermore.86 Deterministic Regular Expressions with Counting Proof.1) . It is wellknown that. if these statements do not hold. the minimal DFA is a chain followed by a cycle.ℓ is a multiple of x.. in this cycle. n. which implies that anx ∈ Lk. Hence. Lk. any ﬁnite language is in DRE# and any ultiW mately periodic language can already be deﬁned by a strongly deterministic expression. [of Lemma 57] Clearly. can W be deﬁned with a DFA of the correct form. Therefore. x)periodic. (b) r has no occurrences of ε. where the dots abbreviate a k0 fold concatenation of a⌊n/k0 ⌋x and r0 := n mod ⌊n/k0 ⌋. Since the length of every string in L is a multiple of x.1 )p1 . then r is not weakly deterministic. where (a) for each i = 1. ( k0 − 1)x ≥ n0 n since k0 ≥ 1 and therefore ⌊ k0 ⌋x ≥ n0 . anx ∈ Lk0 . More speciﬁcally. and hence is also in DRE# . which proves n the lemma. 1}.
in which at most one of the cycle nodes is ﬁnal.k s1. and merging states q1 and q3 into a new state. Since A1 is a chain. it suﬃces to prove now that we can translate rn into a DFA which is a chain followed by a cycle.1 over alphabet {a. . with 1 ≤ i ≤ n − 1. .2). and concludes our proof.k s. The initial state of A is the initial state of A1 and its ﬁnal state set is the union of the ﬁnal state sets in A1 and A2 .1 )p2 .k )ℓ. is not a singleton.1 )p1 .1 )pn−1 .∞ be the smallest subexpression of rn in which the upper bound is unbounded. Notice that we can translate the regular expression r′ = (r1 (r2 (· · · (rn−1 (X)pn . while preserving incoming and outgoing transitions. Indeed. These replacements preserve the language deﬁned by r and preserve weak determinism. .1 · · · )p3 . X) = q2 (and q2 is ﬁnal in A1 ). Therefore.∞ ) is ultimately xperiodic for some x ∈ N. and (iii) all subexpressions of the form (s + ε) and (ε + s) with s0.) We will ﬁrst prove that L(sk. removing state q2 .2. we know that condition (d) must hold because r is weakly deterministic. at the end of the chain. (Such an expression must exist. and L(rn ) is inﬁnite. Expressive Power (d) L(r1 ). L(rn−1 ) are singletons.1 in which only L(rn ) is inﬁnite. we iteratively replace (i) all subexpressions of the form (sk. X} into a DFA which is a chain and which reads the symbol X precisely once. 87 Notice that. (ii) all subexpressions of the form sk1 . these replacements turn r in the normal form above adhering to the syntactic constraints (a)–(c). this also proves that L(rn ) is ultimately xperiodic.ℓ1 with ak. Let sk.k1 sk2 .1 until no such subexpressions occur anymore. In order to achieve this normal form. It thus only remains to show that L(rn ) is ultimately periodic. Furthermore. then we can obtain A by taking the union of the states and transitions of A1 and A2 .1 )p1 . Hence.2).6.ℓ with skℓ. and thus ultimately periodic. More formally.1 )p2 .1 (iv) all subexpressions of the form ak. and q3 is the initial state of A2 .1) and (6.kℓ .ℓ2 with sk1 +k2 . if r has the normal form and one of L(ri ). if A1 and A2 are the DFAs for r′ and rn respectively.k1 +ℓ2 . Due to Lemma 58 and the structure of rn . Due to (6. then this would immediately violate (6. if q1 is the unique state in A1 which has the transition δ(q1 .1 · · · )p3 . . then we can intuitively obtain the DFA A for r by concatenating these two DFAs. where s = ak1 . we can assume that r = (r1 (r2 (· · · (rn )pn . . since L(rn ) is inﬁnite. L(A) = L(r) is ultimately periodic if and only if L(A2 ) = L(rn ) is ultimately periodic.
or it does not. It remains to verify that ℓ. then L(sk. note that m = yzℓ + y1 and thus m = y(y0 + y1 ) + y1 = y0 y + y1 (y + 1).h2 )k.y2 )h1 .z )k. Let n0 = y 2 xkz. Take any m ≥ y 2 kz.∞ .∞ . Hence.3 (b + ε))∗ .y+1 )z. Then. As r is weakly deterministic. then it is 1 1 n n of the form (((ax.e. To show that there also exists an n0 such that sk. Clearly. as m ≥ y 2 zk. and. For y1 this is clear as y1 = m − yzℓ.88 Deterministic Regular Expressions with Counting It therefore only remains to show that L(sk. we must show that y0 and y1 are positive numbers. y0 and y1 satisfy the desired conditions.∞ due to the replacement (i) above in our normalization.y2 with y1 < y2 . and note that it suﬃces to show that am is deﬁned by ((ay.∞ ) is of the form (ax.y+1 )z. We show that for any m such that mx ≥ y 2 xkz. hence. Let Σ be an alphabet with Σ ≥ 2. We now show that such ℓ. there are two possibilities: either s contains a subexpression of the form (s′ )y1 . y + 1) a’s.x )∗ ).x )k. also y0 + y1 = ℓz. and.x )y. 1 1 1 (((ax. x)periodic. y0 . it suﬃces to show that (((ax.∞ .∞ ) ⊆ L((ax.∞ contains a subexpression of the form (s′ )y1 . L(r) ∈ DRE# . A witness for this strict inclusion is r = (a2. and hence must be at least as big as k. Finally.. By deﬁnition. let y1 = m − yzℓ and y0 = ℓz − y1 . let ℓ be the biggest natural number such that yzℓ ≤ m.y+1 )z. This concludes the proof of Lemma 57. DRE# S DRE# W Proof. Further. and y1 indeed exist. It is immediate that L(sk. the length of every string in L(sk. Thereto. and the hi .∞ deﬁnes a subset of sk. DRE# ⊂ DRE# . y1 such that m = y0 y + y1 (y + 1) and y0 + y1 = ℓz. i. let z = h1 · h2 · · · hn . and observe that. which is a contradiction. it suﬃces to show that the S W inclusion is strict for a binary alphabet Σ = {a.∞ and. For y0 . the latter implies that ℓ is not chosen to be maximal.z )k. amx is deﬁned by (((ax. y0 ≥ yz − y1 .z )k.∞ deﬁnes all such strings of length mx ≥ n0 . If not. to show that m = y0 y + y1 (y + 1). or z = 1 when n = 0.x )y.x )y. 1 i h2 denote a nesting of iterators.x )y1 . W By applying the algorithm of Br¨ggemannKlein and Wood on r for testing u . hence.z )k. as then also yz(ℓ + 1) ≤ m. We already observed that ℓ ≥ k. while y0 (respectively. it must hold that ℓ ≥ y and ℓ ≥ k. To this end. where x can equal 1. If sk.∞ ) is a multiple of x. b}. by deﬁnition of y0 . That is. In this case.y+1 )z. Assuming y0 to be negative implies that yz ≥ y1 = m − yzℓ. Then.∞ deﬁnes all strings of length mx with m ∈ N and mx ≥ n0 . However.h2 ) · · · )h1 . and ℓ is chosen such that yzℓ ≤ m.∞ ) is ultimately xperiodic for some x ∈ N. Theorem 59.∞ ) is clearly (xk.y2 with y1 < y2 . ℓ denotes the number of iterations the topmost iterator does. The latter is the case if there exists an ℓ ≥ k and positive natural numbers y0 . recall that ℓ ≥ y. L(sk. y1 ) denotes the number of times the inner iterator reads y (respectively.
we ﬁrst give a lemma used in its proof.y + ε where k ∈ N. there exists an RE(#) expression r over alphabet {a} which is weakly deterministic and of size O(n) such that every strongly deterministic expression s. suppose that L(r) does not contain ε. prohibiting a W S translation from weakly to strongly deterministic expressions with counting. x ∈ N ∪ {0}. r is a nesting of iterators. Then. Thereto. L((aaa)∗ (a + aa)) ∈ DRE# . It . it can be seen that L(r) ∈ / DREW .y with L(s) > 1 and (x. First. y) = (1. and hence DRE# / W W 6. every language deﬁned by a regular expression with counting is regular. with L(r) = L(s). We prove that the inclusion is strict already for a unary alphabet Σ = {a}. However. constructs this expression. We can easily see that L(r) ∈ / DREW by applying the algorithm of Br¨ggemannKlein and Wood [17] to u L(r). given a weakly deterministic expression known to be equivalent to a strongly deterministic one. 1). / S Theorem 60.6. is of size at least 2n . Then L(r) can be deﬁned by one of the following expressions: (1) (ak. we therefore have that L(r) ∈ DRE# . Before proving this theorem. By Theorem 54.2 we learned that DRE# strictly contains DRE# . Succinctness 89 whether a regular language L(r) is in DREW [17]. r can not have subexpressions of the form sx. However.3. As r is over a unary alphabet.k )x. this is not the case: Theorem 61. and y ∈ N ∪ {∞}. one could still hope for an eﬃcient algorithm which. Lemma 62.y (2) (ak. Proof. it suﬃces to show that the inclusion is strict. consider the expression r = (aaa)∗ (a + aa).k )x. and thus also not in DRES . Hence. For every n ∈ N. DRE# W REG(Σ) Proof. Let r be a strongly deterministic regular expression over alphabet {a} with only one occurrence of a. since r is strongly deterministic. Let Σ be an arbitrary alphabet.3 Succinctness In Section 6. it follows from Theorems 54 and 55 that REG(Σ). However. Since r has one occurrence of a. Clearly.
possibly deﬁning ε. s is not weakly deterministic. (s0. In the ﬁrst case.ℓ a. we can rewrite r as ((ak. The reasons are as follows: • (a).k )ℓ. it cannot have subexpressions of the form sx. ℓ) = (0. Clearly.1 . does not contain an a.k )ℓ.ℓ with (k. 1) and L(s) − {ε} > 1.k )x. These expressions.2N )1. in the second case.y )0.ℓ . We can assume without loss of generality that r does not have subexpressions of the form s1. .1 .2 for N = 2n .1 . say s2 . So.k )x. the size of r is O(n). 1 1 or (ak. except a2N +1 .1 )0. ak. ak1 . the only remaining case is that r is a nesting of iterators.1 )0.y or ((ak. We ﬁrst argue that s is in a similar but more restricted normal form as the one we have in the proof of Lemma 57.ℓ1 ak2 . then (a) if s has a disjunction of the form s1 + s2 then either L(s1 ) = {ε} or L(s2 ) = {ε}.y ) + ε. Setting k := k1 × · · · × kn proves this case. Note that r deﬁnes all strings of length at least N +1 and at most 4N . We are now ready to prove Theorem 61.ℓ2 .y with (x. in fact. Thereto.k2 · · · )kn .ℓ by skℓ. (b) there are no subexpressions of the form s1 s2 in s in which L(s1 ) > 1 or ε ∈ L(s1 ).1 .k1 )k2 . aak. we are done due to the above argument.y . where introduced by Kilpel¨inen [63] when studying the inclusion problem for a weakly deterministic expressions with counting.1 by s0. Again.k )x. If r is of the form s1 + s2 . since r is strongly deterministic.kℓ and (s0. and hence L(s2 ) = {ε}.(b): otherwise. Second.1 .90 Deterministic Regular Expressions with Counting follows immediately that r is of the form ((· · · (ak1 . one of minimal size. [of Theorem 61] Let r be the weakly deterministic expression (aN +1. among those minimal expressions. (c) there are no subexpressions of the form s1. Proof. suppose that L(r) contains ε. by replacing subexpressions of the form (sk.ℓ . 1) and L(s1 ) − {ε} > 1. the lemma is immediate and. then one of s1 and s2 . r can either be rewritten to (ak. (d) there are no subexpressions of the form (s1 )k. If s is minimal and strongly deterministic. let s be a strongly deterministic regular expression for L(r) with a minimal number of occurrences of the symbol a and. if L(s) does not contain ε. y) = (0.kn )x. We prove that every strongly deterministic expression for L(r) is of size at least 2n . Then. Hence.
for each string of the form aN +1 . since then L(p1 p2 ) = 1 and p1 p2 can be rewritten.k . Formally. we can assume without loss of generality that s does not have any disjunctions. where either p1 or p2 is an iterator which contains a conjunction.9 or a7. assume that w1 and w2 end with the same symbol x. towards a contradiction.7 )8. We will now argue that. .3. If p1 contains a conjunction we know due to (b) that L(p1 ) = 1 and then p1 can be rewritten. as desired. the only remaining operators are counting and conjunction. . .6.e.1 )2. let w1 and w2 be two diﬀerent strings in {uN +1 . Indeed.1 since the latter expression has the same or a smaller size as the former ones. .1 which violates the deﬁnition of sm . Notice that sm can still be nontrivial according to (a)–(d).1 due to (d) and since L(sm ) = ∞. p3 cannot be of the form pk.1 where each si (i = 1. We now argue that sm is either of the form s′ or s′′ s′ . u2N }. But this means that sm = p1 (p4 p5 )0. and 91 • (d): otherwise. u2N be the unique strings in L(s) such that ui  = i. It now follows analogously as in the reasoning for p3 above that sm is either of the form s′ or s′′ s′ . 2N ]. Because of (b) and (c). We claim that the last symbol in each ui is diﬀerent. assume w1 to be the shorter string. ℓ ≥ 2.1 . p3 = p4 p5 . such as (a7. for every i ∈ [N + 1. being the innermost expression in the normal form for s above (i. where s′ and s′′ have only one occurrence of the m m m m m symbol a.1 )0. s is not minimal. So. therefore. m − 1) is of the form ak. Due to (a). Let . . by (c). . making the size of s at least N = 2n . let uN +1 . . where s′ and s′′ have m m m m m only one occurrence of the symbol a.2 )0. . its last position is matched onto a diﬀerent symbol in s. it / then follows that s is of the form s = s1 (s2 (· · · (sm )0. and the fact that ε ∈ L(s). . . We already observed above that sm does not have any disjunctions. Succinctness • (c): otherwise. Thereto. . sm should have been p4 p5 ). sm cannot be of the form p1 p2 p3 . . Due 3 to (c).ℓ 4 4 with (k. . 1). p3 cannot be of the form p0. . Towards a contradiction. This implies that s contains at least N occurrences of a. Hence. This shows that sm has at most one conjunction. Hence. . From (a)–(d).1 · · · )0. we can replace every subexpression s1 + s2 or s2 + s1 with (s1 )0.. Suppose. Due to (d). sm is of the form p1 p2 . ℓ) = (0. If p2 is an iterator that contains a conjunction then p2 is of the form p0. when L(s2 ) = {ε}. Without loss of generality. .2 . at least three occurrences of a). that sm has at least two conjunctions (and. which implies that s is not strongly deterministic.7 ((a2. a2N .
. It follows that the size of s is at least 2n . ai+k . To explain the update mechanism formally. and ¬φ1 . A can make a transition (q. φ2 ∈ Guard(C). It then updates the counter variables according to the update π. δ. . φ. α) where q ∈ Q is the current state and α : C → {1. and moves into state q ′ . δ : Q × Σ × Guard(C) × Update(C) × Q is the transition relation. An update is a set of statements of the form cv++ and reset(cv) in which every cv ∈ C occurs at most once. sm−1 and that x must be the rightmost symbol in s. A nondeterministic counter automaton (CNFA) is a 6tuple A = (Q. C. ′ However. A conﬁguration is a pair (q. π. . reads a. Lemma 62 says that it cannot deﬁne such a language. false. . this implies that x cannot occur in s1 . . This is a contradiction.e. this means m that s′ must always deﬁne a language of the form {w  vw ∈ L(r)}. and cv < k are in Guard(C). 6. L(s′ ) hence m contains the strings ai . as sm only contains a single occurrence of the symbol a. when reset(cv) ∈ π. an update π transforms α into π(α) by setting cv := 1. By Update(C) we denote the set of all updates. F : Q → Guard(C) is the acceptance function. Again due to the structure of s and sm . Deﬁnition 63. φ1 ∨ φ2 . denoted Guard(C). q ′ ) whenever it is in state q. ai+k+2 .4 Counter Automata Let C be a set of counter variables and α : C → N be a function assigning a value to each counter variable. q0 ∈ Q is the initial state. τ ) where Q is the ﬁnite set of states. Moreover. C is the ﬁnite set of counter variables. and cv := cv + 1 when cv++ ∈ π and α(cv) < τ (cv). max(A)} is the function mapping counter variables to their current value. and guard φ is true under the current values of the counter variables. Finally.1 · · · )0. that applying the value assignment α to the counter variables results in satisfaction of φ. ai+k+2N for some i ≥ 1 and k ≥ 0. Considering the language deﬁned by s. . we have that true. For φ ∈ Guard(C). and τ : C → N assigns a maximum value to every counter variable. let max(A) = max{τ (c)  c ∈ C}. Due to the structure of s and since both w1 and w2 are in L(s). as follows: for every cv ∈ C and k ∈ N. . . We inductively deﬁne guards over C.1 . in a way we explain next. Otherwise.1 )0. As sm is either of the form s′ or s′′ s′ m m m this implies x occurs in s′ . cv = k. we denote by α = φ that α models φ. Intuitively. Thereto. we introduce the notion of conﬁguration. .. where m v is a preﬁx of dm(w1 ). . . F. . i. the value of cv remains unaltered. . . . then so are φ1 ∧ φ2 . q0 . when φ1 . .92 Deterministic Regular Expressions with Counting s = s1 (s2 (· · · (sm )0. . a.
A conﬁguration γ ′ = (q ′ . there is at most one transition (q. Proof. τ ) can easily be completed.a) φ. F. as nonreachable conﬁgurations have no inﬂuence on the determinism of runs on the automaton. Deﬁne the completion Ac of A as Ac = (Qc . q0 . q ′ ) ∈ δ}. α′ ) immediately follows a conﬁguration γ = (q. as this deﬁnition only concerns reachable conﬁgurations. 5. for all q ∈ Q. Given a CDFA A. qs ) and. α) by reading a ∈ Σ. denoted by A. 1. q ′ ) ∈ δ such that α = φ. is Q + q∈Q log τ (q) + F (q) + θ∈δ θ. 6. α) and symbol a. Counter Automata 93 Let α0 be the function mapping every counter variable to 1. if there exists (q. for every reachable conﬁguration γ = (q. α) is ﬁnal if α = F (q). 2. Moreover. φ. The initial conﬁguration γ0 is (q0 . Deciding whether a CNFA A is deterministic is pspacecomplete. φ. By the same token. it is not straightforward to test whether an CNFA is deterministic (as will also become clear in Theorem 64(6)). We ﬁrst note that a CNFA A = (Q. for any conﬁguration (q. Theorem 64. We denote by L(A) the set of strings accepted by A. a) = {φ  (q. δ c is δ extended with (qs . π. F c . τ c ) with Qc = Q ∪ {qs }. true. A CNFA A is deterministic (or a CDFA) if. δ. a ∈ Σ.6. then so is A. q0 .4. ∅. However. membership for nondeterministic CNFA is npcomplete. with . emptiness for CDFAs and CNFAs is pspacecomplete. π. q ′ ) ∈ δ with α = φ and α′ = π(α). there is always a transition which can be followed by reading a. Given CNFAs A1 and A2 . a CDFA which accepts the complement of A can be constructed in polynomial time. A conﬁguration γ is reachable if there exists a string w such that γ0 ⇒w γ. denoted γ →a γ ′ . Note that. a CNFA A accepting the union or intersection of A1 and A2 can be constructed in polynomial time. π. membership for word w and CDFA A is in time O(wA). when A1 and A2 are deterministic. For a string w = a1 · · · an and two conﬁgurations γ and γ ′ . a. the size of A. A conﬁguration (q. δ c . for any value function α. a ∈ Σ deﬁne Formulas(q. 3. A string w is accepted by A if γ0 ⇒w γf where γf is a ﬁnal conﬁguration. 4. a. a. α) and for every symbol a ∈ Σ. That is. C. we denote by γ ⇒w γ ′ that γ →a1 · · · →an γ ′ . α0 ). we choose not to take them into account. We say that A is complete if. it holds that α = φ∈Formulas(q. The size of a transition θ or acceptance condition F (q) is the number of symbols which occur in it plus the size of the binary representation of each integer occcurring in it. a. C. φ. For q ∈ Q.
for all q ∈ Q. π1 ∪ π2 . φ. (1) Given two complete CNFAs A1 . (s′ . . Therefore. from every reachable conﬁguration only one transition can be followed when reading a symbol. a packet size p. δ. δi . τ1 ∪ τ2 ). F. Fi . s2 ) = F1 (s1 ) ∨ F2 (s2 ). a. we can match the string w from left to right. s2 ). a. m]. 2}. and • F (s1 . It is easily veriﬁed that if A1 and A2 are both deterministic. The latter problem is NPcomplete. q. s2 ) = F (s1 )∧F (s2 ). s′ ))  (si . We now prove the six statements. a. For the intersection of A1 and A2 . where F ′ (q) = ¬F (q). Further. even when all integers are given in unary. . F ′ . and m. we can decide in nondeterministic polynomial time whether a string w is accepted by a CNFA A. δ. Hence. Hence. φ1 ∧ φ2 . . . To show that it is nphard. s′ ) ∈ δi for i = 1 2 i 1. their union can be deﬁned as A = (Q1 × Q2 . only the acceptance condition diﬀers: F (s1 . given a set of weights W = {n1 . and check whether this sequence forms a valid run of A on w. decide whether there is a partitioning of W = S1 ⊎S2 ⊎· · ·⊎Sm such that for each i ∈ [1. qs ). τ ). We will construct a string w and a CNFA A such that w ∈ L(A) if and only if there exists a proper partitioning for W = {n1 . the deﬁnition is completely analogue. where φc = ¬ φ∈Formulas(q. and the maximum number of packets m. (2) Given a complete CDFA A = (Q. 1}. (q1 . (As the initial value of the countervariables is 1. q2 ). Note also that Ac is deterministic if and only extended with F s if A is deterministic. (3) Since A is a CDFA. then A is also deterministic. where Ai = (Qi . Here. δ. τi ). deﬁnes the complement of A. C. F c is F a. φi . F. This is the problem. πi . maintaining at any time the current conﬁguration of A. Finally. q. we do a reduction from bin packing [36]. nk }. When reading the string w the automaton will nondeterministically guess for each ni to which of the m sets ni is assigned (by going to its corresponding state) and maintaining the running sum of the diﬀerent sets in the countervariables. while every transition can be made in time O(A). Hence. nk }.q c (q ) = false. (4) Obviously. A2 . • δ = {((s1 . . C.q . τ ). φa.a) φ. Ci . It suﬃces to guess a sequence of w transitions. m] a state qj and countervariable cvj . . Let Σ = {#. p. the CDFA A′ = (Q. A will have an initial state q0 and for every j ∈ [1. . testing membership can be done in time O(wA). and w = #1n1 ##1n2 # · · · #1nk #. qi . .94 Deterministic Regular Expressions with Counting c the tuples (q. from now on we can assume that all CNFAs and CDFAs under consideration are complete. . we need to make w + 1 transitions. we actually store the running sum plus 1) In the end it can then easily be veriﬁed whether the chosen partitioning satisﬁes the desired properties. we have that n∈Si n ≤ p. C1 ⊎ C2 .
A Petri net is 1safe if M (s) ≤ 1 for every place s and every reachable marking M . 95 • For all i ∈ [1. That is. . A marking is a mapping M : S → N. The expression M →t M ′ denotes that M enables transition t. cvm }. qm }. Counter Automata Formally. The theorem then follows. Mn−1 such that M →t1 M1 · · · Mn−1 →tn M ′ . qi ). . where S and T are ﬁnite sets of places and transitions. A = (Q. q0 . M0 ). If t is enabled at M . (5) We show that emptiness is in pspace for CNFAs. A Petri net is a pair N = (N. and • for all i ∈ [1. As npspace = pspace.4. M2 . true. 1}. . Instead of guessing w at once. The algorithm for the upperbound guesses a string w which is accepted by A. T. E). . and its ﬁring leads to the successor marking M ′ which is deﬁned for every place s by M ′ (s) = M (s) + E(t. . . {cv1 . That is. We show that the emptiness problem for CDFAs is pspacehard by a reduction from reachability of 1safe Petri nets. ∅. M0 ) and a marking M . we hence have a nondeterministic polynomial space algorithm for the complement of the emptiness problem. τ (cvi ) = p + 2. We construct a CDFA A such that L(A) = ∅ if and only if M is not reachable from M0 . {cvi ++}. t). δ. When γ is an accepting conﬁguration. • F (q0 ) = i∈[1. . ∅. τ ) can be deﬁned as follows: • Q = {q0 . is M reachable from M0 . . F. s) − E(s. t) = 1}. A will accept all strings at1 · · · tk such that . 1.m] cvi ≤ p + 1. and have thus shown that L(A) is nonempty. M ⇒σ M ′ denotes that there exist markings M1 . (q0 . and that the marking reached by the ﬁring of t is M ′ . . and deﬁned by •t = {s  E(s. where N is a net and M0 is the initial marking.6. and E ⊆ (S × T ) ∪ (T × S) → {0. m]. and pspacehard for CDFAs. reachability for 1safe Petri nets is the problem. then it can ﬁre. we have guessed a string which is accepted by A. The latter problem is pspacecomplete [32]. #. true. A net is a triple N = (S. and pspace is closed under complement. (qi . #. Since any conﬁguration can be stored in space polynomial in A (due to the maximum values τ on the countervariables). . (qi . does there exist a sequence σ such that M0 ⇒σ M . q0 ) ∈ δ. qi ). it follows that emptiness for CNFAs is in pspace. . such that A can be in γ after reading the already guessed string w. given a 1safe Petri net (N. and for q = q0 . . The preset of a transition t is denoted by •t. Given a (ﬁring) sequence σ = t1 · · · tn . true. m]. A transition t is enabled at a marking M if M (s) > 0 for every s ∈ •t. F (q) = false. we guess it symbol by symbol and store one conﬁguration γ.
The set of states Q = {q0 . the guard φ is used to check whether the preconditions for the ﬁring of this transition are satisﬁed. we set the alphabet of A to {a. tn }. is there exactly one run on N . q1 }. τ ) is deﬁned as follows: • (q0 . . . since npspace = pspace and pspace is closed under complement. The string is accepted if after executing this ﬁring sequence. . there is a transition to every state ti . we guess a string w character by character and only store the current conﬁguration. F. S. Here. tn }. . π. where M ′ is the current marking. δ. . we maintain a countervariable for every place s. S. which can be followed by reading a t and is used to update the countervariables according to the ﬁring of ti . q0 . where M ′ is the current marking. From its initial state q0 it goes to its central state qc and sets the values of the countervariables to the initial marking M0 . A will simulate the working of the Petri net on the ﬁring sequence given by the string. the result follows. from the states ti . s)} ∪ {s++  E(t. the CDFA A = ({q0 . If at any time it is possible to follow more than one transition at once. and . t) > E(t. . q1 ) ∈ δ. From qc . there is one countervariable for every place: C = S. Again. which will have value 1 when M ′ (s) = 0 and 2 when M ′ (s) = 1. and simulates the working of N . A is nondeterministic. q0 . given a 1safe Petri net N = (N. Furthermore. (6) To show that it is in pspace. let N = (S. qc } ∪ T. T. t. τ ) is constructed as follows: M (s)=0 s =1∧ M (s)=1 s = 2. where π = {s++  M0 (s) = 1}. • For every t ∈ T : (q1 . More formally. . We construct a CNFA A which is deterministic if and only if one run is true for N . t1 . This transition can be followed by reading a t if and only if the values of the countervariables satisfy the necessary conditions to ﬁre the transition ti . Therefore. true. the countervariable for place s will have value 1 when M ′ (s) = 0 and 2 when M ′ (s) = 1. To show that the problem is pspacehard. and the updates are used to update the values of the countervariables according to the transition. δ. As in the previous proof. when T = {t1 . Formally. To do this. τ (s) = 2. The automaton now works as follows. qc . and F (q1 ) = • For every s ∈ S. E). To achieve this. a. A = ({q0 . the countervariables correspond to the given marking M . π. • F (q0 ) = false. F. t)}. q1 ) ∈ δ. there is a transition back to qc . The latter problem is pspacecomplete [32]. s) > E(s. M0 ). we associate a transition in A. t} and A will accept strings of the form at∗ . This is the problem.96 Deterministic Regular Expressions with Counting M0 ⇒t1 ···tk M . With every transition of the Petri net. φ. . we do a reduction from one run of 1safe Petri nets. Then. Then. where φ = s∈•t s = 2 and π = {reset(s)  E(s.
Moreover. t. (a1. • For all s ∈ S. a marked symbol x and an iterator c of r we denote by iteratorsr (x. r) = 1 [a1. We ﬁrst provide some notation and terminology needed in the construction below. where π = {reset(s)  E(s. we denote by iteratorsr (x.2 b1 )3. where x and y are marked symbols and c is either an iterator or null.2 ]. then iterators(a1 . cn ] contain a sequence of nested subexpressions.4 . Moreover. y. 1 We now deﬁne the set follow(r) for a marked regular expression r. the states of Gr will be a designated start state plus a state for each symbol in Char(r).6. and iterators(a1 ) = [a1. this set lies at the basis of the transition relation of Gr . we show how an RE(#) expression r can be translated in polynomial time into an equivalent CNFA Gr by applying a natural extension of the wellknown Glushkov construction [15]. c) the list of all iterators of c which contain x.4 ]. . we will always assume that they are ordered such that c1 ≺ c2 ≺ · · · ≺ cn . let iteratorsr (x) be the list of all iterators of r which contain x. As in the standard Glushkov construction. For example. • For all ti ∈ T . a. Intuitively. Therefore. π. s)} ∪ {s++  E(ti . whenever r is clear from the context. b1 ) = [a1. t. qc ) ∈ δ. Therefore. s) > E(s. y. From RE(#) to CNFA • (q0 . For a marked expression r. 1 1 1 1 1 ((a1.2 . y) all iterators of r which contain x but not y. (qc . ti ) ∈ δ. π. we omit it as a subscript. where π = {s++  M0 (s) = 1}. where φ = s∈•ti 97 s = 2. except c itself. .2 b1 )3. For marked symbols x. true. qc ) ∈ δ. they form a sequence of nested subexpressions. c) then contains the information we need for Gr to make a . That is.4 )5. for all q ∈ Q. τ (s) = 2. φ. true. .2 . (a1. A triple (x. c). (ti . .4 )5. y. The set follow(r) contains triples (x. the contribution of this section lies mostly in the characterization given below: Gr is deterministic if and only if r is strongly deterministic. Finally. ti ) > E(ti . as seen in the previous section. • For all ti ∈ T .6 ].6 .2 b1 )3.5. We refer to Gr as the Glushkov counting automaton of r. 6. if r = ((a1. iterators(a1 .2 b1 )3. and • F (q) = true. CDFAs have desirable properties which by this translation also apply to strongly deterministic RE(#) expressions. We emphasize at this point that such an extended Glushkov construction has already been given by SperbergMcQueen [101]. ∅. Note that all such lists [c1 .5 From RE(#) to CNFA In this section. ti )}.
y)). dm(y). true. . • reset([c1 . We next deﬁne the transition function. c)) ∪ update(c). c) are reset by going to y. reset(cv(cn ))}. C. y)) and π := reset(iterators(x. Finally. If c = null. 1 We introduce a counter variable cv(c) for every iterator c in r whose value will always denote which iteration of c we are doing in the current run on the string. . F. . y. the iterators in iterators(x. we test whether we have done an admissible number of iterations of all iterators in which x is located. . y. . y) are reset. y in ﬁrst(s2 ). When iterating over a bounded iterator. y. F (qx ) := valuetest(iterators(x)). τ ). null) for x in last(s1 ). if c equals null. qy ) ∈ δ. F (q0 ) := true if ε ∈ L(r). we have to check that we can still do an extra iteration. c)) ∧ upperboundtest(c) and π := reset(iterators(x. The counter variable is reset to 1. The set of states Q is the set of symbols in r plus an initial state. . We now deﬁne the Glushkov counting automaton Gr = (Q. Formally. For any symbol x ∈ last(r). • upperboundtest(c) := cv(c) < upper(c) when c is a bounded iterator and. • all tuples (x. . When we leave the iterators c1 . upperboundtest(c) := true. otherwise.. q0 . Q := {q0 }⊎ x∈Char(r) qx . . cn we have to check that we have done an admissible number of iterations for each iterator. When iterating over an iterator. because at the time we reenter this iterator. i. and s = sk. . When leaving some iterators. its ﬁrst iteration is started. this transition iterates over c and all iterators in iterators(x. s) for x in last(s1 ).e. . If c = null. .98 Deterministic Regular Expressions with Counting transition from state x to y. y in ﬁrst(s1 ). . . . for all bounded iterators 4 Recall that dm(y) denotes the demarking of y. δ. (q0 . dm(y). qy ) ∈ δ. c) ∈ follow(r). the set follow(r) contains for each subexpression s of r. . F (qx ) := false. their values must be reset. . then φ := valuetest(iterators(x. If c = null. The acceptance criteria of Gr depend on the set last(r).ℓ . For every / element x ∈ last(r). cn ]) := {reset(cv(c1 )). Lastly. π. For all y ∈ ﬁrst(r). φ. . • update(c) := {cv(c)++}. Otherwise. we start a new iteration and increment its number of transitions. then φ := valuetest(iterators(x. and • all tuples (x. We deﬁne a number of tests and update commands on these counter variables: • valuetest([c1 . Here. cn ]) := ci (lower(ci ) ≤ cv(ci )) ∧ (cv(ci ) ≤ upper(ci )). ∅. and s = s1 s2 . Let C = {cv(c)  c is an iterator of r}.4 For every element (x. we deﬁne a transition (qx .
to a marked symbol x with iterators(x) = [c1 . . c) ∈ follow(r). . We usually omit the subscript Gr from word and brackseq when it is clear from the context. As every triple (x.ℓ of r with index i with ([i s]i )k.5. Thereto. φ. . . cn ] and let iterators(y. φ. . . Then.6. c) = [c1 . y) = [c1 . wordGr (t) =]ind(c1 ) · · · ]ind(cn ) [ind(dm ) · · · [ind(d1 ) y. γi−1 →ai γi by using transition ti = (qi−1 . . . αi ). y. we also say that t is generated by (x. . . c) = [d1 . null) ∈ follow(r). . . . We now want to translate transition sequences into correctly bracketed words. . c) in follow(r) corresponds to exactly one transition t in Gr . where tn = (qx . qy ) and is generated by (x. a. L(Gr ) = L(r). n − 1. τ (cv(c)) = lower(c) as there are no upper bound tests for cv(c). This theorem will largely follow from Lemma 66. qy ) and is generated by (x. . for every i = 1. Let iterators(x. . with iterators(y) = [c1 . . with c = null. . . a. . π. we set brackseqGr (σ) = wordGr (t1 ) · · · wordGr (tn ) wordGr (y). and for all unbounded iterators c. φi . Finally. for which we ﬁrst introduce some notation. wordGr (t) =]ind(c1 ) · · · ]ind(cn ) [ind(dm ) · · · [ind(d1 ) y. then wordGr (t) = [ind(cn ) · · · [ind(c1 ) y. . . qy ). . τ (cv(c)) = upper(c) since c never becomes larger than upper(c). From RE(#) to CNFA 99 c. π. we ﬁrst associate a bracketed word wordGr (t) to each transition t of Gr . Here. y. ai . . π. . The goal will be to associate transition sequences of Gr with correctly bracketed words in r. It is accepting if there exists a sequence of conﬁgurations γ0 γ1 . . Let iterators(x. . φ. γi = (qi . y. Theorem 65. γn such that γ0 is the inital conﬁguration. . tn . . a. dm ]. . (ii) If t = (qx . we use ind(c) to denote the index i associated to iterator c. we associate wordGr (x) =]ind(c1 ) · · · ]ind(cn ) . qy ). Recall that the bracketing r of an expression r is obtained by associating indices to iterators in r and by replacing each iterator c := sk. We sometimes also say that σ encodes brackseqGr (σ). . . . dm ]. x) = [d1 . qi ). Notice that brackseqGr (σ) is a bracketed word over a marked alphabet. . γn is a ﬁnal conﬁguration. . πi . cn ] and iterators(y. and. For every RE(#) expression r. y. . . a. In the following. for each i. (iii) If t = (qx . cn ]. for a transition sequence σ = t1 .ℓ . π. Gr is deterministic if and only if r is strongly deterministic. . c). tn of Gr is simply a sequence of transitions of Gr . A transition sequence σ = t1 . We distinguish a few cases: (i) If t = (q0 . Moreover. . Then. Now. the bracketing of r. φ. cn ]. . .
the lemma follows from the induction hypothesis. Proof. and no incoming. assume w ∈ Char(s1 )∗ . by the induction hypothesis. Hence. for a transition sequence σ. Hence. and. let strip(w) denote the word obtained from w by removing all brackets. transitions. w ∈ L(s). there is no / / accepting transition sequence σ on Gs such that brackseq(σ) = w. for all x ∈ Char(s2 ). we can restrict attention to correctly bracketed words below. with brackseq(σ) = w. we ﬁrst ﬁx a bracketing r of r and we assume that all expressions s. First. with brackseq(σ) = x. w ∈ L(s) if and only if w ∈ L(s1 ) if and only if. if ε ∈ L(s2 ). s2 are correctly bracketed subexpressions of r. assume ε ∈ L(s2 ). Hence. / Then.. Then. Let w be a correctly bracketed word. deﬁne run(σ) = strip(brackseq(σ)). i. For the induction below. notice that brackseq(σ) is a correctly bracketed word for every accepting transition sequence σ on Gr . by construction of Gs . we have / that x ∈ last(s) and hence F (qx ) = false. and the fact that ε ∈ L(s2 ). the set of accepting transition sequences of Gs is exactly the union of the accepting transition sequences of Gs1 and Gs2 . and L(s) = x. Lemma 66. Then. (2) L(r) = {run(σ)  σ is an accepting transition sequence of Gr }. Let r be a marked regular expression with counting and Gr the corresponding counting Glushkov automaton for r. By construction of Gs . As the initial states only have outgoing. also s = x. the set of all correctly bracketed words in L(s) is the union of all correctly bracketed words in L(s1 ) and L(s2 ). with ε ∈ L(s2 ). observe that Gs is constructed from Gs1 and Gs2 by identifying their initial states and taking the disjoint union otherwise. • s = s1 + s2 .100 Deterministic Regular Expressions with Counting For a bracketed word w. First. Further.e. It is easily seen that the only accepting transition sequence σ of Gs consists of one transition t. w contains only symbols of s1 . Then. • s = s1 · s2 . (1) For every string w. • s = x. s1 . We ﬁrst prove (1) by induction on the structure of r. there is an accepting transition sequence σ on Gs1 . We distinguish a few cases. we have that w is a correctly bracketed word in L(r) if and only if there exists an accepting transition sequence σ of Gr such that w = brackseq(σ). brackseq(σ) = w if and only if σ is also an accepting transition sequence . This is due to the fact that. Clearly.
π. with n ∈ [k. . . We claim that σ = t1 . . Then. by deﬁnition. . ti i . tn n . the target state of m 1 i t1 be qyi and the target state of ti i be qxi . . t2 . tn+2 . . σn on Gs1 encoding wi . with w1 ∈ Char(s1 )∗ and w2 ∈ Char(s2 )∗ . The case w ∈ Char(s2 )∗ can be handled analogously. . . . by the induction hypothesis. t1 . t3 . If w is not of the form w = w1 w2 .ℓ] . Let σ1 = t1 . n]. . y. there exist accepting transition sequences σ1 (resp. note that after following transition t. by the induction hypothesis. as they were after following tn+1 in Gs2 . φ. It now m m m 1 2 2 2 suﬃces to show that σ is an accepting transition sequence on Gs and . . . with qx the target state of tn .5. Then. . ℓ]. for i = 1. null) ∈ follow(s). t2 .ℓ and w be a correctly bracketed word. . s) ∈ follow(s). we show that there is an accepting transition sequence σ on Gs encoding w. t2 2 . σ2 ) on Gs1 (resp. . . tn . σ is an accepting transition sequence on Gs . . . This can be done using the same reasoning as above. the guard of t. Conversely. and qy the target state of tn+1 . Hence. . and wi ∈ L(s1 ). Hence. . . Further. . w ∈ L(s) if and only if wi ∈ L(si ). . As w is correctly bracketed. . n − 1]. to see that brackseqGs (σ) = brackseqGs1 (σ1 )brackseqGs2 (σ2 ) it suﬃces to observe that wordGs1 (qx )wordGs2 (tn+1 ) = wordGs (t). . For all i ∈ [1. . the fact that σ1 is an accepting transition sequence on Gs1 ensures that t can be followed in Gs . This settles the case w ∈ Char(s1 )∗ . Hence. This ensures that σ1 and σ2 can indeed be composed to the accepting transition sequence σ in this manner. t1 1 . Further. n]. w2 ). . To see that σ is an accepting transition sequence on Gs note that the guard F (qx ) (in Gs1 ) is equal to φ. . . let m ti be the unique transition generated by (xi . . and deﬁne σ = t1 . qy ) be the unique transition of Gs generated by the tuple (x. we need to show that any such transition sequence σ can be decomposed in accepting transition sequences σ1 and σ2 satisfying the same conditions. Let s = ([j s1 ]j )k. . for each i ∈ [1. . for all i ∈ [1. tn . let t = (qx . 2 if and only if. • s = s1 [k. . From RE(#) to CNFA 101 on Gs . n]. Finally. consider the case that w contains symbols from both s1 and s2 . a. assume w is of this form. For all i ∈ [1. Gs2 ) encoding w1 (resp. . y i+1 . as desired.6. then we immediately have that w ∈ L(s) nor can there / be a transition sequence σ on Gs encoding w. with brackseqGs (σ) = w. tm . t. tm is an accepting transition sequence on Gs with brackseq(σ) = brackseq(σ1 )brackseq(σ2 ). wi = ε. . all counters are reset. for some mi . . there exist valid transition sequences σ1 . . . Further. tn−1 . . let σi = ti . assume w ∈ L(s). tn and σ2 = tn+1 . we can write w = [j w1 ]j [j w2 ]j · · · [j wn ]j . First. . and hence brackseq(σ) = w.
This is both due to the fact that we do n such iterations. we have k = 0. by induction. The following lemma. and wordGs (t) = wordGs1 (t). Therefore. which we state without proof. s) ∈ follow(s). note that.e. note that we simply execute the diﬀerent transition sequences σ1 to σn one after another. and that we ﬁnally arrive in an accepting conﬁguration. It suﬃces to note that σ contains n transitions t1 to tn . Therefore.ℓ with s nullable. and the third is immediate from the deﬁnitions. wordGs (ti ) = wordGs1 (xi )]j [j wordGs1 (ti+1 ). To see that σ is an accepting transition sequence. To prove the second point. assume that there exists an accepting transition sequence σ on Gs encoding w. it suﬃces to observe that L(s) = {strip(w)  w ∈ L(s) and w correctly bracketed} = {strip(brackseq(σ))  σ is an accepting transition sequence on Gs } = {run(σ)  σ is an accepting transition sequence on Gs } Here. for all i ∈ [1. with n ∈ [k. n − 1]. i. However. the ﬁrst equality follows immediately from Lemma 67 below. it suﬃces to observe that wordGs (t1 ) = [j wordGs1 (t1 ). y. by means of the transitions t1 to tn . generated by a tuple of the form (x. This allows to decompose σ into n accepting transition sequences σ1 to σn and apply the induction hypothesis to obtain the desired result. As these transitions reset all counter variables (except cv(s)) the counter variables on each of these separate runs always have the same values as they had in the runs on Gs1 . ℓ]. as required by the normal form for RE(#). for all other transitions t 1 occurring in σ. w = [j brackseqGs1 (σ1 )]j [j brackseqGs1 (σ2 )]j · · · [j brackseqGs1 (σn )]j . L(s) = {run(σ)  σ is an accepting transition sequence on Gs }. This can be done using arguments analogous to the ones above.102 Deterministic Regular Expressions with Counting brackseq(σ) = w. is immediate from the deﬁnitions. wordGs (tn n ) m 1 1 n = wordGs1 (tmn )]j . separated by iterations over the topmost iterator s. and each iteration increments cv(s) by exactly 1. The reasons are analogous to the ones for the constructed transition sequence σ in the previous case (s = s1 · s2 ). ℓ]. . for subexpressions sk. notice that it is important in this lemma that. We must show w ∈ L(s). Conversely. it suﬃces to argue that the transitions ti can safely be followed in the run σ. To see that brackseqGs (σ) = w. for some n ∈ [k. the second equality from the ﬁrst point of this lemma.
suppose r is not strongly deterministic. .5. For the right to left direction. v. and hence there exists words u. i+1] be the smallest number such that i+1 ′ ′ tj = tj . w and symbols x. Then. and symbol a ∈ Σ such that uαav and uβaw are correctly bracketed and in L(r). which must exist as ti+1 = ti+1 . let i = strip(u) and observe that word(ti+1 ) = αx and word(t′ ) = βx. as ti+1 is a transition to qx . . tn and σ2 = t′ . . j Hence. Here. tn . . and i+1 thus as word(ti+1 ) = word(t′ ) also ti+1 = t′ . ti+1 = t′ . v. and w and x such that uαxv and uβxw are correctly bracketed and in L(r). w over Σ ∪ Γ. for any two transitions t and t′ . Then. Consequently. with t = t′ who share the same source . Let r ∈ RE(#). Thereto. both words must use the same marked symbol x for a as r is weakly deterministic. and (2) L(Gr ) = {dm(run(σ))  σ is an accepting transition sequence on Gr } by deﬁnition of Gr and the run predicate. [of Theorem 65] Let r ∈ RE(#).e. From RE(#) to CNFA Lemma 67. σ2 = t1 . but there exist words u. by Lemma 66. We are now ready to prove Theorem 65. But then. and r be the bracketing of r. we show that then Gr is not deterministic. u is either empty or ends with a marked symbol). t′ such that brackseq(σ1 ) = uαxv and brackseq(σ2 ) = n uβxw. by Lemma 66. . by Lemma 66. Gr is not deterministic. there exist words u. or it is but violates the second criterion in Deﬁnition 52. Then. Let u = dm(ux). Without loss of generality we can assume that α and β are chosen to be maximal (i. . As. Then. and Gr its corresponding counting Glushkov automaton. observe that (1) L(r) = {dm(w)  w ∈ L(r)}. r a marking of r. words α = β over Γ. and uyw ∈ L(r). there exist transition sequences σ1 = t1 . By reasoning exactly as in i+1 i+1 the previous case. it then follows that Gr is not deterministic. there exist accepting transition sequences σ1 = t1 . First. suppose r is not weakly deterministic. . we ﬁrst make two observations. . i+1 and t′ to qy . v. We ﬁrst show that L(r) = L(Gr ).6. . Let j ∈ [1. suppose r is weakly deterministic. we have that Gr can be in some conﬁguration γ and can both follow transition tj and t′ while reading the jth symbol of u. y. . Here. . We next turn to the second statement: r is strongly deterministic if and only if Gr is deterministic. 103 Proof. . r can be not strongly deterministic for two reasons: either it is not weakly deterministic. L(r) = {strip(w)  w ∈ L(r) and w is correctly bracketed}. Then. . . . L(r) = {run(σ)  σ is an accepting transition sequence on Gr }. by deﬁnition of r. t′ such that run(σ1 ) = uxv and m 1 run(σ2 ) = uyw. . Let u = i. Before proving the converse direction. we hence obtain L(r) = L(Gr ). Second. First. with qx = qy . with x = y but dm(x) = dm(y) such that uxv ∈ L(r). after reading the preﬁx of u of length j − 1.
.. deﬁning strong determinism. v. The latter is due to the upper bound tests present in Gr . c′ ) ∈ follow(r). when y = y ′ . and a. y suﬃcient to show that r is not weakly deterministic. we provide Algorithm 1 for testing whether a given expression is strongly deterministic. . . y. β words over Γ. c) ∈ follow(r) and t′ on (x. both run(σ1 ) ∈ L(r) and run(σ2 ) ∈ L(r). v. there is some string which brings γ to an accepting conﬁguration. . We now distinguish two cases. The second observation we make is that Gr is reduced. dm(w). we have that word(t) = word(t′ ). is of a semantic nature. and hence. as x = y. assume that Gr is not deterministic. But then. i.e. respectively). y. violate the condition in Deﬁnition 52 and hence show that r is not strongly deterministic. y. . t′ (for uav and uaw. Now. with α. Indeed. . and hence also not strongly deterministic. word(tu+1 ) = αx = word(t′ u+1 ) = βy. By Lemma 66. In particular. 6. and the target of t′ is qy′ .104 Deterministic Regular Expressions with Counting state qx . y. null) ∈ follow(r) and t′ on (x. we know that t is computed based on either (1) (x. Writing run(σ1 ) = uxv and run(σ2 ) = uyw (such that u = u. are both correctly bracketed words in L(r). Kilpel¨inen a and Tuhkanen [66] give a cubic algorithm for RE(#). for every reachable conﬁguration γ. it holds that α = β. Second. and brackseq(σ2 ) = uαy w. and x and y their associated marked symbols. As Gr is reduced and not deterministic there exist words u. First. with c = c′ . β. Therefore. To decide weak determinism. w and symbols x. We show that r is not strongly deterministic. v = v and w = w) we then obtain the strings u. . Further. which runs in cubic time. writing brackseq(σ1 ) = uαxv. c) ∈ follow(r) or (2) (x. α. uaw ∈ L(Gr ) witnessed by accepting transition sequences σ1 = t1 .6 Testing Strong Determinism Deﬁnition 52. by Lemma 66. w and symbol a such that uav. Note that dm(x) = dm(y) = a. for all i ∈ [1. . tn and σ2 = t′ . assume x = y. m 1 such that ti = t′ . Hence. while Br¨ggemannu Klein [15] gives a quadratic algorithm for RE by computing its Glushkov . dm(v). assume x = y. that then word(t) = αy and word(t′ ) = βy ′ . but tu+1 = t′ i u+1 . Let qx be the target of ′ transition tu+1 and qy the target of transition tu+1 . we can see that dm(u). Otherwise. We then consider brackseq(σ1 ) and brackseq(σ2 ) which. note that tu+1 and t′ u+1 share the same source state. we immediately have word(t) = word(t′ ). by the ﬁrst observation above. u]. if y = y ′ . observe that if the target state of t is qy . In both cases it is easily seen that α = β.
Recall that we write s when s is a subexpression of r and s ≺ r when s r and s = r. while linear in the output automaton (consider. Moreover.g. We next show that Algorithm 1 is correct. with dm(y) = dm(y ′ ). r ← marked version of r 2: Initialize Follow ← ∅ Compute ﬁrst(s). (x. Let r ∈ RE(#). Only when the alphabet is ﬁxed. y ∈ ﬁrst(s1 )} 12: if F ∩ Follow = ∅ then return false if s = s1 s2 or s = sk. c) ∈ follow(s). dm(y))  x ∈ last(s1 ). Proof. By Theorem 65 it suﬃces to show that the algorithm isStrongDeterministic (r) returns true if and only if Gr is deterministic. 5 .6. isStrongDeterministic (r) returns true if and only if r is strongly deterministic. r a marking of r. in bottomup fashion do 6: if s = s1 s2 then if last(s1 ) = ∅ and ∃x. y ∈ ﬁrst(r) with x = y and dm(x) = dm(y) then return false for each subexpression s of r. it runs in time O(r3 ). r Theorem 68.. For any r ∈ RE(#). Thereto. Returns true if r is strong deterministic. y. (a1 + · · · + an )(a1 + · · · + an )). last(s). This is the case if and only if there exist marked symbols x.6. y. we ﬁrst extract from Algorithm 1 the reasons for isStrongDeterministic (r) to return false. e.ℓ] else if s = s1 . y ′ ∈ ﬁrst(s) and y = y ′ (Line 4) 2. Testing Strong Determinism 105 Algorithm 1 isStrongDeterministic. is the Glushkov automaton of a deterministic expression of size linear in the expression. such that either 1.ℓ . with ℓ ≥ 2 then 10: if ∃x. false otherwise. y. y = y ′ and upper(c) ≥ 2 (Line 10) There sometimes is some confusion about this result: Computing the Glushkov automaton is quadratic in the expression. with ℓ ≥ 2 and k < ℓ then 1 14: Follow ← Follow ⊎ F return true automaton and testing whether it is deterministic5 . (x. y ′ . and Gr the corresponding Glushkov counting automaton. for all subexpressions s of r 4: if ∃x. y ′ . c) ∈ follow(s). y ∈ ﬁrst(s2 )} [k. y ∈ ﬁrst(s1 ) with x = y and dm(x) = dm(y) then return false F ← {(x. dm(y))  x ∈ last(s1 ). y ∈ ﬁrst(s1 ) with x = y and dm(x) = dm(y) then return false 8: F ← {(x.
α) with α(cv(cm )) = lower(cm ).ci−1 ] and α = upperboundtest[ci ] . i.. Hence. . y ′ . y. let iterators(x) = [c1 . and upper(c) > lower(c) (Line 12) 4. i − 1]. c ≺ c′ .106 Deterministic Regular Expressions with Counting 3.. To see that the transition generated by (x. c) ∈ follow(s).. for all m ∈ [j. We need to show that the transitions generated by (x. . cn ] and. set c = ci and c′ = cj . The reachability of this conﬁguration γ will always follow from Lemma 69. y ′ . Case 3: Set γ = (qx . y ′ . y = y ′ and lca(x. In each of the six remaining there are always two distinct tuples in follow(s) which generate distinct transitions with as source state qx and target states qy and qy′ . y) ≺ c′ and upper(c′ ) ≥ 2 (Line 12) 5. j − 1]. (x.. c′ ) ∈ follow(s). (x. y ′ ∈ ﬁrst(s) and y = y ′ . y. c) ∈ follow(s). y. Suppose case 1 holds. (x. c′ ). also α = valuetest[c1 . and hence i < j. y ′ )... Gs is not deterministic. α) from which both transitions can be followed by reading dm(y). n]. null) ∈ follow(s). y. null) ∈ follow(s). The latter because upper(ci ) ≥ 2 > α(cv(ci )). y ′ . . (x. These transitions can both be followed when the ﬁrst symbol in a string is dm(y). and upper(c) > lower(c) (Line 12) 6. upper(c) ≥ 2. lca(x. m]. y. (x. Gs can also follow the transition generated by (x. and dm(y) = dm(y ′ ). This is due to the fact that α = valuetest[c1 .ci−1 ] and α = upperboundtest[ci ] . and lca(x. y) ≺ lca(x. y ′ ) (Line 12) We now show that Gs is not deterministic if and only if one of the above seven conditions holds. y) = lca(x. (x.cj−1 ] and α = upperboundtest[cj ] as upper(c′ ) ≥ 2 > α(cv(cj )). for all m ∈ [1. y. In each of the following cases. The latter due to the fact that upper(c) > lower(c) = α(cv(ci )). for all m ∈ [i. in each of the cases. (x. but its precise form depends on the particular case we are in. null) ∈ follow(s). y. c) can be followed. and α(cv(cm )) = 1. . (x. upper(c′ ) ≥ 2. c′ ) ∈ follow(s). it suﬃces. Hence. we always have c ≺ c′ . α) with α(cv(cm )) = lower(cm ). with qy = qy′ . there is a transition from q0 to qy and one from q0 to qy′ .... y. null) ∈ follow(s)... and upper(c) ≥ 2. c) can both be followed from γ. for all m ∈ [1. y ′ ) (Line 7) 7. when applicable.. y ′ . (x. to construct a reachable conﬁguration γ = (qx .. y ′ . . Therefore. We ﬁrst verify the right to left direction by investigating the diﬀerent cases. (x. null) ∈ follow(s). Case 2: Set γ = (qx . y ′ . c) and (x. null) ∈ follow(s). When both c and c′ occur.e. Then. On the other hand. and α(cv(cm )) = 1. c ≺ lca(x. note that again α = valuetest[c1 .
and using the facts that (1) upper(c′ ) ≥ 2 and (2) iterators(x. for some i′ < j (as lca(x. y) = lca(x. y. y ′ . c) and t2 by (x. We argue that in this case the conditions for one of the seven cases above must be fulﬁlled. if γ = (q0 . 4. y) = c. We next turn to the left to right direction. lca(x. we can see that the conditions in case 1 hold. null) and t2 by (x. Thereto. y) and lca(x. n]. c′ ) with lca(x. However. ci′ ]. y ′ ). There are indeed six possibilities. Arguing as before it can be seen that in each case both transitions can be followed from this conﬁguration. t1 is generated by (x. α). and that lca(x. . null) and c ≺ lca(x. y ′ . y) ≺ c′ . null) with lca(x. α) then t1 and t2 must be the initial transitions of a run. t1 is generated by (x. y) ≺ c′ ). t1 is generated by (x. Testing Strong Determinism 107 Case 4: Set γ = (qx . As. and c′ ) contain x. null) with lca(x. 3. y ′ . Cases 5. y ′ . First. hence forcing isStrongDeterministic (s) to return false. 6. lca(x. y. j − 1]. α) with α(cv(cm )) = lower(cm ). y ′ ) are always subexpressions whose topmost operator is a concatenation. c) and t2 by (x. 6. these six cases only give us the diﬀerent possibilities of how the transitions t1 and t2 can be generated by tuples in follow(s). Therefore. 7. ignoring symmetries. . We will investigate the tuples in follow(s) which generated t1 and t2 and show that they fall into one of the six remaining cases mentioned above. for instance. and α(cv(cm )) = 1. α) with α(cv(cm )) = lower(cm ). y. 5. y ′ ). t1 is generated by (x. c′ ) with c ≺ c′ . we can deduce that Gs can again follow both transitions. assume γ = (qx . c) and t2 by (x. y. We still need to argue that the additional conditions imposed by the cases mentioned above apply. null) and t2 by (x. y ′ . y). and 7: In each of these cases we can set γ = (qx . t2 can be followed by reading a symbol a. y) = [c1 . Note that all subexpressions under consideration (i. These are the reasons why all these expressions are in a subexpression relation. y. we ﬁrst note that for all iterators c and c′ under consideration.6.6. t1 is generated by (x. as there are no transitions returning to q0 . when x ∈ ﬁrst(s). for all m ∈ [j. . . n]. y) ≺ lca(x. for all m ∈ [1.e. y. with x ∈ Char(r). y ′ . furthermore. Arguing as above. y ′ ). Assume Gs is not deterministic and let γ be a reachable conﬁguration such that two distinct transitions t1 . t1 is generated by (x. q0 has exactly one transition to each state qx . y). lca(x. . for all m ∈ [1. and why we never have. which we immediately classify to the case they will belong to: 2. c. null) and t2 by (x. c).
Further. Thereto. Let x ∈ Char(r). we only provide some intuition by constructing. let w0 be a string such that there exists a u such that w0 xu ∈ L(base(c1 )). the guard of transition t1 contains the condition cv(c) < upper(c) as upperbound test. the for loop iterates over a linear number of nodes in the parse tree of r. for any i ∈ [1. and α(cv(c)) = 1. which is immediate from the fact that t1 = t2 . n]. .t. length) string such that wi ∈ (Char(ci+1 ) \ Char(ci ))∗ and there exist u ∈ Char(ci )∗ and v ∈ Char(ci+1 )∗ such that wn uxv ∈ L(ci+1 ). Then. connects the diﬀerent iterators. be a conﬁguration such that α(cv(ci )) ∈ [1. n − 1]. Computing the ﬁrst and last sets for each subexpression s of r can easily be done in time O(r3 ) as can the test on Line 4. a string w which brings Gs from its initial conﬁguration to γ. such a transition t can never be followed and is not relevant. It only remains to prove the following lemma: Lemma 69. for any transition t generated by (x. upper(ci )]. . implies that t1 and t2 are generated by the same tuple in follow(s) and are hence equal. whereas the guard of transition t2 contains the condition cv(c) ≥ lower(c). Altogether this yields a quadratic algorithm. c) the guard φ contains the condition upperboundtestc := cv(c) < 1. However. These can only simultaneously be true when upper(c) > lower(c). In both cases. n]. This settles the correctness of the algorithm. which can never be true. assuming y = y ′ . we argue on a case by case basis. for every i ∈ [1. for every i ∈ [1. . This already shows that possibilities 4 and 7 above. For the additional cases.r. Proof. Then. intuitively. This lemma can easily be proved by induction on the structure of r. Concatenating such a string vi with itself allows to iterate ci . for i ∈ [1. To do each iteration of the loop in quadratic time. We further deﬁne. n].t.r. Thereto. Indeed. Similarly. We require these strings . Let γ = (qx . length) string such that wn ∈ (Char(s)\Char(cn ))∗ and such that there exist u ∈ Char(cn )∗ and v ∈ Char(s)∗ such that wn uxv ∈ L(s). In each iteration we then need to do at most a quadratic number of (constant time) lookups and writes to the table. y. α). suppose for instance upper(c) = 1. Cases 3 and 5: We need to show upper(c) > lower(c). . one needs to implement the set Follow as a twodimensional boolean table. a marked string wi which. We conclude by arguing that the algorithm runs in time O(r3 ). We construct w by concatenating several substrings. Indeed. respectively. as this is a bit tedious. given such a conﬁguration γ. for all other countervariables c. γ is reachable in Gr . Finally. cn ]. let vi be a nonempty string in L(base(ci )). Cases 2 and 6: We additionally need to show y = y ′ . and iterators(x) = [c1 . let wn be a minimal (w. let wi be a minimal (w. Hence. indeed imply cases 4 and 7.108 Deterministic Regular Expressions with Counting upper(c) ≥ 2 and upper(c′ ) ≥ 2 must surely hold.
intersection for RE(#) is pspacecomplete (Chapter 5) 2.7 Decision Problems for RE(#) We recall the following complexity results: Theorem 70. 1. By using the results of the previous sections we can. A such 2 that A′ accepts the complement of A2 . For the upper bound this is trivial. for DRE# . W By combining (1) and (2) of Theorem 70 we get the complexity of inter# section for DREW and DRE# . (1) We show that inclusion for DRE# is in pspace. give a pspace upperbound for both problems. inclusion and equivalence for RE(#) are expspacecomplete [83].7. which by Theorem 65 are indeed deterministic. (2) The result for DRE# is immediate from Theorem 70(1) and (2). which can be decided using only polynomial space by Theorem 64(1). Finally. intersection for DREW is pspacecomplete [75]. (1) equivalence and inclusion for DRE# are in pspace. whereas S the lower bound follows from the fact that standard weakly deterministic regular expressions can be transformed in linear time into equivalent strongly deterministic expressions [15]. unfortunately.6. we construct the CDFAs A′ . inclusion and equivalence for DREW are in ptime. inclusion for DRE# is conphard [64]. 3. 2 2 This can all be done in polynomial time and preserves determinism according to Theorem 64. Theorem 71. S however.5. Then. Given two S strongly deterministic RE(#) expressions r1 . L(r1 ) ⊆ L(r2 ) if and only if L(A) = ∅. A2 for r1 and r2 using the construction of section 6. S W Proof. Decision Problems for RE(#) 109 w1 to wn to be minimal to be sure that they do not allow to iterate over their corresponding counter. . the desired string w is dm(wn )(vn )α(cv(cn )) dm(wn−1 )(vn−1 )α(cv(cn−1 )) · · · dm(w0 )dm(x) . we construct two CDFAs A1 . Then. This W result also carries over to DRE# . This is not the case for the inclusion and S equivalence problem. 6. S (2) intersection for DRE# and DRE# is pspacecomplete. and A is the intersection of A1 and A′ . r2 .
although expspace also remains possible. • Can the Glushkov construction given in Section 6.5 be extended such that it translates any weakly deterministic expression with counting into a CDFA? Kilpelainen and Tuhkanen [65] have given a Glushkovlike construction translating weakly deterministic expressions into a deterministic automata model. Weakly deterministic expressions have the advantage of being more expressive and more succinct than strongly deterministic ones. strongly deterministic expressions are conceptually simpler (as strong determinism does not depend on intricate interplays of the counter values) and correspond naturally to deterministic Glushkov automata. whereas for weak determinism. by our results. conp and pspace are the most likely. a class of languages much better understood than the weakly deterministic languages with counting. strongly deterministic expressions are expressivily equivalent to standard deterministic expressions. However. However. due to the lack of upperbounds on the counters. also carries over to strongly deterministic expressions with counting.8 Discussion We investigated and compared the notions of strong and weak determinism for regular expressions in the presence of counting. this can still be anything between ptime and pspace. this automaton model is mostly designed for doing membership testing and complexity bounds can not directly be derived from it. .110 Deterministic Regular Expressions with Counting 6. one might wonder if the weak determinism demanded in the current standards for XML Schema should not be replaced by strong determinism. For these reasons. • What are the exact complexity bounds for inclusion and equivalence of strongly and weakly deterministic expressions with counting? For strong determinism. The answer to some of the following open questions can shed more light on this issue: • Is it decidable if a language is deﬁnable by a weakly deterministic expression with counting? For standard (weakly) deterministic expressions. Moreover. The latter also makes strongly deterministic expressions easier to handle as witnessed by the pspace upperbound for inclusion and equivalence. whereas for weakly deterministic expressions only a trivial expspace upperbound is known. a decision algorithm is given by BruggemannKlein and Wood [17] which.
Part II Applications to XML Schema Languages 111 .
.
[77] subsequently characterized the expressiveness of singletype EDTDs in several syntactic and semantic ways. XML Schema [100] extends the expressiveness of DTDs by a typing mechanism allowing contentmodels to depend on the type rather than only on the label of the parent. they deﬁned an extension of DTDs equivalent in expressiveness to singletype EDTDs: ancestorguarded DTDs. [85] therefore abstracted XSDs by singletype EDTDs. Unrestricted application of such typing leads to the robust class of unranked regular tree languages [16] as embodied in the XML schema language Relax NG [22]. Martens et al. an XML schema deﬁnes a tree language. Murata et al. however. An advantage of this language is that it makes the expressiveness of XSDs more apparent: the content model of an element can only depend on regular string properties of the string formed by the ancestors of that element. restricts this typing: it forbids the occurrence of diﬀerent types of the same element in the same content model. The for historical reasons still widespread Document Type Deﬁnitions (DTDs) can then be seen as contextfree grammars with regular expressions at righthand sides which deﬁne the local tree languages [16]. The Element Declarations Consistent constraint in the XML Schema speciﬁcation. Ancestorbased DTDs can therefore be used as a typefree frontend for XML Schema. As they can be interpreted both .7 Succinctness of PatternBased Schema Languages In formal language theoretic terms. Among them. The latter language is commonly abstracted in the literature by extended DTDs (EDTDs) [88].
2. They considered regular (Reg). s). EDTDs. we study in this chapter the complexity of translating between the two semantics and into the formalisms of DTDs. there exist examples matching this upper bound. Furthermore. except the ones marked with a star. For all nonpolynomial complexities. Our results are summarized in Table 7. linear (Lin).114 Succinctness of PatternBased Schema Languages other semantics 2exp (83(1)) 2exp (83(6)) \ (85(1)) \ (85(6)) poly (89(1)) poly (89(7)) poly (89(1)) poly (89(7)) EDTD exp (83(2)) 2exp (83(7)) exp (85(2)) 2exp (85(7)) poly (89(2)) poly (89(8)) poly (89(2)) poly (89(8)) EDTDst exp (83(3)) 2exp (83(8)) exp (85(3)) 2exp (85(8)) poly (89(3)) poly (89(9)) poly (89(3)) poly (89(9)) DTD exp* (83(5)) 2exp (83(10)) exp* (85(5)) 2exp (85(10)) poly (89(6)) poly (89(12)) poly (89(6)) poly (89(12)) P∃ (Reg) P∀ (Reg) P∃ (Lin) P∀ (Lin) P∃ (SLin) P∀ (SLin) P∃ (DetSLin) P∀ (DetSLin) Table 7. Deterministic strongly linear (DetSLin) patterns are strongly linear patterns in which additionally all horizontal expressions s are required to be deterministic [17]. Theorem numbers are given between brackets. where r and s are regular expressions. These correspond to the regular expressions. and more DTDlike. rather than ancestorbased DTD. and XPath expressions of the form //w or /w. We. In the remainder of the chapter. however. fully specifying every allowed combination. Kasneci and Schwentick [62] studied the complexity of the satisﬁability and inclusion problem for patternbased schemas under the existential (∃) and universal (∀) semantics. A patternbased schema is a set of rules of the form (r. A snapshot of their results is given in the third and fourth column of Table 7. XPath expressions with only child (/) and descendant (//). These results indicate that there is no diﬀerence between the existential and universal semantics. in an existential and universal way. enforcing constraints only where necessary. An XML tree is then existentially valid with respect to a rule set if for each node there is a rule such that the path from the root to that node matches r and the child sequence matches s. whereas the universal semantics is more liberal. Both for the pattern languages . we use the name patternbased schema. show that with respect to succinctness there is a huge diﬀerence. respectively. as it emphasizes the dependence on a particular pattern language.1. and singletype EDTDs.1: Overview of complexity results for translating patternbased schemas into other schema formalisms. and strongly linear (SLin) patterns. horizontally matches s. The existential semantics is exhaustive. it is universally valid if each node vertically matching r.
there are translations only requiring polynomial size increase. as listed in Table 7.1. is it equivalent to a DTD? Outline.1. All results. and strongly linear expressions. For linear patterns the expressiveness is incomparable. Theorem numbers for the new results are given between brackets. the practical study in [77] shows that the sort of typing used in XSDs occurring in practice can be described by such patterns. are completeness results. we introduce patternbased schemas. our results show that the general class of patternbased schemas is illsuited to serve as a frontend for XML Schema due to the inherent exponential or double exponential size increase after translation. an alternative characterization of singletype EDTDs.1 Preliminaries and Basic Results In this section. Finally. 7. Furthermore. In Section 7. and introduce some additional terminology for describing succinctness results. .2.2. Preliminaries and Basic Results simplification exptime (83(4)) exptime (83(9)) pspace (85(4)) pspace (85(9)) pspace (89(4)) pspace (89(10)) in ptime (89(5)) in ptime (89(11)) satisfiability exptime [62] exptime [62] pspace [62] pspace [62] pspace [62] pspace [62] in ptime [62] in ptime [62] inclusion exptime [62] exptime [62] pspace [62] pspace [62] pspace [62] pspace [62] in ptime [62] in ptime [62] 115 P∃ (Reg) P∀ (Reg) P∃ (Lin) P∀ (Lin) P∃ (SLin) P∀ (SLin) P∃ (DetSLin) P∀ (DetSLin) Table 7. Fortunately. unless indicated otherwise. the universal semantics is exponentially more succinct than the existential one when translating into (singletype) extended DTDs and ordinary DTDs. In Section 7. respectively.7. we study the complexity of the simpliﬁcation problem: given a patternbased schema. Reg and Lin. recall some known theorems. Only when resorting to SLin patterns.4. 7. we study patternbased schemas with regular.2: Overview of complexity results for patternbased schemas. Our results further show that the expressive power of the existential and the universal semantics coincide for Reg and SLin. we introduce some additional deﬁnitions concerning schema languages.3. and patternbased schemas. albeit a translation can not avoid a double exponential size increase in general in the former case. linear. and 7.
respectively. Denote by P∃ (t) = {v ∈ Dom(t)  ∃(r. (rm . s) ∈ P.116 Succinctness of PatternBased Schema Languages 7. we make use of unary trees. 3. s). . s1 ). Each pair (ri . if w ∈ T∃ (P ) then for every preﬁx w′ of w and every nonleaf node v of w′ . Deﬁnition 73. v ∈ P∃ (w′ ).1. s) ∈ P it holds that ancstr(v) ∈ L(r) implies chstr(v) ∈ L(s). Denote by P∀ (t) = {v ∈ Dom(t)  ∀(r. When for every string w ∈ Σ∗ there is a rule (r. a tree t and a string w. In this case. ancstr(v) ∈ L(r) ⇒ chstr(v) ∈ L(s)} the set of nodes in t that are universally valid. . si are regular expressions. si ) of a patternbased schema represents a schema rule. A tree t is universally valid with respect to a patternbased schema P if. . v ∈ P∀ (t). s) ∈ P such that ancstr(v) ∈ L(r) and chstr(v) ∈ L(s). and each rule (r. A patternbased schema P is a set {(r1 . There are two semantics for patternbased schemas. Deﬁnition 72. t ∈ T∃ (P ) if and only if for every node v of t. We also refer to the ri and si as the vertical and horizontal regular expressions.1 PatternBased Schema Languages We recycle the following deﬁnitions from [62]. Similarly. sm )} where all ri . L(r) ∩ L(r′ ) = ∅. s′ ) ∈ P of diﬀerent rules. In this context. A tree t is existentially valid with respect to a patternbased schema P if. ancstr(v) ∈ L(r)∧chstr(v) ∈ L(s)} the set of nodes in t that are existentially valid. for every node v of t. Further. v ∈ P∃ (t). there is a rule (r. which can be represented as strings. then we say that P is complete. we refer to the last position of w as the leaf of w. For a patternbased schema P . if w ∈ T∀ (P ) then for every preﬁx w′ of w and every nonleaf node v of w′ . 1. In some proofs. we write P =∃ t. 2. . when for every pair (r. We denote the set of Σtrees which are existentially and universally valid with respect to P by T∃Σ (P ) and T∀Σ (P ). s) ∈ P. In this case. respectively. v ∈ P∀ (w′ ). Deﬁnition 74. we write P =∀ t. we abuse notation and write for instance w ∈ T∃ (P ) meaning that the unary tree which w represents is existentially valid with respect to P . for every node v of t. (r′ . s) ∈ P such that w ∈ L(r). We conclude this section with two basic results on patternbased schemas. t ∈ T∀ (P ) if and only if for every node v of t. Lemma 75. We often omit Σ if it is clear from the context what the alphabet is. . then we say that P is disjoint. 4.
Proof. λ). δ). and vice versa. Whenever we translate among patternbased schemas. s) such that ancstr(v) ∈ L(r) and thus chstr(v) ∈ L(s).2 Characterizations of (Extended) DTDs We start by introducing another schema formalism equivalent to singletype EDTDs. We show that if P is complete and disjoint. q ′ ) ∈ δ. where q ∈ Q is the state such that q0 ⇒A. we drop this assumption. we add a sink state q to Q and for every pair q ∈ Q. Proof. .3) These are in fact just a restatement of the deﬁnition of universal and existential satisfaction and are therefore trivially true. a ∈ Σ. (4) The proof of (2) carries over literally for the existential semantics. v ∈ P∀ (w) and thus v ′ ∈ P∀ (w). Since w′ is a preﬁx of w. suppose v ∈ P∀ (t). v ∈ P∃ (t) if and only if v ∈ P∀ (t). Lemma 76.7. with A = (Q. Then. q0 .1. First. An automatonbased schema D over vocabulary Σ is a tuple (A. we implicitly assume that this is also the case for automatonbased schemas and the patternbased schemas deﬁned next. s) ∈ P such that ancstr(v) ∈ L(r) and chstr(v) ∈ L(s). 7. q0 . δ). T∃ (P ) = T∀ (P ). Lemma 78. there ′ must be a node v of w such that ancstrw (v) = ancstrw (v ′ ) and chstrw (v) = ′ chstrw (v ′ ). Any automatonbased schema D can be translated into an equivalent singletype EDTD D′ in time at most quadratic in the size of D. for which there is no transition (q. q0 . F ) is a DFA and λ is a function mapping states of A to regular expressions. Obviously. be an automatonbased schema. (2) Consider any nonleaf node v ′ of w′ . Let D = (A. δ. there is a rule (r. That is. It follows that v ∈ P∃ (t). We start by making A complete. where A = (Q. (1. suppose v ∈ P∃ (t). Automatonbased schemas are implicit in [77] and introduced explicitly in [76]. chstr(v) ∈ L(λ(q)). then for any node v of any tree t.1. ancstr(v) ∈ L(r′ ) for any other vertical expression r′ in P . Because DTDs and EDTDs only deﬁne tree languages in which every tree has the same root element. λ). By Lemma 75(1). and by the disjointness of P . For any complete and disjoint patternbased schema P . we often omit F and represent A as a triple (Q. Conversely.ancstr(v) q. a. A tree t is accepted by D if for every node v of t. The lemma then follows from Lemma 75(1) and (3). By the completeness of P there is at least one rule (r. this does not inﬂuence any of the results of this chapter. It thus follows that / v ∈ P∀ (t). Because the set of ﬁnal states F of A is not used. Remark 77. Preliminaries and Basic Results 117 Proof.
λ) with A = (Q. Σ′ . Construct D′ = (Σ. we see that the number of types in D′ can never be exceeded by the number of transitions in A.1. there exists a tree t ∈ L(d) and a node u ∈ Dom(t) such that labt (u) = ai . For every EDTD D. qj ) ∈ δ. A is guaranteed to be deterministic. . Conversely. . s. qn } for some n ∈ N. λ(ai ) = µ(d(ai )). µ) be a singletype EDTD. For the time complexity of the algorithm. Note that since D is a singletype EDTD. 2. λ(q ) = ∅. For any type ai ∈ Σ′ and any string w ∈ L(d(ai )) there exists a tree t ∈ L(d) which contains a node v with labt (v) = ai and chstrt (v) = w. bj ∈ Σ′ . sd . with L(D) = L(D′ ). A regular tree language is deﬁnable by a DTD if and only if it is closed under labelguarded subtree exchange. si . b. aj is guaranteed to exist and since A is a DFA aj is uniquely deﬁned. q0 = s. for any type ai ∈ Σ′ . and for ai . Lemma 80 ([77]). Σ′ . bj ) ∈ δ if µ(bj ) = b and bj occurs in d(ai ). let D = (Σ. Let Q = Σ′ . qi ) ∈ δ. A regular tree language T is closed under labelguarded subtree exchange if it has the following property: if two trees t1 and t2 are in T . d. Then. This notion is graphically illustrated in Figure 7. d. Let Q ∪ {q } = {q0 . a. can be constructed in time polynomial in the size of D. Since A is complete. to every type one regular expression from D′ is assigned which yields a quadratic algorithm. . Σ′ . and there are two nodes v1 in t1 and v2 in t2 with the same label. a trimmed EDTD D′ . a. . Finally. An EDTD D = (Σ. Lemma 79 ([88]). Let si be such that s is the root symbol of any tree deﬁned by D and (q0 .118 Succinctness of PatternBased Schema Languages t1 v1 ∈ T v2 t2 ∈T ⇒ t1 v2 ∈ T Figure 7. then t1 [v1 ← subtreet2 (v2 )] is also in T . 1. (ai . Let D be a trimmed EDTD.1: Closure under labelguarded subtree exchange we add (q. Finally. µ) as follows. Further. q0 . d(ai ) = λ(qi ). where any symbol a ∈ Σ is replaced by aj when (qi . s. q ) to δ. d. then Σ′ = {ai  a ∈ Σ ∧ qi ∈ Q} and µ(ai ) = a. The equivalent automatonbased schema D = (A. µ) is trimmed if for for every ai ∈ Σ′ . δ) is constructed as follows. .
replace every symbol a ∈ ∆ / in r by ∅. can be constructed in time linear in the size of r. . Similarly.7. . (2) The algorithm proceeds in two steps. such that L(r− ) = L(r) ∩ ∆∗ . Let r1 . Let L ⊆ Σ∗ be a regular language and suppose there exists a set of pairs M = {(xi . 2. . . . and ∅ + s = s + ∅ = s. sm be regular expressions. which can be done in polynomial time according to Theorem 2(4). construct an NFA Ai . use the following rewrite rules on subexpressions of r as often as possible: ∅∗ = ε. which can be done in time polynomial in the size of B. an expression r− . This gives us r− which is equal to ∅ or does not contain ∅ at all. Theorem 82.3 Theorems We use the following theorem of Glaister and Shallit [45]. Finally. A regular expression r. this can be done in time O(2k·n ). with L(r) = i≤n L(ri )\ i≤m L(si ). the DFA C accepts L(A) ∩ L(B ′ ) and can again be obtained by a product construction on A and B ′ which requires polynomial time in the sizes of A and B ′ . for every i ≤ n. Then. Proof. with L(r) = i≤n L(ri ) \ i≤m L(si ) can be obtained from C in time exponential in the size of C (Theorem 2(1)) and thus yields a double exponential algorithm in total.1. sj . . ∅s = s∅ = ∅. C is of exponential size in function of the input. s1 . j ≤ n and i = j. . For k the size of the largest NFA. compute the DFA B ′ for the complement of B by making B complete and exchanging ﬁnal and nonﬁnal states in B. For every i ≤ m. / Then any NFA accepting L has at least n states. . Preliminaries and Basic Results 119 7. i ≤ n. Then. construct an NFA Bi . with L(r− ) = L(r) ∩ ∆∗ . First. (1) First. Then. wi )  1 ≤ i ≤ n} such that • xi wi ∈ L for 1 ≤ i ≤ n. r. B can also be computed in time exponential in the size of the input. such that L(ri ) = L(Ai ). Theorem 81 ([45]). with L(si ) = Bi . 1. Then. j ≤ m. let A be the DFA accepting i≤n L(Ai ) obtained from the Ai by determinization followed by a product construction. For any regular expressions r and alphabet ∆ ⊆ Σ. We make use of the following results on transformations of regular expressions. can be constructed in time double exponential in the sum of the sizes of all ri . Therefore. and let Bi be the DFA accepting i≤m L(Bi ) again obtained from the Bi by means of determinization and a product construction.1. . and • xi wj ∈ L for 1 ≤ i. rn .
4 Succinctness As the focus in this chapter is on succinctness results we introduce some additional notation to allow for a succinct notation of such results. 4. a monotonically increasing function g : N → N and an inﬁnite family of schemas sn ∈ S with sn  ≤ g(n) such that the smallest s′ ∈ S ′ with L(s) = L(s′ ) is at least of size f (g(n)). . F F We write S ⇒ S ′ if S → S ′ and there is an f ∈ F . hold for those elements in S which do have an equivalent element in S ′ . simplification for P∃ (Reg) is exptimecomplete. 3. we study the full class of patternbased schemas which we denote by P∃ (Reg) and P∀ (Reg). exp 2exp 2. Essentially all these double exponential lower bounds are due to the fact that in these translations one necessarily has to apply operations. such as intersection and complement. nk k. In this case we also write S → S ′ and S ⇒ S ′ whenever F F and k. Theorem 83. whereas the translation from a schema under existential semantics to an EDTD is “only” exponential. Further. 7. as shown in Chapter 3. on regular expressions.c c2 s′ ∈ S ′ . respectively. exp and 2exp we denote the classes of functions k. In the translation from a patternbased schema under existential semantics to an EDTD such operations are not necessary which allows for an easier translation. and the translation from a patternbased schema under universal semantics to an EDTD are double exponential. exp 1.2 Regular PatternBased Schemas In this section.c c22 . P∃ (Reg) ⇒ EDTDst . This also implies that s′  ≤ f (s). L(s′ ) = L(s).c cnk . For a class S and S ′ of representations of schema languages. which. P∃ (Reg) ⇒ P∀ (Reg). we write S → S ′ if there is an f ∈ F such that for every s ∈ S there is an s′ ∈ S ′ with L(s) = L(s′ ) which can be constructed in time f (s). The results are shown in Theorem 83.1. Notice that the translations among schemas with diﬀerent semantics. and F a class F of functions from N to N. yields double exponential lower bounds.120 Succinctness of PatternBased Schema Languages 7. respectively. By poly. By L(s) we mean the set of trees deﬁned by s. P∃ (Reg) ⇒ EDTD. we write S → S ′ if there exists an s ∈ S such that for every F F nk S → S ′ and S ⇒ S ′ .
10. rC deﬁnes any word w which is deﬁned by all vertical expressions contained in C but is not deﬁned by any vertical expression not contained in C. . P∀ (Reg) ⇒ P∃ (Reg). . . We show that we can construct a complete and disjoint patternbased schema P ′ such that T∃ (P ) = T∃ (P ′ ) in time double exponential in the size of P . ′ and thus v ∈ P∃ (t). n} be the unique set for which ancstr(v) ∈ L(rC ) and chstr(v) ∈ L(sC ). Let P = {(r1 . ancstr(v) ∈ L(rC ) and chstr(v) ∈ L(sC ). By Theorem 82(1). n}. such an i must exist. Since v ∈ P∃ (t).7. P ∃ ∃ it suﬃces to prove that for any node v of any tree t. ancstr(v) ∈ L(ri ) and chstr(v) ∈ L(si ). 8. By Lemma 75(3). . 9. the expressions rC can be constructed in time double exponential in the size of the ri and si . Proof. (rn . . ∅)} ∪ {(rC . by deﬁnition of rC and sC . . C = ∅ and there is an i ∈ C with chstr(v) ∈ L(si ). .2. v ∈ P∃ (t) if and only if ′ v ∈ P∃ (t): ′ • v ∈ P∃ (t) ⇒ v ∈ P∃ (t): Let C = {i  ancstr(v) ∈ L(ri )}. simplification for P∀ (Reg) is exptimecomplete. 7. 2exp 2exp 2exp 2exp 2exp We conclude by showing that P ′ can be constructed from P in time double exponential in the size of P . That is. ′ • v ∈ P∃ (t) ⇒ v ∈ P∃ (t): Let C ⊆ {1. T∃ (P ′ ) = T∀ (P ′ ) and thus T∃ (P ) = T∀ (P ′ ). For any nonempty set C ⊆ {1. Here. . . sn )}. P = {(r∅ . . from which it follows that v ∈ P∃ (t). and choose some i ∈ C for which chstr(v) ∈ L(si ). We show that T (P ) = T (P ′ ). . By Lemma 76. . Then. . By deﬁnition of sC . P∀ (Reg) ⇒ DTD. . ′ is disjoint and complete. n} ∧ C = ∅}. sC )  C ⊆ {1. The expressions sC can easily be constructed in linear time by taking the disjunction . P∃ (Reg) → DTD. .i∈C L(ri ) and by r∅ the expres/ sion deﬁning Σ∗ \ 1≤i≤n L(ri ). (1) We ﬁrst show P∃ (Reg) → P∀ (Reg). 6. P∀ (Reg) ⇒ EDTD. and. . Regular PatternBased Schemas exp 121 5. P∀ (Reg) ⇒ EDTDst . s1 ). denote by rC the regular expression which deﬁnes the language i∈C L(ri )\ 1≤i≤n. Denote by sC the expression deﬁning the language ′ i∈C L(si ). But then. Then.
the complement of L(sn ). First. there exists a string u such that wu ∈ / L(rn ). v ∈ P∀ (w). Let n ∈ N. / / w ∈ T∃ (Pn ) = T∀ (P ). So. let n ∈ N and let rn be a regular expression over Σ satisfying the conditions of Lemma 84. For every n ∈ N. By Lemma 84. First. Furthermore. Finally. any rule (rC . thereby proving that the size of P must be at least double exponential in n. Σ)}. there exists a string u such that wu ∈ L(rn ). Let Σa = Σ ⊎ {a}. Lemma 84. ε). s) ∈ P ∧ ε ∈ L(s)} as the set of vertical regular expressions in P whose / corresponding horizontal regular expression does not contain the empty string. and thus wu ∈ T∃ (Pn ) by deﬁnition of Pn and so wu ∈ T∀ (P ). Here. We now show that L(r) = Σ∗ \ L(rn ). As w is not deﬁned by any expression in U . for every nonleaf node v of w. there exists a regular expression sn of size linear in n over an alphabet Σ such that any regular expression deﬁning Σ∗ \ L(sn ) must be of size at least double exponential in n. w ∈ T∃ (P ) = T∀ (P ). / wa ∈ L(rn ). and we must construct an exponential number of these rules. the expression s which deﬁnes L(r) ∩ Σ∗ deﬁnes exactly L(sn ) \ Σ∗ . since . Further. we slightly extend Theorem 15. deﬁne Pn = {(rn . for some r′ ∈ U . We show that any expression r deﬁning the complement of rn must be of size at least double exponential in n. Deﬁne rn = sn + Σ∗ a as all strings which are deﬁned by sn or have a as last a symbol. s′ ) ∈ P with w ∈ L(r′ ) it holds that ε ∈ L(s′ ). It follows that r must also be of size at least double exponential in n. Then. s can be computed from r in time linear in the size of r. by Lemma 75(1). there is a regular expressions rn of size linear in n such that any regular expression r deﬁning Σ∗ \ L(rn ) is of size at least double exponential in r. T∃ (Pn ) deﬁnes all unary trees w for which w ∈ L(rn ). w ∈ T∀ (P ) which leads to the desired contradiction. which yields and algorithm of double exponential time complexity. v ∈ P∀ (w). Then. for any rule (r′ . / Proof. Now. This complement consists of all strings which do not have a as last symbol and are not deﬁned by sn . But. and again towards a contradiction suppose w ∈ L(rn ). sC ) requires at most double exponential time to construct. and thus for the leaf node v of w. suppose w ∈ L(r′ ). 2exp To show that P∃ (Reg) ⇒ P∀ (Reg). s must be of size at least double exponential in n and by Theorem 82(2). But then. let w ∈ L(rn ) and towards a contradiction suppose w ∈ L(r). by Theorem 15. Deﬁne U = {r  (r. By Theorem 15. let r be the disjunction of all expressions in U . (Σ∗ . Let P be a patternbased schema with T∃ (Pn ) = T∀ (P ). So. note that rn satisﬁes the extra condition: for every w ∈ L(rn ). rn has the property that for any string w ∈ L(rn ). Conversely. By Lemma 75(2). Then.122 Succinctness of PatternBased Schema Languages of the right expressions.
δ) is the product automaton for A1 . Finally. which can be done in exponential time (Proposition 3. al. Furthermore. which is exptimecomplete [62]. w ∈ T∀ (P ) by / Lemma 75(1). . exp exp (4) For the upper bound. .qi ∈Fi L(si ) is simply the disjunction of these si and can be constructed in linear time. an NTA(NFA) is a nondeterministic tree automaton where the transition relation is represented by an NFA. . s′ ) in P it holds that ε ∈ L(s′ ). Recall also that DTD(NFA) is a DTD where content models are deﬁned by NFAs. we ﬁrst construct an NTA(NFA) AP with L(AP ) = T∃ (P ). construct for every ri a DFA Ai = (Qi . DP is of size exponential in P . a DTD(NFA) DP such that L(AP ) ⊆ L(DP ) is always true and L(AP ) = L(DP ) holds if and only if L(AP ) is deﬁnable by a DTD. and by deﬁnition of U for the rule (r′ . in time polynomial in the size of AP . . Let P be a patternbased schema . if m is the size of the largest vertical expression in P . Given a patternbased schema P . Then. exp Further. A = (Q1 × · · · × Qn . We construct an automatonbased schema D = (A.qi ∈Fi L(si ). which implies P∃ (Reg) → EDTD. . (23) We ﬁrst show P∃ (Reg) → EDTDst . This gives us an exptime algorithm overall. Therefore. . Thereto. An . qn ). Fi ). the total construction can be carried out in exponential time. λ) such that L(D) = T∃ (P ). . (q1 . δi . In the following. .3 in [62]). This can again be done in exponential time (Proposition 3. For the lower bound. . DP and A¬P are of size at most exponential in the size of P . . Here. let P = {(r1 . Therefore. then A is of size O(2m·n ). The latter then implies exp P∃ (Reg) ⇒ EDTDst . s1 ). Here. . T∃ (P ) ⊆ L(DP ) and T∃ (P ) is deﬁnable by a DTD if and only if T∃ (P ) = L(DP ). . / It follows that the leaf node v of w is not in P∀ (w). Summarizing. . construct another NTA(NFA) A¬P which deﬁnes the complement of T∃ (P ). λ((q1 . al [77]. T∃ (P ) = L(DP ) if and only if L(DP ) ∩ L(A¬P ) = ∅. By Lemma 78. which again gives us the desired contradiction. . and λ((q1 . Now.2. [77] have shown that given any NTA(NFA) AP it is possible to construct. which is shown in Theorem 85(2). we combine a number of results of Kasneci and Schwentick [62] and Martens et. . D can then be translated into an equivalent singletype EDTD in polynomial time and the theorem follows. . . Then. . Regular PatternBased Schemas 123 w ∈ L(r′ ). Since T∃ (P ) ⊆ L(DP ). This concludes the proof of Theorem 83(1). qn )) = i≤n. such that L(ri ) = L(Ai ). sn )}. P∃ (Reg) ⇒ EDTD already holds for a restricted version of patternbased schemas. Martens et. and testing the nonemptiness of their intersection can be done in time polynomial in the size of DP and A¬P .7. . qi . qn )) = ∅ if none of the qi are accepting states for their automaton. an expression for i≤n. First. (rn .3 in [62]). . we reduce from satisfiability of patternbased schemas.
we use the following similar approach which does not need to translate regular expressions into NFAs and back. In this algorithm. This can be done in exponential time according to Theorem 83(3). all expressions of D deﬁne unions of the language deﬁned by the expressions in D1 . Deﬁne P ′ = {(r∅ .124 Succinctness of PatternBased Schema Languages over the alphabet Σ. Σ∗ )} ∪ {(rC . (rn . T∃ (P ′ ) = T∀ (P ′ ). If C = ∅. We take the same approach as in the proof of Theorem 83(1). s1 ). use the polynomial time algorithm of Martens et al [77]. . P ′ is disjoint and complete and. and for any nonempty set C ⊆ {1. b + c). d(b) = e. (ab. . Since every DTD is closed under labelguarded subtree exchange (Lemma 79). into a normal DTD by means of state elimination would give us a double exponential algorithm. the theorem follows. . sC  C ⊆ {1. (ac. Here. by the same argumentation as in the proof of Theorem 83(1). s)  (r. n} ∧ C = ∅}. We show P∃ (Reg) → DTD. If T∃ (P ) = ∅. can be constructed in time double exponential in the size of P ′ . 2exp (5) First. Let r∅ deﬁne / Σ∗ \ i≤n L(ri ) and let sC be the expression deﬁning the language i∈C L(si ). then ancstr(v) ∈ L(r∅ ) and the horizontal regular expression Σ∗ allows every childstring. D is constructed in exponential time. be done by taking the disjunction of expressions in D1 . . by Lemma 76. then the following DTD d deﬁnes T∃ (P ′ ): d(a) = b + c. but a(b(e(t))) ∈ T∃ (P ′ ) which yields the desired contradiction. obtained in the previous proof. In total. sn )}. / exp based schemas (Theorem 89(6)). If C = ∅. and a(c(e(t))) ∈ L(D). By Lemma 75(1). (abe. d(e) = ε. but have to make some small changes. v ∈ P∀ (t) if and only if v ∈ P∀ (t): ′ • v ∈ P∀ (t) ⇒ v ∈ P∀ (t): Let C = {i  ancstr(v) ∈ L(ri )}. if there exists some tree t ∈ T∃ (P ). and deﬁne the patternbased schema P ′ = {(a. d(c) = e. This can. . it suﬃces to prove that ′ for any node v of any tree t. deﬁne ΣP = {a. ε). n} let rC be the regular expression deﬁning i∈C L(ri )\ 1≤i≤n. c. to construct an equivalent DTD D. s) ∈ P }. So. suppose towards a contradiction that there exists a DTD D such that L(D) = T∃ (P ′ ). . Conversely. Therefore. a(b(e(t))) ∈ L(D) also holds. e} ⊎ Σ. . . ε)} ∪ {(acer.i∈C L(ri ). . Simply translating the DTD(NFA). We show that T∀ (P ) = T∀ (P ′ ) from which T∀ (P ) = T∃ (P ′ ) then follows. b. Because of the disjointness of P ′ no other vertical regular ′ expression in P ′ can deﬁne ancstr(v) and thus v ∈ P∀ (t). (6) We ﬁrst show P∀ (Reg) → P∃ (Reg). a(b(e)) ∈ L(D). (ace. Since exptime is closed under complement. construct a singletype EDTD D1 such that L(D1 ) = T∃ (P ). We show that T∃ (P ′ ) is deﬁnable by a DTD if and only if P is not existentially satisﬁable. of course. P∃ (Reg) → DTD already holds for a restricted version of pattern . e). e). . . First. Then. Let P = {(r1 . . Then.
. . To show that any node with these child and ancestorstrings must be in P∃ (t). s− can be constructed from s in linear time. with a ∈ L(r) and w ∈ L(s). then for all i ∈ C. . ∅)  b ∈ Σ}. where r is the corresponding vertical expression for s. for a ∈ Σ. We show that t = a(w) ∈ T∃ (P ) = T∀ (Pn ). for all i ∈ C. and thus. Let n ∈ N. which implies that w ∈ K. If C = ∅. v ∈ P∃ (t). T∀ (Pn ) contains all trees a(w). suppose w ∈ L(s− ) for some s− ∈ U . All other nodes v are leaf nodes with chstr(v) = ε and ancstr(v) = ab. That is. ancstr(v) ∈ L(ri ) and chstr(v) ∈ L(si ). For an expression s. w ∈ L(rK ). t′ = a(w′ ) ∈ T∀ (Pn ) = T∃ (P ) and thus there is a leaf node v ′ in t′ for which ancstr(v ′ ) = ab and chstr(v ′ ) = ε. ′ • v ∈ P∀ (t) ⇒ v ∈ P∀ (t): Let C ⊆ {1. ri )  / i ≤ m} ∪ {(ab. Deﬁne Pn over the alphabet Σa = Σ ⊎ {a}. First. It follows that there must be a rule (r. ε)  b ∈ Σ} ∪ {(b. then ancstr(v) is not deﬁned by any vertical expression in P and thus v ∈ P∀ (t). we know that chstr(v) = w ∈ L(s− ). rm of size linear in n such that any regular expression deﬁning i≤m L(ri ) must be of size at least double exponential in n. ancstr(v) ∈ L(rC ) and chstr(v) ∈ L(sC ). denote by s− the expression deﬁning all words in L(s) ∩ Σ∗ . thereby proving that the size of P must be at least double exponential in n. n} be the unique set for which (rC . According to Theorem 20(3). . where w ∈ K. Now w ∈ Σ∗ implies w ∈ L(s− ). by deﬁnition of rC and sC . let rK be the disjunction of all expressions in U . / ancstr(v) ∈ L(ri ). t = a(w) ∈ T∀ (Pn ) = T∃ (P ). by deﬁnition of U and rK . By Lemma 75(3). Then. Therefore. It follows that v ∈ P∀ (t). Finally. For brevity. Let P be a patternbased schema with T∀ (Pn ) = T∃ (P ). We now show that L(rK ) = K. and by deﬁnition of U . by Lemma 75(3) 2exp . But then. there exist a linear number of regular expressions r1 . Regular PatternBased Schemas 125 since v ∈ P∀ (t). let w ∈ K. as Pn = {(a. For the root node v of t. Then. sC ) ∈ P ′ . chstr(v) ∈ L(si ). . it suﬃces to show that every node v of t is in P∃ (t). Since ′ v ∈ P∀ (t) and by the disjointness and completeness of P ′ there indeed exists exactly one such set. . . by Lemma 75(3). If C = ∅. . Conversely. Since. that ancstr(v) = a ∈ L(r). ancstr(v) ∈ L(rC ) and chstr(v) ∈ L(sC ).2. According to Theorem 82(2). Therefore. and for all i ∈ C. Otherwise b is useless and can be removed from Σ. s) ∈ P . deﬁne K = i≤m L(ri ). s) ∈ P ∧ a ∈ L(r)} as the set of horizontal regular expressions whose corresponding vertical regular expressions contains the string a.7. Deﬁne U = {s−  (r. combined with ′ the disjointness of P ′ gives v ∈ P∀ (t). / We now show that P∀ (Reg) ⇒ P∃ (Reg). note that for every symbol b ∈ Σ there exists a string w′ ∈ K such that w′ contains a b. where b ∈ Σ since w ∈ L(s− ). the root node v of t is in P∃ (t).
. also any leaf node v of t with ancstr(v) = ab is in P∃ (t). . an XPath expression \\a\\b\c captures all nodes that are labeled with c. is constructed in time exponential in the size of P .qi ∈Fi L(si ). s1 ). . and λ((q1 . . (rn . qn )) = i≤n. and by Theorem 20(1) a regular expression for λ((q1 . let λ((q1 . We construct an automatonbased schema D = (A. 2exp Further. . . which is shown in Theorem 85(7). with w0 . . Thereto. which is shown in Theorem 85(10).126 Succinctness of PatternBased Schema Languages v ′ ∈ P∃ (t′ ). . . wn ∈ Σ∗ . . we call an expression linear if it is of the form w0 Σ∗ · · · wn−1 Σ∗ wn . Formally. .qi ∈Fi L(si ) can be constructed in double exponential time. . By Lemma 78.3 Linear PatternBased Schemas In this section. We already know that A can be constructed in exponential time. . Denote the classes of linear . Notice that the DTD(NFA) D constructed in the above proof. P∀ (Reg) → DTD already holds for a restricted version of patternbased schemas (Theorem 89(12)). This corresponds to the regular expression Σ∗ aΣ∗ bc. conform the proof of Theorem 83(4). 2exp 2exp 2exp 2exp 2exp 7. sn )}. (10) First. For instance. It follows that the total construction can be done in double exponential time. (78) We ﬁrst show P∀ (Reg) → EDTDst . . . We ﬁrst show P∀ (Reg) → DTD. Finally. P∀ (Reg) ⇒ EDTD already holds for a restricted version of patternbased schemas. qn )) = Σ∗ if none of the qi are accepting states for their automaton. The latter implies P∀ (Reg) ⇒ EDTDst . λ) such that L(D) = T∀ (P ). which implies P∀ (Reg) → EDTD. following [62]. qn )) = i≤n. which can be done in exponential time (Theorem 2(1)). To obtain an actual DTD. we only have to translate the NFAs in D into regular expressions. let P = {(r1 . D can then be translated into an equivalent singletype EDTD and the theorem follows. have b as parent and have an a as ancestor. P∀ (Reg) ⇒ DTD already holds for a more restricted version of patternbased schemas. This yields a total algorithm of double exponential time complexity. It follows that t ∈ T∃ (P ) = T∀ (Pn ). For λ. we restrict the vertical expressions to XPath expressions using only descendant and child axes. A patternbased schema is linear if all its vertical expressions are linear. (9) The proof is along the same lines as that of Theorem 83(4). We construct A in exactly the same manner as in the proof of Theorem 83(3). and wi ∈ Σ+ for 1 ≤ i < n. .
P∃ (Lin) ⇒ EDTDst . Theorem 85 lists the results for linear schemas. Deﬁne U = {r  (r. Lemma 86. 5. 4. then it also deﬁnes words containing b. s) ∈ P ′ and ε ∈ L(s)} as the set of all vertical / 2exp 2exp 2exp exp exp . Given an alphabet Σ. which gives us the desired contradiction. Σ)}. 10. P∀ (Lin) ⇒ EDTDst . P∃ (Lin) ⇒ EDTD. ε). Then. T∃ (P ) deﬁnes all unary trees containing at least one b. 3. The complexity of simplification improves slightly. let P = {(Σ∗ bΣ∗ . . 8. Suppose to the contrary that such a list of linear expressions does exist. Theorem 85. Linear PatternBased Schemas 127 schemas under existential and universal semantics by P∃ (Lin) and P∀ (Lin). simplification for P∃ (Lin) is pspacecomplete. b Proof. but that the complexity of translating to DTDs and (singletype) EDTDs is in general not better than for regular patternbased schemas. P∃ (Lin) → DTD. simplification for P∀ (Lin) is pspacecomplete. Further.7. However. pspace instead of exptime. and. one of these expressions must contain Σ∗ because otherwise 1≤i≤n L(ri ) would be a ﬁnite language. Now. P∃ (Lin) → P∀ (Lin). 6. . . P∀ (Lin) ⇒ EDTD. . rn such that 1≤i≤n L(ri ) is an inﬁnite language and 1≤i≤n L(ri ) ⊆ L(Σ∗ ). exp 1. 2. Suppose that P ′ is a linear schema such that T∃ (P ) = T∀ (P ′ ). Then. P∀ (Lin) → P∃ (Lin). There does not exist a set of linear regular expression r1 . and a symbol b ∈ Σ. 7. P∀ (Lin) ⇒ DTD.3. (1) We ﬁrst prove the following simple lemma. we show that the expressive power of linear schemas under existential and universal semantics becomes incomparable. Proof. denote Σ \ {b} by Σb . 9. if an expression contains Σ∗ . (Σ∗ . respectively.
We show that the union of the expressions in U deﬁnes an inﬁnite language and is a subset of Σ∗ . It is not hard n n 1 2 1 2 to see that any NFA deﬁning Kn must be of size at least exponential in n. where s is the corresponding / horizontal expression for r. Indeed deﬁne M = {(x. s. s)} ∪ {(a. for some string w. ak+1 ∈ T∀ (P ′ ) which leads to the desired contradiction. let w ∈ L(r). First. deﬁne A = (Q. there exists an NFA A such that L(D) = L(A). and F = {a  a ∈ Σ′ ∧ ε ∈ d(a)}. such that T∃ (Pn ) = Kn . w)  xw ∈ Kn ∧x = n+1} which is of size exponential in n. But then. q0 . ak+1 ∈ L(r) for all vertical expressions in U / ′ and thus the leaf node of ak+1 is also in P∀ (ak+1 ). It contains the following rules: • #1 → a0 + a1 1 1 • For any i < n: – #1 Σ∗ a0 → a0 + a1 i i+1 i+1 . 1}. P∃ (Lin) → EDTDst follows immediately from Theorem 83(3). and by deﬁnition of U . Thereto. we show w ∈ Σ∗ . which means that w contains at least one b and / b thus w ∈ T∃ (P ) = T∀ (P ′ ). µ) be an EDTD. ak+1 b ∈ T∃ (P ) = T∀ (P ′ ) and thus by Lemma 75(2) every nonleaf node ′ v of ak+1 is in P∀ (ak+1 ). for some r ∈ U . Then. by Lemma 87. A can be computed from D in time linear in the size of D.128 Succinctness of PatternBased Schema Languages regular expressions in P ′ whose horizontal regular expressions do not contain the empty string. s. F ) as Q = {q0 } ∪ Σ′ . which by Lemma 86 proves that such b a schema P ′ can not exist. suppose that it does not. d. which then implies both statements. which again gives the desired contradiction. suppose w ∈ Σ∗ . by Lemma 75(1). δ. #1 . Σ′ . For any EDTD D for which L(D) is a unary tree language. ancstr(v) = w ∈ L(r). for the leaf node v of w. b0 . But then. / ′ ′ ). Then. Proof. µ(b). w ∈ T∀ (P / (23) First. every EDTD deﬁning the unary tree language Kn must also be of size exponential in n. v ∈ P∀ (w) and thus by Lemma 75(1). every expression r ∈ U is of the form r = w. #2 } ∪ 1≤i≤n {a0 . let n ∈ N. Deﬁne Σn = {$. chstr(v) = ε ∈ L(s). 1 ≤ k ≤ n}. Then. we ﬁrst characterize the expressive power of EDTDs over unary tree languages. b1 } and i i i i Kn = {#1 ai1 ai2 · · · ain $bi1 bi2 · · · bin #2  ik ∈ {0. Let D = (Σ. Now. Now. Second. δ = {(q0 . We conclude the proof by giving a patternbased schema Pn . to show that the union of these expressions deﬁnes an inﬁnite language. Further. Let k be the length of the longest such string w. exp Lemma 87. We exp show P∃ (Lin) ⇒ EDTD. a1 . Moreover. Then. b ∈ Σ′ ∧ b ∈ L(d(a))}. which is of size linear in n. such that L(D) is a unary tree language. b)  a. and satisﬁes the conditions of Theorem 81. Towards a b contradiction.
(ab. such that the tree t3 = t1 [v1 ← subtreet2 (v2 )] is not in T∃ (P ). e). Given r over alphabet Σ. (ace. We refer to such a tuple (t1 . . / For the upper bound. let ΣP = {a. b. which is a regular tree language. That is. P ′ are two linear schemas. respectively. and a(c(e(w′ ))) ∈ L(D) but a(c(e(w))) ∈ T∃ (P ). Conversely. d(b) = e. and deﬁne the patternbased schema P = {(a. then ′ there also exists a witness tuple (t′ . a(b(e(w))) ∈ L(D). Linear PatternBased Schemas – #1 Σ∗ a1 → a0 + a1 i i+1 i+1 129 – #1 Σ∗ a1 Σ∗ b1 → b0 + b1 i+1 i+1 i i • #1 Σ∗ a0 → $ n • #1 Σ∗ a1 → $ n • #1 Σ∗ $ → b0 + b1 1 1 • #1 Σ∗ a0 Σ∗ b0 → #2 n n • #1 Σ∗ a1 Σ∗ b1 → #2 n n • #1 Σ∗ #2 → ε – #1 Σ∗ a0 Σ∗ b0 → b0 + b1 i i i+1 i+1 (4) For the lower bound. then there exists a tree t′ of depth polynomial / with the same properties. (aceσ. Σ∗ ). deciding for a regular expression r whether L(r) = Σ∗ . Lemma 88. Let w. When P. t2 ) for a linear schema P . ε). We show that there exists a DTD D with L(D) = T∃ (P ) if and only if L(r) = Σ∗ . c. (ac.7. where t′ and t′ are of depth 2 1 1 polynomial in the size of P . Observe that T∃ (P ). we reduce from universality of regular expressions. The latter problem is known to be pspacecomplete [102]. they stated that if there exists a tree t with t ∈ T∃ (P ) but t ∈ T∃ (P ′ ). ε)  σ ∈ Σ}. We make use of techniques introduced by Kasneci and Schwentick [62]. we again make use of the closure under labelguarded subtree exchange property of DTDs. we show that T∃ (P ) is not closed under labelguarded subtree exchange. then the following DTD d deﬁnes T∃ (P ): d(a) = b + c. or simply a witness tuple. r)} ∪ {(abeσ. with labt1 (v1 ) = labt2 (v2 ). is not deﬁnable by any DTD if and only if there exist trees t1 . d(e) = Σ∗ . it then follows that T∃ (P ) is not deﬁnable by a DTD. t2 ∈ T∃ (P ) and nodes v1 and v2 in t1 and t2 . From Lemma 79. t2 ) as a witness to the DTDundeﬁnability of T∃ (P ).3. b + c). d(c) = e. they obtained the following property. (abe. Proof. and d(σ) = ε for every σ ∈ Σ. w′ be strings such that w ∈ L(r) and w′ ∈ L(r). If there exists a witness tuple (t1 . / Then. d} ⊎ Σ. if L(r) Σ∗ . In particular. t2 ) for P . If L(r) = Σ∗ . e).
Indeed. let (t1 . t′ can be constructed from t by searching for nodes v and v ′ of t such that v ′ is a descendant of v. deﬁned as t1 [v1 ← subtreet2 (v2 )]. labt (v) = labt (v ′ ) t t and FP (v) = FP (v ′ ). t2 [v2 ← ()]: the tree t2 without the subtree under v2 4. a vector FP (v) over N can be assigned with the following properties: t • along a path in a tree. subtreet2 (v3 ): the subtree under v3 in t2 . Since t3 ∈ T∃ (P ). to every node t v of t. such that t3 . 5. t1 [v1 ← ()]: the tree t1 without the subtree under v1 . subtreet2 (v2 )[v3 ← ()]: the subtree under v2 in t2 . we get a tree which is still at v existentially valid with respect to P and where no two nodes on a path in the tree have the same vector and label and which thus is of polynomial depth.130 Succinctness of PatternBased Schema Languages Let P be a linear patternbased schema and t a tree. Indeed. We will also use this technique. Then. v and v ′ occur in the same . t2 ) be a witness tuple for P and ﬁx nodes v1 and v2 of t1 and t2 . and since t1 ∈ T∃ (P ) also v ∈ P∃ (t1 ) and thus v ∈ P∃ (t3 ). Furthermore. Thereto. and replacing the subtree rooted at v by the one rooted ′ . FP (v) can take at most polynomially many values in the size of P . is not in T∃ (P ). occurring in t2 . we can partition the trees t1 / and t2 . but have to be a bit more careful in the replacements we carry out. in ﬁve diﬀerent parts as follows: 1. v3 must / occur in the subtree under v2 inherited from t2 . 2. So. subtreet1 (v1 ): the subtree under v1 in t1 . t t • if v ′ is a child of v. 3. with v3 ∈ P∃ (t3 ). v ′ such that v is an ancestor of v ′ . by Lemma 75(3). Now. respectively. ﬁx some node v3 . and label of v t • v ∈ P∃ (t) can be decided solely on the value of FP (v) and chstr(v). then FP (v ′ ) can be computed from FP (v) and the ′ in t. v and v ′ are not equal to v1 . This situation is graphically illustrated in Figure 7. every node v not in that subtree. By applying this rule as often as possible. and thereby also t3 . / there must be some node v3 of t3 with v3 ∈ P∃ (t3 ). v2 or v3 . then there exists a tree t′ of polynomial depth which existentially satisﬁes P . Then. let t′ and t′ be the trees obtained from t1 and t2 by repeating the 1 2 following as often as possible: Search for two nodes v. has the same vector and childstring as its corresponding node in t1 . Based on these properties it is easy to see that if there exists a tree t which existentially satisﬁes P .2. without the subtree under v3 .
We simply guess a witness tuple (t1 . t′ . Therefore. Here. the sets of states the appropriate automata can be in. for t′ = 3 2 1 ′ ′ t′ t t′ [v1 ← subtreet2 (v2 )]. we guess that we are now at the nodes v1 and v2 of t1 and t2 . By Lemma 88. b. which are initiated by the values of these of t1 . Since pspace is closed under complement. That is. So. Then.2: The ﬁve diﬀerent areas in t1 and t2 . P∃ (Lin) → DTD already holds for a restricted version of patternexp (6) Let Σ = {a. t2 ) and check in polynomial space whether it is a valid witness tuple. replace v by the subtree under v ′ . t2 ) is a valid witness for P . But 1 ′ / then. based schemas (Theorem 89(6)). t1 and t2 are guessed simultaneously and independently. v2 and v3 still occur in t′ and t′ . we also guess whether it belongs to t1 or t2 . lab(v) = lab(v ′ ) and FP1 (v) = FP1 (v ′ ) (or FP2 (v) = FP2 (v ′ ) if ′ both occur in t ). there does not exist a witness tuple for P . witness tuple in which t1 2 Now. Linear PatternBased Schemas 131 t1 v1 t2 v2 v3 t1 v2 v3 t3 t1 t2 Figure 7. If in the end. v3 ∈ P∃ (t′ ). T∃ (P ) is not deﬁnable by a DTD. and thus t′ and t′ are 1 2 1 2 of at most a polynomial depth. t′ ∈ T∃ (P ) still holds and 1 2 the original nodes v1 . Then. (5) First. At some point in this procedure.3. If T∃ (P ) is deﬁnable by a DTD. t2 ) is a 1 3 / 3 ′ and t′ are of at most polynomial depth. but the third tree is not. t t t t part of t1 or t2 . T∀ (P ) contains all trees in which whenever a c labeled node v has a b labeled node as ancestor. but the subsequent subtree take the values of t2 . any path in one of the ﬁve parts of t′ and t′ can have at most a polynomial depth.7. b)}. Therefore. (t′ . . From that point we maintain a third list of states of automata. If it is. we show that the problem is in pspace. v and v 2 Observe that. by the properties of F . Then. the theorem follows. Furthermore. then (t1 . c} and deﬁne P = {(Σ∗ bΣ∗ c. it suﬃces to guess trees of at most polynomial depth. FP3 (v3 ) = FP3 (v3 ) and chstrt3 (v3 ) = chstrt3 (v3 ). t1 and t2 are accepted. we guess t1 and t2 in depthﬁrst and lefttoright fashion. maintaining for each tree and each level of the trees. P∃ (Lin) → DTD follows immediately from Theorem 83(5). using Lemma 88. for each guessed symbol. which by Lemma 75(3) gives us t′ ∈ T∃ (P ).
an1 Σ∗ an2 Σ∗ · · · Σ∗ ank cΣ∗ 3. 2exp 2exp . Suppose there does exist a linear schema P ′ such that T∀ (P ) = T∃ (P ′ ). we deﬁne Pn over the alphabet Σ ⊎ {a} as Pn = {(a. . Then. L(rK ) = K. it follows from Lemma 80(2) that rK cannot contain an a. an1 Σ∗ an2 Σ∗ · · · Σ∗ ank c 2. According to Theorem 20(3). (9) The proof is along the same lines as that of Theorem 85(4). v ∈ P∃ (t2 ). P∀ (Lin) → EDTDst follows immediately from Theorem 83(3). T∀ (Pn ) deﬁnes all trees a(w). In the proof 2exp 2exp (78) First. Set K = i≤m L(ri ). P∀ (Lin) → DTD follows immediately from Theorem 83(10). But then. must also deﬁne trees not in T∀ (P ). Let D = (Σ. Let a → r be the single rule in D for the root element a. Choose some N ∈ N with N ≥ P ′  and deﬁne the unary trees t1 = aN baN cb and t2 = aN baN c. by Lemma 75(4). let n ∈ N. every nonleaf node v ′ of t2 is in P∃ (t2 ). There must be at least one as P ′ contains a ﬁnite number of rules. / t1 ∈ T∃ (P ′ ) and since t2 is a preﬁx of t1 . ε ∈ L(s) must hold and r is of one of the following forms: 1. Let (r. . . Then. Obviously. Let n ∈ N. and thus by Lemma 75(3). For P∀ (Lin) ⇒ DTD. Then. for which w ∈ K. we can assume that D is trimmed. an1 Σ∗ an2 Σ∗ · · · Σ∗ ank Σ∗ where k ≥ 2 and nk ≥ 0. ri )  i ≤ m} ∪ {(ab. That is. Finally. µ) be an EDTD with T∀ (P ) = L(D). Since D is trimmed. t1 ∈ T∀ (P ). and t2 ∈ T∀ (P ). a. Σ′ . rm of size linear in n such that any regular expression deﬁning i≤m L(ri ) must be of size at least double exponential in n. Let rK be the expressions deﬁning µ(L(r)). ε)  b ∈ Σ} ∪ {(b. . By Lemma 80(a). ancstr(v) ∈ L(r) for any of the three expressions given above and ε ∈ L(s) for its corresponding horizontal ′ expression. Then.132 Succinctness of PatternBased Schema Languages chstr(v) must be b. s) ∈ P ′ be a rule matching inﬁnitely many leaf nodes of the strings wℓ . which then implies both statements. (10) First. for the leaf node v of t2 . t2 ∈ T∃ (P ′ ) which completes the proof. Deﬁne wℓ = aℓ c for ℓ ≥ 1 and notice that wℓ ∈ T∀ (P ) = T∃ (P ′ ). Next. We show P∀ (Lin) ⇒ EDTD. ∅)  b ∈ Σ}. P∀ (Lin) → DTD already holds for a restricted version of patternbased schemas (Theorem 89(12)). d. We show that any linear schema P ′ deﬁning all trees in T∀ (P ) under existential semantics. which proves that the size of D must be at least double exponential in n. there exist a linear number of regular expressions r1 .
7. but not for the various translation problems. 3. we say that a regular expression is strongly linear if it is of the form w or Σ∗ w. Denote the class of all strongly linear patternbased schemas under existential and universal semantics by P∃ (SLin) and P∀ (SLin). P∃ (SLin) → P∀ (SLin) and poly poly poly P∃ (DetSLin) → P∀ (DetSLin). observe that the expressive power of these schemas under existential and universal semantics again coincides. where w is nonempty. following [62]. Strongly Linear PatternBased Schemas 133 of Theorem 85(7) we have deﬁned a linear patternbased schema Pn of size polynomial in n for which any EDTD D′ with T∀ (Pn ) = L(D′ ) must be of size at least double exponential in n. This is also the case for the simplification problem studied here. Furthermore. every DTD is an EDTD and the language T∀ (Pn ) is deﬁnable by a DTD. simplification for P∃ (DetSLin) is in ptime. P∃ (SLin) → DTD and P∃ (DetSLin) → DTD. 6. and strongly linear schemas where all horizontal expressions must be deterministic. poly poly 2. as deﬁned above. poly poly poly . it is observed that the type of a node in most realworld XSDs only depends on the labels of its parents and grand parents. we distinguish between strongly linear schemas. all horizontal expressions in a strongly linear schema are also required to be deterministic. 7. P∃ (SLin) → EDTD and P∃ (DetSLin) → EDTD. The latter requirement is necessary to get ptime satisfiability and inclusion which would otherwise be pspacecomplete for arbitrary regular expressions. Therefore. To capture this idea. First. which makes strongly linear schemas very interesting from a practical point of view. It follows that any DTD D with T∀ (Pn ) = L(D) must be of size at least double exponential in n. Theorem 89. which we call deterministic strongly linear schemas and denote by P∃ (DetSLin) and P∀ (DetSLin).4. 4. simplification for P∃ (SLin) is pspacecomplete. 1. all considered problems become tractable. P∃ (SLin) → EDTDst and P∃ (DetSLin) → EDTDst . Further.4 Strongly Linear PatternBased Schemas In [77]. as is the case for DTDs and XML Schema. 5. In [62]. Theorem 89 shows the results for (deterministic) strongly linear patternbased schemas. respectively. A patternbased schema is strongly linear if it is disjoint and all its vertical expressions are strongly linear.
given R: (1) S is ﬁnite and polynomial time computable. Σ∗ b. and aw ∈ Suﬃx(R). such that w ∈ Suﬃx(R). cbc. Deﬁne V as Suﬃx(R) \ r∈R L(r). (4) r∈R∪S L(r) = Σ . Proof. Every r ∈ R is of the form w or Σ∗ w. 11. For P = {(r1 . sn )}. we show how it implies the theorem. It suﬃces to show that. for some w. 12. Further. Here. let Suﬃx(R) = ∗ r∈R Suﬃx(r). . (3) r∈R L(r) ∩ ∗ s∈S L(s) = ∅. 9. ∅)}. s1 ). with w′  > w. Σ∗ ac. For instance. Before we prove this lemma. Σ∗ a. Σ∗ cbc. P∀ (DetSLin) → P∃ (DetSLin) also holds. . and. .134 poly Succinctness of PatternBased Schema Languages poly 7. and thus poly poly poly poly poly poly poly poly . The key of this proof lies in the following lemma: Lemma 90. . Σ∗ cc. Proof. note that the regular expressions in P ′ are copies of these in P . is a suﬃx in L(r) then r must be of the form Σ∗ w and thus for every a ∈ Σ. P∀ (SLin) → EDTD and P∀ (DetSLin) → EDTD. c}. When a string w′ . rn } satisfying the conditions of Lemma 90. let S be the set of strongly linear expressions for R = {r1 . Therefore. We ﬁrst show (1). a ﬁnite set S of disjoint strongly linear regular expressions can be constructed in polynomial time such that s∈S L(s) = Σ∗ \ r∈R L(r). . and. 8. Deﬁne U as the set of strings aw. Then. Set P ′ = P ∪ ′ ′ s∈S {(s. We ﬁnally give the proof of Lemma 90. simplification for P∀ (SLin) is pspacecomplete. P ′ is too. . bc} we have U = {bbc. for R = {Σ∗ abc. aw is also a suﬃx in L(r). P∀ (SLin) → DTD and P∀ (DetSLin) → DTD. a ∈ Σ. For each ﬁnite set R of disjoint strongly linear expressions. . the set S is polynomial time computable and therefore. This gives us T (P ) = T (P ′ ). P∀ (SLin) → P∃ (SLin) and P∀ (DetSLin) → P∃ (DetSLin). follows from Lemma 76 that T∃ (P ∀ ∃ ∀ By Lemma 90. For R a set of strongly linear regular expressions. ac. (1) We ﬁrst show P∃ (SLin) → P∀ (SLin). T∃ (P ) = T∃ (P ) and since P is disjoint and complete it ′ ) = T (P ′ ). P∀ (SLin) → EDTDst and P∀ (DetSLin) → EDTDst . / We claim that S = u∈U {Σ∗ u} ∪ v∈V {v} is the desired set of regular expressions. cc. (rn . 10. a} and V = {c} which gives us S = {Σ∗ bbc. (2) the expressions in S are pairwise disjoint. . w ∈ Σ . simplification for P∀ (DetSLin) is in ptime. for r there are only w suﬃxes in L(r) which can match the deﬁnition of U or V .
. such that wl ∈ Suﬃx(R). . If w is not a suﬃx of wr . This concludes / the proof of Lemma 90. w is accepted / by the expression Σ∗ al−1 · · · ak . and since aw = a′ w′ . we must check that the generated expressions are all pairwise disjoint. then w is accepted by an expression generated by V . Thereto. Strongly Linear PatternBased Schemas 135 aw ∈ U . But then. then it is easy to see that wl = ε is a suﬃx of every string accepted by every r ∈ R. Obviously. w′ ∈ V . When l = k + 1. ′ ′ (7) For P = {(r1 . but that is perfectly symmetrical). for any r ∈ R. So. suppose that their intersection is nonempty and thus w′ ∈ L(Σ∗ aw). · · · . for some r ∈ R. So assume w ∈ V . the number of strings in U and V is bounded / / by the number of rules in R times the length of the strings w occurring in the expressions in R. given / that wk+1 ∈ Suﬃx(R). let S = {r1 . every expression generated by V deﬁnes only one string. which by deﬁnition must be generated by U . Then. First. . sn )}. which case we already ruled out. s1 ). note that if l = k + 1. so two expressions generated by V always have an empty intersection. (rn . . . if l = 1 we show that wl = w can not be a suﬃx of any string deﬁned by any r ∈ R. Σ∗ a′ w′ generated by U have a nonempty intersection. . But then. If w ∈ V . Let w ∈ L(r). which also contradicts our assumptions. But w′ ∈ Suﬃx(R) and thus aw ∈ Suﬃx(R) must also hold. Let r be wr or Σ∗ wr . . rm } be the set of strongly linear expressions for R = {r1 .4. we can only conclude that w1 ∈ Suﬃx(R). For (3). aw must be a suﬃx of a′ w′ (or the other way around. Further. We show that there / exists an s ∈ S. So. which is a polynomial. aw must be a sufﬁx of w′ and we know by deﬁnition of V that w′ ∈ Suﬃx(R). we show (4). Now. So. r∈R L(r) ∩ s∈S L(s) = ∅. Then. Suppose to the contrary that w ∈ Suﬃx(r). we go from left to right through w and / search for the rightmost l ≤ k + 1 such that wl = al · · · ak ∈ Suﬃx(R). wl = ε. we are guaranteed to ﬁnd some l. we are done. we can also compute these strings in polynomial time. w ∈ L(r). Third. also aw ∈ Suﬃx(R) which contradicts the deﬁnition of U . times the number of alphabet symbols. Let w = a1 · · · ak . observe that they only deﬁne words which have suﬃxes that can not be suﬃxes of any word deﬁned by any expression in R. Therefore. / 1 < l ≤ k + 1. and w1 ∈ Suﬃx(R). and wl−1 = al−1 · · · ak ∈ Suﬃx(R). For the expression generated by U . Finally. such that w ∈ L(s). Then. The strings in V are explicitly deﬁned such that their intersection with r∈R L(r) is empty. It is only left to show that there indeed exists such an index l for w. For an expression Σ∗ aw generated by U and an string w′ in V . rn } satisfying the conditions of Lemma 90. which again contradicts the deﬁnition of U . . If w is a suﬃx of wr .7. suppose that two expressions Σ∗ aw. then r must be of the form Σ∗ wr and wr must be a suﬃx of w. . For (2). and wl−1 ∈ Suﬃx(R). Conversely. aw must be a suﬃx of w′ .
δ(w. λ(q) = si if and only if w ∈ L(ri ) and λ(q) = ∅ otherwise.(89) We show P∃ (SLin) → EDTDst . since singletype EDTDs are a subset of EDTDs and since we can translate a stronglylinear schema with universal semantics into an equivalent one with existential semantics in polynomial time (Theorem 89(7)). (rm . . Here. . 1. Deﬁne S = {w  w ∈ Preﬁx(wi ). s1 ). we construct A in polynomial time in a manner similar to the construction used in Proposition 5. the set S is polynomial time computable and therefore. . • δ(q0 . . assume that every ri is of the form Σ∗ wi . For the deﬁnition of λ. Further. this would induce an exponential blowup. Since deterministic stronglylinear schemas are a subset of stronglylinear schemas. . poly poly . a) = q0 otherwise. . 1 ≤ i ≤ n}. then w is the biggest element in S which is a suﬃx of u. if q0 ⇒A. Lemma 91. . . and δ(w. We deﬁne D such that when A is in state q after reading w. This gives us T∀ (P ) = T∃ (P ′ ). and for each a ∈ Σ. (rn . (r1 . We prove the correctness of our construction using the following lemma which can easily be proved by induction on the length of u. For any string u = a1 · · · ak . . T∀ (P ) = ′ ) and since P ′ is disjoint and complete it follows from Lemma 76 that T∀ (P T∃ (P ′ ) = T∀ (P ′ ). (rn . The most obvious way to construct A is by constructing DFAs for the vertical expressions and combining these by a product construction. We later extend the construction to also handle vertical expressions of the form wi .2 in [62]. By Lemma 78. (23). a) = a if a ∈ S. and δ(q0 .u q0 . all other results follow. Instead. δ) is deﬁned as Q = S ∪ {q0 }. we can then translate D into an equivalent singletype EDTD in polynomial time.u w. a) = q0 if no string in S is a suﬃx of wa. P∃ (DetSLin) → P∀ (DetSLin) also holds. λ) such that L(D) = T∃ (P ). However. Let P = {(r1 . λ is welldeﬁned. A = (Q. let λ(q0 ) = ∅. First. where w′ is the longest suﬃx of wa in S. Σ∗ )}. λ(w) = si if w ∈ L(ri ) and λ(w) = ∅ if w ∈ L(ri ) for all i ≤ n. for some w ∈ S. a) = w′ . . sn ). and 2. note that the regular expressions in P ′ are copies of these in P . we construct an automatonbased schema D = (A. Then. Note that since the vertical / expression are disjoint. then no suﬃx of u is in S. sn )}. P ′ is too. Given P . q0 . Σ∗ ). . if q0 ⇒A. and • for each w ∈ S. s1 ).136 Succinctness of PatternBased Schema Languages ′ ′ Then. . deﬁne P ′ = {(r1 . and for all w ∈ S. Therefore. By Lemma 90.
q0 ⇒ancstr(v) q. (4). q0 ⇒A. ′ ′ Deﬁne A = (Q. Then.(11) We give the proof for the existential semantics. with λ(w) = si . with Q = {q0 . the deﬁnition of λ again remains the same for q0 and the states ′ introduced by S. and δ(q0 . observe that the schema used in the proofs of Theorem 85(4) and (9) is strongly linear. there is some i such that ancstr(v) ∈ L(ri ).u w. a) = q0 otherwise. ′ ) = ∅ otherwise. suppose that for q such that q0 ⇒A. The transition function for q0 and the states introduced by S remains the same. So. chstr(v) ∈ L(λ(q)). Finally. that there exists some m such that for i ≤ m. chstr(v) ∈ L(si ). It follows that v ∈ P∃ (t). For a string w ∈ S ′ . a node v ∈ P∃ (t) if and only if chstr(v) ∈ L(λ(q)) for q ∈ Q such that q0 ⇒A. λ(q0 ) = ∅.ancstr(v) q. The algorithm proceeds in a number of steps. it suﬃces to prove that for any tree t. First. δ(q0 . The upper bound carries over since every strongly linear schema is also a linear schema. and switches to the normal operation of the original automaton if this is not possible anymore. ri = wi . if and only if u ∈ L(ri ). But then. Strongly Linear PatternBased Schemas 137 3. Assume w. suppose v ∈ P∃ (t). and λ(w for this extended construction and the correctness of the construction follows thereof. a) is the longest suﬃx of wa in S if it exists and wa ∈ S ′ . w ∈ S. We have now shown that the construction is correct when all expressions are of the form Σ∗ w.ancstr(v) q. Note that the elements of S and S ′ need not be disjoint. q0 ⇒A. with λ(q) = ∅. chstr(v) ∈ L(λ(q)) holds. and / To show that L(D) = T∃ (P ).7. Therefore. δ(w′ . By Theorem 89(3) this can be 4. Further. (5). For the lower bound.u q. by Lemma 91(4). q0 . for instance ab ∈ S ′ corresponds to the state ′ ′ a′ b′ . and the deﬁnition of λ. a) = a if ′ a ∈ S ′ ∧a ∈ S. for some i > m. much like a suﬃxtree. a) = a′ if a ∈ S ′ .l. we have added a subautomaton to A which starts by checking whether w = wi . By Theorem 89(7) the result carries over immediately to the universal semantics. δ(w ′ . Then. q0 } ∪ S ∪ S ′ . construct an automatonbased schema D1 such that L(D1 ) = T∃ (P ). with λ(q) = si . The previous lemma can be extended 1 ≤ i ≤ n.o. Then.(10) This follows immediately from Theorem 85(4) and (9). a) = w′ a′ / ′ . we denote the states corresponding to elements of S ′ by primes. and λ(w′ ) = ri if w ∈ L(ri ) for some i. for any symbol a ∈ Σ. and by the deﬁnition of λ. δ(q0 . a) = q0 otherwise.g. ri = Σ∗ wi and for i > m. δ). By Lemma 91(4). ancstr(v) ∈ L(ri ) and chstr(v) ∈ L(si ). for some i ≤ n. Deﬁne S = {w  w ∈ Preﬁx(wi )∧1 ≤ i ≤ m} in the same manner as above. if wa ∈ S / and δ(w′ .4. . if and only if u ∈ L(ri ). First. and S ′ = {w  w ∈ Preﬁx(wi ) ∧ m < i ≤ n}. for any i ≤ n. Conversely. We sketch the extension to the full class of strongly linear expressions.
Now. t1 and t2 must exist by Lemma 80(2). From Lemma 79 it then follows that L(D2 ) is not deﬁnable by a DTD. let ΣP = {a. consider the DTD D = (Σ.(12) We ﬁrst show that P∀ (DetSLin) → DTD and then P∃ (SLin) → DTD. T∀ (P ) / is not closed under ancestorguarded subtree exchange and by Lemma 79 is not deﬁnable by a DTD. Furthermore. Because D2 is a singletype EDTD. deﬁne t3 = µ(t2 )[u ← µ(subtreet1 (v))] which is obtained from µ(t1 ) and µ(t2 ) by labelguarded subtree exchange. Here. Σ′ . Conversely. are equivalent. ε). This shows that D2 is not closed under labelguarded / subtree exchange. Since µ(L(d(ai ))) = µ(L(d(aj ))). b. f } and P = {(a. aj ∈ Σ′ it holds that L(µ(d(ai ))) = L(µ(d(aj ))). However. First. suppose that for every pair of types ai . To show that P∃ (SLin) → DTD. / / We consider the ﬁrst case. it must assign the type aj to node u in t3 . d). Then. to show that P∀ (DetSLin) → DTD. which by Lemma 78 can again be done in ptime and also maintains the determinism of the used regular expressions. Further. we claim that L(D2 ) = T∃ (P ) is deﬁnable by a DTD if and only if for every two types ai . with µ(ai ) = a. (acd. this can be tested in polynomial time. aj ∈ Σ′ it holds that µ(d2 (ai )) = µ(d2 (aj )). (acdf. µ). aj ∈ Σ′ such that µ(L(d(ai ))) = µ(L(d(aj ))). e. Since all regular expression µ(d2 (ai )). We show that L(D2 ) is not closed under ancestorguarded subtree exchange. b + c). Since deterministic stronglylinear schemas are a subset of stronglylinear schemas and since we can translate a stronglylinear schema with universal semantics into an equivalent one with existential semantics in polynomial time (Theorem 89(7)). it does not matter which type we choose. d. (ab. there exists a string w such that w ∈ µ(L(d(ai ))) and w ∈ µ(L(d(aj ))) or w ∈ µ(L(d(ai ))) and w ∈ µ(L(d(aj ))). all other results follow. the regular expressions in D1 are copies of the horizontal expressions in P and are therefore also deterministic. (6). Finally. a. c. L(D) = L(D2 ) which shows that L(D2 ) is deﬁnable by a DTD. suppose that there exist types ai . chstrt3 (u) = w ∈ µ(L(d(aj ))) / and thus t3 ∈ L(D3 ). a(b(d)) ∈ T∀ (P ) and a(c(d(f ))) ∈ T∀ (P ) but a(b(d(f ))) ∈ T∀ (P ). let t2 ∈ L(d2 ) be a tree with some node u with labt2 (u) = aj . f ). the second is identical. we trim D2 which can be done in polynomial time by Lemma 80(1) and also preserves the determinism of the expressions in D2 . Then. s). Since D2 is trimmed. Now. Since all regular expressions in D2 are deterministic. Therefore.138 Succinctness of PatternBased Schema Languages done in polynomial time. d2 . (abd. d. Let t1 ∈ L(d2 ) be a tree with some node v with labt1 (v) = ai and chstrt1 (v) = w′ where µ(w′ ) = w. where d(a) = µ(d2 (ai )) for some ai ∈ Σ′ . Then. ε)}. translate D1 into a singletype EDTD D2 = (Σ. note that the algorithm in the above poly poly . (ac. We ﬁnally prove the above claim: First. d).
7. . However. because we have to test equivalence of regular expressions. Strongly Linear PatternBased Schemas 139 proof also works when the horizontal regular expressions are not deterministic.4. The total algorithm then becomes pspace. the DTD D is still constructed in polynomial time. which completes this proof.
.
Additionally. The most common schema languages are DTD. [25. DTDs correspond to contextfree grammars with regular expressions at righthand sides.8 Optimizing XML Schema Languages As mentioned before. e. (ii) static analysis of transformations (cf. 78]). unranked tree automata [16].. 73. it enables automation and optimization of standard tasks like (i) searching.. while Relax NG is usually abstracted by extended DTDs (EDTDs) [88] or equivalently. 109]). XML Schema [100]. In particular. Decision problems like equivalence. and.g. 87]).. [28.. deﬁning the regular unranked tree languages.. Because of their widespread applicability. Within applications or communities.g. XML Schema is usually abstracted by singletype . 56.g. the basic decision problems are fundamental for schema minimization (cf. constitute essential building blocks in solutions for the just mentioned optimization and static analysis problems. but far more importantly. 74. and processing of XML data (cf. it is therefore important to establish the exact complexity of the basic decision problems for the various XML schema languages. e.. XML data is usually not arbitrary but adheres to some structure imposed by a schema. XML is the lingua franca for data exchange on the Internet [1]. The presence of such a schema not only provides users with a global view on the anatomy of the data. inclusion and nonemptiness of intersection of schemas. integration. [5. and Relax NG [22] and can be modeled by grammar formalisms [85]. hereafter referred to as the basic decision problems. e. 69.
&) expressions (and subclasses thereof) carries over to DTDs and singletype EDTDs. in addition to the conventional regular expression operators like concatenation. union.2. which has severe implications on the complexity of the basic decision problems. As detailed in [75]. Therefore. As observed in Section 8. We illustrate these additional operators in Figure 8. Although the new operators can be expressed by the conventional regular operators.1 and 8. the complexity of inclusion and equivalence of RE(#.1. they cannot do so succinctly (see Chapter 4). For EDTDs. In Sections 8.1. this immediate correspondence does not hold anymore. &) expressions. we establish the complexity of the basic . the relationship between schema formalisms and grammars provides direct upper and lower bounds for the complexity of the basic decision problems. we revisit the simpliﬁcation problem introduced in [77] for schemas with RE(#. XSDs. The schema deﬁnes a shop that sells CDs and oﬀers a special price for boxes of 10–12 CDs. and Kleenestar.12 price artist & title & price Figure 8. Relax NG allows unordered concatenation through the interleaving operator.7. Finally. Finally. Therefore. Indeed. the XML Schema and the Relax NG speciﬁcation allow two other operators as well. This problem is deﬁned as follows: given an extended DTD. we study the complexity of the basic decision problems for EDTDs extended with counting and interleaving operators.1: A sample schema using the numerical occurrence and interleave operators. immediately give us complexity bounds for the basic decision problems for a wide range of subclasses of DTDs and singletype EDTDs. both DTD and XML Schema require expressions to be deterministic. The goal of this chapter is to study the impact of these counting and interleaving operators on the complexity of the basic decision problems for DTDs. A closer inspection of the various schema speciﬁcations reveals that the above abstractions in terms of grammars with regular expressions is too coarse. EDTDs.142 shop regular discountbox cd → → → → Optimizing XML Schema Languages regular∗ & discountbox∗ cd cd10. which is also present in XML Schema in a restricted form. can it be rewritten into an equivalent DTD or a singletype EDTD? Outline. The XML Schema speciﬁcation allows to express counting or numerical occurrence constraints which deﬁne the minimal and maximal number of times a regular construct can be repeated. the results in Chapter 5. and Relax NG. &) and Section 6. concerning (subclasses) of RE(#. concerning deterministic expressions with counting.
&). Let M further have the property that L(M A ) ⊆ L(M B ) whenever L(A) ⊆ L(B).1. Proposition 92 ([75]). we establish the complexity of the basic decision problems for DTDs and singletype EDTDs with extended regular expressions. The following two propositions have been shown to hold for standard regular expressions in [75]. conp. Let R be a subclass of RE(#. Clearly. Hence the results on the equivalence and inclusion problem of Section 5. Then L′ is also in C. and that every such class contains ptime. This is mainly done by showing that the results on extended regular expressions of Section 5. . Let L′ be accepted by a deterministic polynomialtime Turing machine M with oracle O (denoted L′ = L(M O )).3. For our purposes. their proofs carry over literally to all subclasses of RE(#. (b) inclusion for DTD(R) is in C. 8. &) and let C be a complexity class closed under positive reductions. (c) inclusion for EDTDst (R) is in C.8. We discuss simpliﬁcation in Section 8. np. pspace. The corresponding statement holds for equivalence. Then the following are equivalent: (a) inclusion for R expressions is in C. and DTD(&) are expspacecomplete. equivalence and inclusion for EDTDst (#).1 Complexity of DTDs and SingleType EDTDs In this section. and EDTDs. and expspace have this property.2 and Section 6. For instance. S The previous proposition can be generalized to intersection of DTDs as well. Complexity of DTDs and SingleType EDTDs 143 decision problems for DTDs and singletype EDTDs.2 often carry over literally to DTDs and singletype EDTDs. while these problems are in pspace for EDTDst (DRE# ).7 all carry over to DTDs and singletype EDTDs. it is suﬃcient that important complexity classes like ptime. respectively. For a more precise deﬁnition of this notion we refer the reader to [54]. EDTDst (#. any lower bound on the complexity of equivalence and inclusion of a class of regular expressions R also carries over to DTD(R) and EDTDst (R). We call a complexity class C closed under positive reductions if the following holds for every O ∈ C. &). However.
144 Optimizing XML Schema Languages Proposition 93 ([75]).&) intersection remains in exptime. d. more surprisingly. exptime for intersection. and any lower bound on intersection for R expressions carries over to DTD(R) and EDTDst (R). By using the NFA(#. (b) intersection for DTD(R) is in C. Theorem 94. As the basic decision problems are exptimecomplete for EDTD(RE). Hence. for example intersection for DTD(#. µ). and. which does not require expressions to be deterministic. First. i. intersection for EDTDst (RE) is exptimehard and in the next section we will see that even for EDTD(#. Indeed. we will denote elements of Σ′ . (1) equivalence and inclusion for EDTD(#.1. the straightforward approach of translating every RE(#. and. (1) We show that inclusion is in expspace. (3) intersection for EDTD(#. Σ′ . we introduce some notation. Then the following are equivalent: (a) intersection for R expressions is in C. we do not consider deterministic expressions as EDTDs are the abstraction of Relax NG. &) expression into an NFA and then applying the standard algorithms gives rise to a double exponential time complexity. &) introduced in Section 5. Although the above proposition does not hold for singletype EDTDs. The upper bound for equivalence then immediately follows. we can do better: expspace for inclusion and equivalence.&) and DTD(DRE# ) are pspacecomplete. by τ . For an EDTD D = (Σ. The only remaining problem is inW tersection for singletype EDTDs. s. &) and let C be a complexity class which is closed under positive reductions. we can still establish complexity bounds for them.&) are in expspace.e.2 carry over to DTDs. EDTDst (&).&) is exptimecomplete.. Proof. It immediately follows that intersection for EDTDst (#). types. Hence.&) is also exptimecomplete. Here. and EDTDst (#. Let R be a subclass of RE(#.2 Complexity of Extended DTDs We next consider the complexity of the basic decision problems for EDTDs with numerical occurrence constraints and interleaving. also the results on intersection of Section 5. (2) equivalence and inclusion for EDTD(#) and EDTD(&) are expspacehard. τ ) the . We denote by (D. 8.
Let E := Eℓ for ℓ = 2Σ1  · 2Σ2  . For every k ≥ 1. Complexity of Extended DTDs 145 EDTD D with start symbol τ . µ2 ). 2. As expspace is closed under complement. it is decidable in expspace ′ ′ .1 ) · · · (C1. Hence. bj. Step (3) can be carried out in exponential time. ∃bj. We compute the set E in a bottomup manner as follows: 1. As Ek ⊆ Ek+1 . The lemma can be proved by induction on k. then depth(t) = 0. the theorem fol′ ′ lows. C2 )  ∃a ∈ Σ. then depth(t) = max{depth(ti )  i ∈ {1. C2 ) of depth at most k.n ∈ Cj.n j with bj. C2 ) ∈ E with s1 ∈ C1 and s2 ∈ C2 and rejects otherwise. Initially. . C2 ). µ1 ) and D2 = 1 (Σ. and if t = σ(t1 · · · tn ).2. C2 ) for which ∗ there are a ∈ Σ. d2 .8. d1 . the algorithm computes the largest set of pairs. Cj is the set of states that can be assigned to the root in a run on t. For step (2). We deﬁne the depth of a tree t. τ1 ∈ Σ′ . . Ek is the union of Ek−1 and the pairs (C1 .n . Σ′ . τ2 ∈ Σ′ such that 1 2 µ1 (τ1 ) = µ2 (τ2 ) = a. s1 . Σ′ . Lemma 95. denoted by depth(t). 2. For every k > 1. . It remains to show that the algorithm can be carried out using exponential space. The algorithm computes a set E of pairs (C1 . (C1 . C2. when Ek−1 is known. Therefore. That is. we say that t is a witness for (C1 . n}} + 1. s2 ∈ C2 ). n ∈ N and a string (C1.1 ∈ Cj.1 · · · bj. C2 ) ∈ Ek if and only if there exists a witness tree for (C1 . We argue that the algorithm is correct. . . for each j = 1. C2 ) ∈ E if and only if there exists a tree t such that Cj = {τ ∈ Σ′  t ∈ j L((Dj . The following lemma then shows that the algorithm decides whether L(D1 ) ⊆ L(D2 ). it follows that Eℓ = Eℓ+1 . Suppose that we have two EDTDs D1 = (Σ. C2.1 . τ ))} for each j = 1. Step (1) reduces to a linear number of tests ε ∈ L(r). set E1 := {(C1 .1 . s2 . Or when viewing Dj as a tree automaton. Hence. since the size of E is exponential in the input. L(D1 ) ⊆ L(D2 ) if and only if there exists a pair (C1 . We provide an expspace algorithm that decides whether 2 L(D1 ) ⊆ L(D2 ). . as follows: if t = ε. C2 ) ∈ E with s1 ∈ C1 and s2 ∈ C2 . Notice that t ∈ L(D1 ) (respectively. &) expressions r which is in ptime by [64].n ∈ dj (τ )}. The algorithm then accepts when there is a pair (C1 . t ∈ L(D2 )) if s1 ∈ C1 (respectively. it suﬃces to argue that. and for i = 1. . every Cj is the set of types that can be assigned by Dj to the root of t. Ci = {τ ∈ Σ′  ε ∈ di (τ ) ∧ µi (τ ) = i a}}. for some RE(#. 2. 2. C2 ) ∈ 2Σ1 × 2Σ2 where (C1 . .n ) in Ek−1 such that Cj = {τ ∈ Σ′  µj (τ ) = a. for every k.
n ) in Ek−1 such that for each j = 1. Di = (Σ. we guess the string W one symbol at a time and compute the set of reachable conﬁgurations Γτ for each N (τ ).n . Intuitively.&). n}. We assume w. .n with bj. (B) for every τ ∈ Σ′ \ Cj .m . C2 ) is in Ek . and. .n ∈ Cj. Then. si . i The algorithm proceeds as follows: . &) accepting dj (τ ).g.m .m−1 . Suppose that we have guessed a preﬁx (C1. for each i ∈ {1. 2. C2. n}. bj. Initially. . τp . for each j = 1. let N (τ ) denote an NFA(#. &) for di (τ ). ′ Assume that Σ′ ∩ Σ2 = ∅.n ∈ dj (τ ). In the following. this proves the theorem. We accept (C1 . Σ′ .1 .l. . . we assume that no derivation tree consists of only the root. We guess t node by node in a topdown manner. N (τ ) be the 1 j NFA(#. According i to Theorem 42.1 ∈ Cj.m ). the following information is written on the tape of the Turing machine: for i i i every i ∈ {1.1 · · · bj. Γτ is the singleton set containing the initial conﬁguration of N (τ ). we need to verify that ∗ there exists a string W = (C1. . For every guessed node v.1 ) · · · (C1.146 Optimizing XML Schema Languages whether a pair (C1 .n ∈ dj (τ ).1 ∈ Cj. . respectively. di . let. τp i i current conﬁguration N (τp ) is in after reading the string formed by the left siblings of v.b γ ′ } and set Γτ to Γ′ . C2. the triple ci = (τv . µi ) be an i EDTD(#. We provide an alternating polynomial space algorithm that guesses a tree t and accepts if t ∈ L(D1 ) ∩ · · · ∩ L(Dn ).1 · · · bj.n ∈ Cj. (A) for every τ ∈ Cj . γ i ) where τv is the type assigned i is the type of the parent assigned by D .1 . As there are only an exponential number of such possible pairs. .1 . N (τ ) can be computed from di (τ ) in polynomial time. τ ∈ Cj if and only if Γτ contains an accepting conﬁguration.o. we say that τ ∈ Σ′ is an atype when µi (τ ) = a. Thereto. C2. C2. and γ i is the to v by grammar Di . . C2. there exist bj. equivalence and inclusion are also expspacehard for EDTD(&) and EDTD(#). To this end. Hence. . C2 ) when τ for every τ ∈ Σ′ .1 . γ ∈ Γτ such that γ ⇒N (τ ). Each τ τ set Γ′ can be computed in exponential space from Γτ . For each type τ ∈ Σ′ .n with j bj. that the sets Σ′ are pairwise disjoint. We argue that the problem is in exptime. (3) The lower bound follows from [99]. bj. . As apspace = exptime [19]. j (2) It is shown by Mayer and Stockmeyer [79] and Meyer and Stockmeyer [83] that equivalence and inclusion are expspacehard for RE(&)s and RE(#). . . . we compute the set Γ′ = {γ ′  ∃b ∈ Cj. there do not exist bj. . .m−1 ) of W and that we guess a new symbol (C1. 2 and τ ∈ Σ′ .1 ) · · · (C1. We i also assume that the start type si never appears at the righthand side of a rule. Finally. Let. the result follows.
. Simpliﬁcation 147 1. t ∈ n L(Di ). compute a conﬁguration γ ′i for which γ i ⇒N (τp ). we have ′ guessed a tree t and. . and for each i i ∈ {1. The algorithm obviously uses only polynomial space. guess an a ∈ Σ and for each i. We argue that the algorithm is correct. n}. Replace every ci by the triple (θi . . . let ci = (τ i . on the tape where γs i i 2. . i. γs ) i is the start conﬁguration of N (s ).τ i γ ′i . τp . When every γ ′i is a i ﬁnal conﬁguration. τ i . Otherwise. For i ∈ {1. γs is the start conﬁguration of N (τ i ). . &). As for each grammar the types of the roots are given. then we do not need to extend to the right anymore and the algorithm accepts. guess an atype θi . . ε ∈ di (τ i ) then the current node can be a leaf node and the branch accepts. guess an atype θi . γ i ) be the triples on the tape. guess an a ∈ Σ and for each i. suppose i i=1 that there exists a tree t ∈ n L(Di ) and t is minimal in the sense that no i=1 n subtree t0 of t is in i=1 L(Di ). n}. . We revisit this problem in the context of RE(#. Whenever for two trees t1 . For the other direction. we start by guessing the ﬁrst child of the root. we study the simpliﬁcation problem a bit more broadly than before. Replace every i i ci by the triple (θi . n}. We recall the following theorem from [77]: . we guess an a ∈ Σ. . That is.3. We say that a tree language L is closed under ancestorguarded subtree exchange if the following holds. The algorithm now universally splits into two parallel branches as follows: (a) Downward extension: When for every i. γ ′i ) and proceed to step (2). there is a run of the above algorithm that guesses t and guesses trees t′ with µi (t′ ) = t. In [77]. this problem was shown to be exptimecomplete for EDTDs with standard regular expressions.3 Simpliﬁcation In this section. Given an EDTD.8. If the algorithm accepts. Here. τ i . Otherwise. . Then. 8. we are interested in knowing whether it has an equivalent EDTD of a restricted type. a tree t′ with µi (ti ) = t and i t′ ∈ L(di ). We need a bit of terminology. . .. si . Therefore. . γs ) and proceed to step (2). ancstrt1 (u1 ) = ancstrt2 (u2 ) implies t1 [u1 ← subtreet2 (u2 )] ∈ L. . The tree t must be minimal i i since the algorithm stops extending the tree as soon as possible. n}. . (b) Extension to the right: For every i ∈ {1. . for every i ∈ {1. we guess an atype τ i and write the triple ci = (τ i .e. an equivalent DTD or singletype EDTD. t2 ∈ L with nodes u1 ∈ Dom(t1 ) and u2 ∈ Dom(t2 ).
&) or DTD(#. µ) as the EDTD with the following rules: s → b1 b2 b1 → r1 b2 → r2 . Then the following conditions are equivalent. for j = 1. The following algorithms are along the same lines as the EXPTIME algorithms in [77] for the simpliﬁcation problem without numerical occurrence or interleaving operators. By deﬁnition L(rj ) = ∅. Let r1 . Conversely. b2 }. the second is identical. and µ(b1 ) = µ(b2 ) = b. (b) L is closed under ancestorguarded subtree exchange. 2. Proof. However. assume that r1 is not equivalent to r2 . Let L be a tree language deﬁned by an EDTD. Σ ∪ {s. then D is equivalent to the DTD (and therefore also to a singletype EDTD) s → bb b → r1 .&).148 Optimizing XML Schema Languages Theorem 96 (Theorem 7. d. every tree language deﬁned by a singletype EDTD is closed under such an exchange of subtrees. We are now ready for the following theorem. this means that L(D) cannot be deﬁned by an EDTDst . So. s}. suppose that there exists an EDTDst which deﬁnes the language L(D). Now. r2 be RE(#) expressions over Σ and let b and s be two symbols not occurring in Σ. According to Theorem 96. Towards a contradiction. We ﬁrst give an expspace algorithm which decides whether an EDTD is equivalent to an .1 in [77]). let w2 be a string in L(r2 ) and consider the tree t = s(b(w1 )b(w2 )). b1 . Given an EDTD(#. We ﬁrst show that the problem is hard for expspace. the tree t′ = s(b(w2 )b(w1 )) obtained from t by switching its left and right subtree is not in L(D). there exists a string w1 such that w1 ∈ L(r1 ) and w1 ∈ L(r2 ). Theorem 97. We now proceed with the upper bounds. or w1 ∈ L(r1 ) / / and w1 ∈ L(r2 ). if r1 is equivalent to r2 . Deﬁne D = (Σ ∪ {b. So. Clearly. µ(τ ) = τ . We use a reduction from equivalence of RE(#). We only consider the ﬁrst case. We claim that D is equivalent to a singletype DTD or a DTD if and only if L(r1 ) = L(r2 ). Clearly. s. which is expspacecomplete [83]. which leads to the desired contradiction. deciding whether it is equivalent to an EDTDst (#. where for every τ ∈ Σ ∪ {s}.&) is expspacecomplete. (a) L is deﬁnable by a singletype EDTD. t is in L(D).
0 For each a ∈ Σ. (2) For every b ∈ Σ. that. We ﬁrst show how D0 can be constructed. ′ We show that properties (a)–(d) hold. a). 0 (b) the RE(#. Intuitively. each types(D. we immediately 0 have that (a) holds. (d) L(D0 ) = L(D) ⇔ D is equivalent to a EDTDst . (c) L(D) ⊆ L(D0 ). &) expressions in the deﬁnition of d. the algorithm runs in time exponential in the size of D. Clearly. for which there is a tree t and a tree t′ ∈ L(d) with µ(t′ ) = t. Then. s. 0 0 0 For every τ ∈ types(D. &) expressions in the deﬁnition of d0 are only polynomially larger than the RE(#. d0 . Let D = (Σ. We next deﬁne D0 = (Σ. &) expressions in D0 is at . Types(c) := {c1 } and R := {{c1 }}. Hence. We can assume w. a) is ﬁnite. µ0 ). set W := {c}. we compute an EDTDst D0 = (Σ. Σ′ . s. and. In d0 . every useless type can be removed from D in a simple preprocessing step. a) be the set of all nonempty sets types(wa). The EDTDst D0 has the following properties: (a) Σ′ is in general exponentially larger than Σ′ . let Types(wab) contain all bi for which there exists an aj in Types(wa) and a string in d(aj ) containing bi . d. d0 . We show how to compute types(wa) in exponential time. Note that s ∈ Σ′ . If Types(wab) is not empty and not already in R. with each bj in d(ai ) replaced by types(wab). set µ0 (τ ) = a. and a node v in t such that ancstrt (v) = wa and the type of v in t′ is ai .g. Repeat the following until W becomes empty: (1) Remove a string wa from W . let types(D. Its set of types is Σ′ := a∈Σ types(D. Σ′ . the size of the RE(#. s. µ0 ) which is the closure of D under the singletype 0 property. but have types in ′ 2Σ rather than in Σ′ . µ) be an EDTD. The RE(#. Let s = c1 . a). Since Σ′ ⊆ 2Σ . &) expressions that we constructed in D0 are unions of a linear number of RE(#. there exists a tree t′ ∈ L(d) such that ai is a label in t′ . Σ′ . To this end. Hence. for a string w ∈ Σ∗ and a ∈ Σ let types(wa) be the set of all types ai ∈ Σ′ . Indeed. D is equivalent to an EDTDst if and only if L(D0 ) ⊆ L(D). for each type ai ∈ Σ′ .o. Since we add every set only once to R. Simpliﬁcation 149 EDTDst . &) expressions in D. the righthand side of the rule for each types(wa) is the disjunction of all d(ai ) for ai ∈ types(wa).8. Initially. Moreover. we enumerate all sets types(w).3. and that R = Σ′ . then add it to R and add wab to W . with w ∈ Σ∗ . we have that Types(w) = types(w) for every w.l.
sd ) which is equivalent to D if and only if L(D) is deﬁnable by a DTD.i be the expression obtained from d(ai ) by replacing each symbol bj in d(ai ) by b. d0 . µ). where c1 is an element of C1 . A direct application of the expspace algorithm in Theorem 94(1) leads to a 2expspace algorithm to test whether L(D0 ) ⊆ L(D). s. C2 ). τ ))}. By Theorem 94(1) and since d0 is of size polynomial in the size of D. However. µ) is equivalent to a DTD. we can use this adaption to test whether L(D0 ) ⊆ L(D) in expspace. s0 . let for each ai ∈ Σ′ . Indeed. the algorithm remembers. we give the algorithm which decides whether an EDTD D = (Σ. It then accepts if there exists such a pair (C1 . s. Thereto. it is shown in [77] that L(D) = L(d0 ) if and only if L(D) is deﬁnable by a DTD. Σ′ .150 Optimizing XML Schema Languages most quadratic in the size of D. when we use nondeterminism. given the EDTDs D0 = (Σ. ra. rather than the entire set. due to the computation of C1 . notice that it is not necessary to compute the entire set C1 . d.i . d0 . Again. Since nexpspace = expspace. d. C2 ) such that 0 there exists a tree t with C1 = {τ ∈ Σ′  t ∈ L((D0 . as we only test whether there exist elements in C1 in the entire course of the algorithm. Σ′ . Σ′ . We compute a DTD (Σ. . Finally.1 in [77] that (c) and (d) also hold. It remains to argue that it can be decided in expspace that L(D0 ) ⊆ L(D). this can be tested in expspace. we can adapt the algorithm to compute pairs (c1 . For every a ∈ Σ. Finally. Indeed. τ ))} and C2 = {τ ∈ 0 Σ′  t ∈ L((D. deﬁne d0 (a) = ai ∈Σ′ ra. C2 ) with s0 ∈ C1 and s ∈ C2 . all possible pairs (C1 . we note that it has been shown in Theorem 7. µ0 ) and D = (Σ.
Sadly this notion of unambiguous parsing is a semantic rather than a syntactic one. Unfortunately. the Unique Particle Attribution (UPA) constraint. That is. the XML Schema speciﬁcation mentions the following deﬁnition of UPA: A content model must be formed such that during validation of an element information item sequence. is a highly nontransparent one.f. [103]).9 Handling NonDeterministic Regular Expressions At several places already in this thesis we have discussed the requirement in DTD and XML Schema of regular expressions to be deterministic. as determinism is called in XML Schema. indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item. the UPA constraint is usually explained in terms of a simple example rather than by means of a clear syntactical deﬁnition. The latter is not surprising as to date there is no known easy syntax for deterministic regular expressions. a predecessor of XML. there are no simple rules a user can . Speciﬁcally. The sole motivation for this restriction is backward compatibility with SGML. making it diﬃcult for designers to interpret. where it was introduced for the reason of fast unambiguous parsing of content models (without lookahead) [103]. and without any information about the items in the remainder of the sequence. the particle component contained directly. [italics mine] In most books (c.
they provide an algorithm. Indeed. bkwdec runs in time quadratic in the size of the minimal deterministic ﬁnite automaton and therefore in time exponential in the size of the regular expression. it is very diﬃcult for nonexpert1 users to grasp the source of the error and almost impossible to rewrite the content model into an admissible one. that we call bkwdec . one or several content models are rejected by the schema checker on account of being nondeterministic. Br¨ggemannKlein and Wood [17] provide u an algorithm. are entirely diﬀerent matters. our algorithm can serve as a plugin for any model management tool which supports export to XML Schema format [6]. the corresponding minimal DFA is quite small and far from the theoretical worstcase exponential size increase.152 Handling NonDeterministic Regular Expressions apply to deﬁne only (and all) deterministic regular expressions. Deterministic regular expressions were investigated in a seminal paper by Br¨ggemannKlein and Wood [17]. Deciding existence of an equivalent deterministic regular expression or effectively constructing one. In this chapter. for many expressions. As this number is very small for the far majority of realworld regular expressions [9]. while the decision problem is in exptime. applying bkwdec in practice should never be a problem. It turns out that. The purpose of the present chapter is to investigate methods for transforming nondeterministic expressions into concise and readable deterministic ones deﬁning either the same language or constituting good approximations. We tested bkwdec on a large and diverse set of regular expressions and observed that it runs very fast (under 200ms for expressions with 50 alphabet symbols). the task of designing an XSD is relieved from the burden of the UPA restriction and the user can focus on designing an accurate schema. when after the schema design process. In addition. We prove in this chapter that the problem is hard for pspace thereby eliminating much of the hope for a theoretically tractable algorithm. The ﬁrst exponential size increase stems from creating the minimal deterministic automaton Ar equivalent to the 1 In formal language theory. which constructs deterministic regular expressions whose size can be double exponential in the size of the original expression. we measure the size of an expression as the total number of occurrences of alphabet symbols. So. In addition. They show that deciding whether a given u regular expression is deterministic can be done in quadratic time. which we will call bkw. We propose the algorithm supac (Supportive UPA Checker) which can be incorporated in a responsive XSD tester which in addition to rejecting XSDs violating UPA also suggests plausible alternatives. Consequently. to decide whether a regular language can be represented by a deterministic regular expression. we observe that bkwdec is ﬁxedparameter tractable in the maximal number of occurrences of the same alphabet symbol. In addition. .
Next. In our experimental study. called shrink. as detailed in the experiments section. until a regular language is obtained with a corresponding concise deterministic regular expression. In this respect. Using this measure we can compare the quality of diﬀerent approximations. We deﬁne an optimized version of bkw. The third algorithm. The second one stems from translating the automaton into an expression. every deterministic regular expression r can be translated in a Glushkov automaton with as many states as there are alphabet symbols in r. grow tries to extend Ar such that it becomes Glushkov. we show in which situation which of the algorithms works best. its success rate is inversely proportional to the size of the input expression (we refer to Section 9. To overcome this. we focus on the case when no equivalent deterministic regular expression can be constructed for a given nondeterministic regular expression and an adequate superapproximation is needed. That is. thereby generalizing the language. Our experiments show that when grow succeeds in ﬁnding a small equivalent deterministic expression its size is always roughly that of the input expression. then there is a deterministic expression s′ with L(r) L(s′ ) L(s). we again make use of grow. Although it is unclear whether this double exponential size increase can be avoided. we use the existing algorithm rewrite [9]. Unfortunately. For instance. Nevertheless. we propose the algorithm grow. it is greatly superior to bkw and bkwopt. merges states. which optimizes the second step in the algorithm and can produce exponentially smaller expressions than bkw. The idea underlying this algorithm is that small deterministic regular expressions correspond to small Glushkov automata [17]: indeed. respectively. We consider three algorithms. whenever L(r) L(s). Therefore.4. examples are known for which a single exponential blowup is necessary [17]. The second algorithm operates like the ﬁrst one but utilizes grow rather than bkw to generate the corresponding deterministic regular expression. For the latter. The ﬁrst one is an algorithm of Ahonen [3] which essentially repairs bkw by adding edges to the minimal DFA whenever it gets stuck. We start with a fairly negative result: we show that there is no smallest superapproximation of a regular expression r within the class of deterministic regular expressions. bkw and bkwopt generate equivalent deterministic expressions of average size 1577 and 394. We therefore measure the proximity between r and s relative to the strings up to a ﬁxed length. To translate the Glushkov automaton into an equivalent regular expression. and s is deterministic. for input expressions of size 15. when the minimal automaton Ar is not Glushkov.2 for details). and of the algorithm koatokore of [8] which transforms automata to concise regular expressions. called bkwopt. .153 given nondeterministic regular expression r. the obtained expressions can still be very large.
Although XML is accepted as the de facto standard for data exchange on the Internet and XML Schema is widely used. 9. In the context of streaming the notion of determinism and kdeterminism was used in [68] and [20]. and Schwentick [75]. no such (eﬃcient) algorithms exist for deterministic regular expressions. Related Work. Deciding determinism of expressions containing numerical occurrences was studied by Kilpel¨inen and Tuhkanen [66]. The complexity of syntactic suba classes of the deterministic regular expressions with counting has also been considered [43. we discuss the construction of equivalent and superapproximations of regular expressions. fairly little attention has been devoted to the study of deterministic regular expressions. 11. for which minimization is in np. Outline. for which minimization is pspacecomplete. we propose the algorithm supac (supportive UPA checker) for handling nondeterministic regular expressions. We present an experimental validation of our algorithms in Section 9.5. and the algorithm koatokore [8. based on the experimental assessment. Computational and structural properties were addressed by Martens. 44]. respectively. Bex et al.154 Handling NonDeterministic Regular Expressions Finally. Neven. . supac makes use of several of the aforementioned algorithms and can be incorporated in a responsive XSD checker to automatically deal with the UPA constraint. investigated algorithms for the inference of regular expressions from a sample of strings in the context of DTDs and XML Schemas [8.2 and 9. We outline a supportive UPA checker in Section 9. we discuss the complexity of deciding determinism of the underlying regular language. 9] which operates as rewrite with the diﬀerence that it always returns a concise expression at the expense of generalizing the language when no equivalent concise expression exists. In Section 9. The outline of the chapter is as follows.3. From this investigation resulted two algorithms: rewrite [9] which transforms automata with n states to equivalent expressions with n alphabet symbols and fails when no such expression exists. To the best of our knowledge. The presented work would clearly beneﬁt from algorithms for regular expression minimization. In Section 9. They show that testing nonemptiness of an arbitrary number of intersections of deterministic regular expressions is pspacecomplete. Only a sound and complete rewriting system is available for general regular expressions [95]. We already mentioned the seminal paper by BruggemannKlein and Wood [17].1. 12].4.
if there does exist a valid corridor tiling for T . L(r) = Σ∗ \ {w1 #w2  w1 encodes a valid tiling for T and w2 ∈ Σ∗ Σa. there exists exactly one. However. Proof. Before giving the actual deﬁnition of r. that is. let Σb denote Σ \ {b}. Then. the algorithm bkwdec decides in time quadratic in the size of the minimal DFA corresponding to r whether L(r) is deterministic. We encode corridor tilings by a string in which the diﬀerent rows are separated by the symbol $. R1 is the bottom row and Rn is the top row. First.1. Deciding Determinism 155 9.l. Conversely. We show that. we restrict ourselves to those tiling instances for which there exists at most one correct corridor tiling. it follows that the number of correct tilings of the constructed tiling system is precisely the number of accepting runs of the Turing Machine on its input word. We reduce from the corridor tiling problem.1 Deciding Determinism The ﬁrst step in creating a responsive UPA checker is testing whether L(r) is deterministic. A DFA for L(r) . introduced in Section 5. that the input instance of corridor tiling has at most one correct corridor tiling. $. Then.g. Moreover. Given a regular expression r. let T be a tiling instance for which there exists at most one correct tiling. then by our assumption. such that L(r) is deterministic if and only if there does not exist a corridor tiling for T . Br¨ggemannKlein and Wood obtained an exptime algorithm u (in the size of the regular expression) which we will refer to as bkwdec : Theorem 98 ([17]). we can assume w. unless pspace = ptime. we give the language it will deﬁne and show this is indeed deterministic if and only if corridor tiling for T is false. We construct a regular expression r. Given a regular expression r. if there does not exist a valid tiling for T .1. by strings of the form $R1 $R2 $ · · · $Rm $ in which each Ri represents a row and is therefore in X n .9. let Σ = X ⊎ {a. Now. there is no hope for a theoretically tractable algorithm. As the acceptance problem for polynomial space bounded Turing Machines is already pspacecomplete for deterministic machines. then L(r) = Σ∗ # and thus L(r) is deterministic.o. Theorem 99. the problem of deciding whether L(r) is deterministic is pspacehard. Notice that we can assume this without loss of generality: From the master reduction from Turing Machine acceptance to corridor tiling in [104].# Σ# }. #} and for a symbol b ∈ Σ.
where it sees that the gates in the orbit consisting of the three rightmost states in Figure 9. this minimal DFA does not satisfy the orbit property.n+1] Notice that r itself does not have to be deterministic. for every x1 . • Σ∗ x1 Σn x2 Σ∗ #Σ∗ . • Σ∗ $Σ$ $Σ∗ #Σ∗ + Σ∗ $Σ$ Σ∗ $Σ∗ #Σ∗ : This expression detects $ all string in which a row in the tiling encoding is too short or too long. x2 ) ∈ V : These expressions / detect all violations of vertical constraints in the tiling encoding. x2 ∈ X. x2 ∈ X. Then. Algorithm 4 gets immediately stuck in line 17. • Σi+1 Σbi Σ∗ #Σ∗ for every 1 ≤ i < n: These expressions detect all tilings which do not have b as the bottom row in the tiling encoding. x2 ) ∈ H: These expressions / detect all violations of horizontal constraints in the tiling encoding. it is easily veriﬁed that L(r) is deﬁned correctly. (x1 . Notice that this DFA is the minimal DFA if and only if w1 exists. Our regular expression r now consists of the disjunction of the following regular expressions:2 • Σ∗ #Σ∗ #Σ∗ + Σ∗ : This expression detects strings that do not have ex# actly one occurrence of #. • Σ$ Σ∗ + Σ∗ Σ$ #Σ∗ : This expression detects strings that do not have a $sign as the ﬁrst or last element of their encoding. (x1 . where an additional parameter k is extracted from the input r. It is unclear whether the problem itself is in pspace. it is easily seen that L(r) is not deterministic. Next. • Σ∗ #Σ?: This expression detects strings that have # as last or second to last symbol. Hence. By applying the algorithm of Br¨ggemannu Klein and Wood (Algorithm 4). • Σ∗ aΣ∗ #Σ∗ : This expression detects strings that have an a before the #sign. Indeed.156 Handling NonDeterministic Regular Expressions is graphically illustrated in Figure 9. we say that a problem is ﬁxed parameter tractable if there exists a computable function f and a polynomial g such that the problem can be solved in 2 [0.n−1] [n+1. • Σ∗ Σ∗ aΣ: This expression detects strings that have a as second to last # symbol. . • Σ∗ x1 x2 Σ∗ #Σ∗ . Simply guessing a deterministic regular expression s and testing equivalence with r does not work as the size of s can be exponential in the size of r (see also Theorem 102). we address the problem from the viewpoint of parameterized complexity [34].1 are not all ﬁnal states. • Σ∗ Σti Σn−i #Σ∗ for every 1 ≤ i < n: These expressions detect all tilings which do not have t as the top row in the tiling encoding.1. Finally. for every x1 .
For example. By Theorem 98. For any n ∈ N. Let r be a kORE. 2. Corollary 101. It has already often been observed that the vast majority of regular expressions occurring in practice are kOREs. [17]). the result follows. the problem of deciding whether the language deﬁned by a kORE is deterministic is in ptime.9. The problem of deciding whether L(r) is deterministic is ﬁxed parameter tractable with parameter k. Constructing Deterministic Expressions 157 a start Σ$ Σ w1 Σ# # # # # Σa. This result is not only of theoretical interest. followed by a subset construction. it is easy to see that the resulting DFA has at most 2k · Σ states. . Unfortunately.g. [77]). the problem is eﬃciently solvable. its 2 complexity is O(2k r2 ).. there exists a regular expression rn of size O(n) such that any deterministic regular expression deﬁning L(rn ) is of size at least 2n .g.. ab(a∗ + c) is a 2ORE because a occurs twice. For any ﬁxed k. time f (k) · g(r). e.2.# a Σ# Σa. this implies that. Let r be a kORE. if k is small and f is reasonable. Speciﬁcally. we focus on constructing equivalent deterministic regular expressions. this result implies that in practice the problem can be solved in polynomial time. 3 (see.1: DFA for L(r) in the proof of Theorem 99. Proposition 100. Intuitively.2 Constructing Deterministic Expressions Next. 9.# Figure 9. By applying a Glushkov construction (see. Hence. Proof. for k = 1. the following result by Br¨ggemannKlein and Wood already u rules out a truly eﬃcient conversion algorithm: Theorem 102 ([17]). e. We now say that an expression r is a koccurrence regular expression (kORE) if every alphabet symbol occurs at most k times in r.
we always denote its set of states by QA . The minimization algorithm eliminates some of these states. initial state by qA .2 Enumerating Automata The grow algorithm described above uses an enumeration algorithm as a subroutine. Therefore. given a minimal DFA M and a size k. In this section. complicating the inverse Glushkovrewriting from DFA to deterministic regular expression. The algorithm rewrite of [9] succeeds when the modiﬁed automaton can be obtained from a deterministic regular expression by the Glushkov construction and assembles this expression upon success. Require: P . contains no useless transitions. transitions by δA . Ensure: Det. and set of ﬁnal states by FA . By expanding the minimal automaton. Throughout this section. if successful for i = 0 to E do 2: Generate at most P nonisomorphic DFAs B s. The idea underlying this algorithm is that the Glushkov construction [17] transforms small deterministic regular expressions to small deterministic automata with as many states as there are alphabet symbols in the expression. q0 .t. E ∈ N.1 Growing automata We ﬁrst present grow as Algorithm 2. eﬃciently enumerates all nonisomorphic DFAs equivalent to M of size at most k.2. we . reg.. grow tries to recuperate the eliminated states. the number of generated nonisomorphic DFAs can explode quickly. However. Nonwithstanding the harsh brute force ﬂavor of grow. which is designed to produce concise deterministic expressions. As their are many DFAs equivalent to the given automaton A.158 Handling NonDeterministic Regular Expressions Algorithm 2 Algorithm grow. Σ. we assume that any DFA is trimmed. δ.2. F ) with L(A) deterministic. we ﬁrst provide some deﬁnitions. minimal DFA A = (Q. i. Thereto. with pool size P and expansion size E.e. we show in our experimental study that the algorithm can be quite eﬀective. s with L(s) = L(A). 9. For a DFA A. exp. we only enumerate nonisomorphic expansions of A up to a given number of extra states E. the algorithm is also given a given pool size P which restricts the number of DFAs of each size which are generated. we describe this algorithm which. L(B) = L(A) and B has Q + i states 4: for each such B do if rewrite (B) succeeds then 6: return rewrite (B) if i > E then fail 9. Finally.
deﬁnes f (q) for all q to which qA has an outgoing transition.2. then L(M q ) = L(M p ). p ∈ QA and a ∈ Σ. • For all q. we deﬁne a function bf : QA → [1. Let A and B be DFAs. Hence. = Now. the goal of this section reduces to give an algorithm which computes the set S. If A is a DFA. S contains exactly one isomorphic copy of any DFA A equivalent to M of size at most k. . if q = p. denoted A ∼ B. Then.9. we ﬁrst need some insight in the connection between a minimal DFA and its larger equivalent versions. Then. lexicographic ordering). because qA has at most one outgoing transition for each alphabet symbol this. A is canonical}. Indeed. recall that Aq denotes the DFA A in which q is the initial state. A is canonical if and only if the names of its states correspond to their number in the breadthﬁrst traversal of A. Then. = if there exists a bijection f : QA → QB . we ﬁx an order on the alphabet symbols (say. such that (1) f (qA ) = qB . then Q = {0. if Q = n. starting at qA . by the above lemma. A = B if and only if A ∼ B. Notice that. Let A. To do so eﬃciently. = 2. and q ∈ QA . B be canonical DFAs. although there it is presented in terms of a string representation. a. We then also say that f witnesses A ∼ B. That is. there exists a canonical DFA B such that A ∼ B. there is at most one such bijection which. Lemma 104. and k an integer and deﬁne S = {A  L(A) = L(M ). Then. A and B are isomorphic. and (3) for all q. . QA  ≤ k. n − 1}. The following lemma then states a few basic facts about minimal DFAs (see. . That is. p) ∈ QA if and only if (f (q). for all i ∈ QA (recall that the state labels are always an initial part of the natural numbers). can easily be constructed. .g. Constructing Deterministic Expressions 159 assume that the set of states Q of any DFA consists of an initial segment of the integers. we must set f (qA ) = qB . f (p)) ∈ QB . because we are working = with DFAs. we have q ∈ FA if and only if f (q) ∈ FB . Lemma 103. 1. This ensures that bf is unambiguously deﬁned. let M be a minimal DFA. The following simple lemma is implicit in their paper. and M its equivalent minimal DFA. and require that in the breadthﬁrst traversal of the DFA transitions are followed in the order of the symbol by which they are labeled. it holds that (q. . moreover. QA ] which assigns to every state q ∈ QA its number in a breadthﬁrst traversal of A. a. By continuing recursively in this manner. Then. [55]). p ∈ QM . Here. in turn. Let A be a DFA. We then say that A is canonical if bf(i) = i.. given M and k. one obtains the unique bijection witnessing A ∼ B = Further. Let A be a DFA. e. (2) for all q ∈ QA . The latter notion of canonical DFAs essentially comes from [4].
for any p′ ∈ QM . and. The reason that A is canonical is because the enumeration explicitly follows the breadthﬁrst traversal of the automaton it is generating. enumerate follows all possible breadthﬁrst paths generating such DFAs. p′ ) ∈ δM . which quarantees that never more than k states will be generated. Let us ﬁrst show that enumerate only produces automata in this set S. again to copies of the appropriate target states in M . when L(Aq ) = L(M p ). the set S satisﬁes the criteria of the theorem. Given this information.160 Handling NonDeterministic Regular Expressions • For all q ∈ QA . there is at least one copy in A. furthermore. there exists a (unique) p ∈ QM . A is canonical}. Theorem 105. we maintain the corresponding state minimal(q) ∈ QM and give q outgoing transitions to copies of the states where minimal(q) also has transitions to. . Finally. We also say that q is a copy of p. and hence also generates A. It can easily be modiﬁed. then there exists a q ′ ∈ QA . Using this knowledge. this function is welldeﬁned. any q ∈ QA is a copy of minimal(q) ∈ QM . Given a minimal DFA M . According to the above lemma. if (p. QA  < k. and is canonical. This guarantees that for all q ∈ QA . in particular L(A) = L(Aqa ) = L(M qM ) = L(M ). and hence the desired set. there is a B ′ ∈ S such that B ∼ B ′ . using recursion. the set S of all automata returned by enumerate. and k ≥ QM . we must show that any automaton in S is generated by enumerate. Algorithm 3 enumerates the desired automata. for any DFA B of size at most k and equivalent to M . such that minimal(q ′ ) = p′ . enumerate returns the set S = {A  QA  < k. although there can be more. for each state p ∈ QM . and a ∈ Σ. Hence. to an algorithm outputting all these automata. we have L(Aq ) = L(M minimal(q) ). i. any canonical A equivalent to M with at most k states is generated by enumerate. Further. any DFA A generated by enumerate has QA  < k. Further. a. It is equivalent to M because for any state q ∈ QA . consists of pairwise nonisomorphic DFAs of size at most k. Due to Lemma 103. Conversely.e. is equivalent to M . = Proof. a. q ′ ) ∈ QA .e. For ease of exposition. and (q. L(A) = L(M ). Let A be such a DFA. such that L(Aq ) = L(M p ). i. q’s outgoing transitions are copies of those of minimal(q). and. Let minimal : QA → QM be the function which takes the state q ∈ QA to p ∈ QM . due to the test on Line 12. By Lemma 104 we know it consists of a number of copies of each state of M which are linked to each other in a manner consistent with M . it is presented as a nondeterministic algorithm in which each trace only outputs one DFA. That is. and k ≥ QM . We ﬁrst argue that given M .
9.2. Constructing Deterministic Expressions
161
Algorithm 3 Algorithm enumerate. Require: Minimal DFA M = (QM , δM , qM , FM ), and k ≥ QM  Ensure: Canonical DFA A, with L(A) = L(M ) and QA  ≤ k. QA , δA , FA ← ∅ 2: Queue P ← ∅ qA ← 0 4: Push qA onto P QA ← QA ⊎ {qA } 6: Set minimal(qA ) = qM while P = ∅ do 8: q ← P .Pop p ← minimal(q) for a ∈ Σ do 10: if ∃p′ ∈ QM : (p, a, p′ ) ∈ δM then 12: Choose q ′ nondeterministically such that either q ′ = QA  (if QA  < k), or such that q ′ < QA  and minimal(q ′ ) = p′ . If no such q ′ exists: fail. 14: if q ′ = QA  then Push q ′ onto P 16: Set minimal(q ′ ) = p′ QA ← QA ⊎ q ′ 18: δA ← δA ⊎ (q, a, q ′ ) if p′ ∈ FM then 20: FA ← FA ∪ q ′ return A We conclude by noting that, by using a careful implementation, the algorithm actually runs in time linear in the total size of the output. Thereto, the algorithm should never fail, but this can be accomplished by maintaining some additional information (concerning the number of states which must still be generated) which allows to avoid nondeterministic choices which will lead to failure.
9.2.3
Optimizing the BKWAlgorithm
Next, we discuss Br¨ggemannKlein and Woods bkw algorithm and then u present a few optimizations to generate smaller expressions. First,