Applications of Brzozowski derivatives to XML Schema processing

C. M. Sperberg-McQueen

Abstract

Given a regular language L, the Brzozowski derivative of L with respect to some string s is a regular expression which defines what strings can follow s in strings appearing in L. The calculation of Brzozowski derivatives is a convenient and beautiful tool for the algebraic manipulation of content models. It can provide convenient solutions to several problems arising in developing, manipulating, and validating content models in W3C XML Schema: validation, checking the unique-particle-attribution constraint, and testing whether the language accepted by one content model is a subset of that accepted by another.

Keywords: Modeling; Schema Languages; Processing

C. M. Sperberg-McQueen

C.M. Sperberg-McQueen is a member of the technical staff at the World Wide Web Consortium; he serves on the W3C XML Schema Working Group and chairs the XML Coordination Group.

Applications of Brzozowski derivatives to XML Schema processing

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and Artificial Intelligence Laboratory]

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 C. M. Sperberg-McQueen. Reproduced with permission.

One of the main advances of SGML and XML over earlier methods of document representation is the idea that user-defined validity conditions for a document type could be formulated explicitly, and that document validity could be assessed automatically. Strategies for understanding, analysing, and implementing those constraints are thus a core problem for the use of markup-language technologies. Many discussions of these topics limit themselves to a brief discussion of content-model notation and its meaning; deeper discussions may discuss the equivalence between content models and finite-state automata, possibly showing how to translate content models into automata. Vocabulary designers, users, and implementors would all benefit from more serious attention to other techniques of working with content models, notably including the notion of the Brzozowski derivative.

This paper describes some applications of Brzozowski derivatives to problems which arise in connection with schemas for XML vocabularies. It begins with a brief introduction to Brzozowski derivatives and a description of how to extend them to cover content models in XML Schema 1.0, and then shows how to use derivatives to validate document instances against content models, check the Unique Particle Attribution constraint, handle a weakened form of that constraint, check whether one content model accepts a subset of the language accepted by another, and define either of two competing meanings for extensions of the XML Schema 1.0 all-group.

Software developers can use Brzozowski derivatives to reduce the cost of operations in certain circumstances, or to develop a simple well-understood reference implementation against which to check the behavior of faster but more complex code. Vocabulary developers can use Brzozowski derivatives to analyse and understand different ways of formulating particular content models; in particular, it may be very useful in vocabulary design to be able to check mechanically whether one content model accepts only a subset of what another content model accepts.

Introduction

Brzozowski derivatives (in what follows they will often be called just “derivatives”) were introduced by Janusz Brzozowski in 1964 [Brzozowski 1964].

Brzozowski defines derivatives thus:

Given a set R of sequences and some finite sequence s, the derivative of R with respect to s is denoted by DsR and is DsR = {t | stR}.

In words, the derivative of R with respect to s is the set of strings t which can follow s in sentences of R, or: the set of strings t such that the concatenation of s and t is a sentence in R.

Regular sets of strings can, of course, be denoted by regular expressions, and Brzozowski's contribution was to show how, given (1) a regular expression E denoting the language R and (2) a string s, to calculate a regular expression D denoting the derivative of R with respect to s. He also proved (3) that of all the derivatives of an expression, only a finite number would be distinct from each other in terms of recognizing different languages, and (4) that even if equal expressions are not always detected, there will still be only a finite number of dissimilar derivatives, if certain simple tests of similarity are performed; he then showed (5) how to construct a finite-state automaton from the set of characteristic derivatives thus identified.

Brzozowski derivatives have applications to schemas and can simplify discussions (and implementations) of validation, unique particle attribution, and similar problems. They can also be used to define ‘weakened wildcards’ (see below).

Related work

In their work on content models, Anne Brüggemann-Klein and Derick Wood have regularly pointed out the relations between regular expressions, finite-state automata (particularly Glushkov automata), and Brzozowski derivatives [Brüggemann-Klein 1993a], [Brüggemann-Klein 1993b], [Brüggemann-Klein/Wood 1998]. An even earlier mention (with, however, no references to Brzozowski or to any other literature) appears in an unattributed white paper prepared at Software Exoterica [Wilmott 1991], where derivatives are treated as a tool for algebraic manipulations of content models and proofs of content-model equivalence.

The earliest suggestion, however, that Brzozowski derivatives ought to be applied in validation software appears to be due to Joe English [English 1999], who provides a terse summary and urges their obvious utility for handling complex content models. Surprisingly, Joe English's own Haskell-based XML parser, HXML, appears not to use Brzozowski derivatives; not being a validating parser, it has no occasion to use them. But the suggestion was taken up by a number of others: Martin Schmidt adapts English's code fragment and applies Brzozowski derivatives in a validating XML parser written as a Master's thesis project and forming part of the Haskell XML Toolbox [Schmidt 2002]. More prominently, James Clark uses derivatives in his RelaxNG validator, Jing [Clark 2002]. According to at least one report ([Foster 2003]), Brzozowski derivatives are also used in Sun's MSV validator [Kawaguchi 2003]. Bob Foster has described (mostly in blog articles) a number of applications of Brzozowski derivatives, including validation of expressions with integer-range exponents on sub-expressions [Foster 2004].

Both English and Clark credit their acquaintance with the idea to Mark Hopkins [Hopkins 1994], who uses Brzozowski derivatives in a set of regular-expression tools posted to Usenet in 1994, and stresses the advantage of not having to build large finite state automata for short inputs.

Brzozowski derivatives have an obvious importance to those concerned with Relax NG: they help reduce the cost of Relax NG's non-determinism and co-occurrence constraints. But they also have applications to other schema languages, and deserve to be better understood among those working with them, including especially those working with XML Schema (hereinafter XSD).1

A simple extension of Brzozowski derivatives

This section explains the extension of Brzozowski derivatives to XSD content models. No knowledge of Brzozowski's original formulation is required.

Notation

Brzozowski worked with a regular-expression notation which differs interestingly both from standard regular expressions and from XSD content model notation. So to apply the notion of derivatives to XSD content models, we need to extend Brzozowski's rules. This is easily done.

For concision, I will use the following notation as a shorthand instead of writing content models in the transfer syntax defined by the XML Schema specification [W3C 2001b]. A content model expression is one of the following:

  • ε (in the XML Schema transfer syntax, this can be written <xsd:sequence/>)
  • ∅ (in the transfer syntax: <xsd:choice/>)
  • an expanded name, in the sense of the XML Namespaces specification [W3C 2004] (in the transfer syntax, <xsd:element ref = "QName"/> or <xsd:element name = "NCName"> ... </xsd:element>). For convenience, we will allow ourselves to write expanded names as qualified names (QNames) whenever the bindings of namespace prefixes are clear. The symbol e will sometimes be used as a dummy name.
  • wc(KW, NS), where KW is in {strict, lax, skip} and NS is either the keyword ##any or a list of namespace names or ε (in the transfer syntax, <xsd:any namespace = "NS" processContents = "KW"/>); I'll use ε to denote the absence of a namespace name; since the empty string can never be used as a namespace name, this leads to no ambiguities; the transfer syntax uses the keyword ##local.
  • wc(KW, not(NS)), with KW as above and NS either a namespace name or ε (in the transfer syntax, <xsd:any namespace = "##other" processContents = "KW"/>).
  • F | G, where F and G are each content-model expressions (in the transfer syntax, <xsd:choice> F G </xsd:choice>).
  • F, G, where F and G are each content-model expressions (in the transfer syntax, <xsd:sequence> F G </xsd:sequence>).
  • F{n, m}, where F is a content-model expression (in the content model, this is minOccurs = "n" maxOccurs = "m" on the element representing F).

These expressions denote regular sets of sequences of XML elements as follows:

  • ε denotes the set consisting solely of the empty sequence.2
  • ∅ denotes the empty set.
  • An expanded name e denotes an element (or, strictly speaking, a sequence of length one whose sole member is an element) whose generic identifier (written as a QName) maps to e.3
  • wc(KW, NS) (wc for ‘wildcard’) denotes a sequence whose sole member is an element in namespace NS.
  • wc(KW, not(NS)) denotes a sequence whose sole member is an element not in namespace NS.
  • F | G denotes the union of the set denoted by F and that denoted by G.
  • F, G denotes the set of sequences fg where fF and gG.
  • F{n, m} denotes the set of sequences formed by concatenating at least n and at most m occurrences of members of the set denoted by F, where F is a content-model expression, n is a non-negative integer, and m is either a positive integer or the symbol unbounded.

In examples, the notations familiar from SGML and XML content models will occasionally be used: for some expression E, E* ≡ E{0, unbounded}, E+ ≡ E{1, unbounded}, E? ≡ E{0, 1}.

Content models are either content-model expressions or “all”-expressions. An “all”-expression is:

  • F&G, where F and G are each either simple content-model expressions or are themselves “all”-expressions.
A simple content-model expression is ε, ∅, or F{n, 1}, where F is an expanded name and 0 ≤ n ≤ 1. (For convenience, we'll allow ourselves to write just e instead of e{1, 1}, when e is an expanded name.) An “all”-expression has the following denotation:
  • F&G, denotes the set of sequences formed by concatenating, in any order, exactly one member of each set denoted by the simple content-model expressions in F and G.

Some examples may help make things clear.

  • The content model expression “a, b, c{1, unbounded}” matches any of the following sequences of elements:
    • a b c
    • a b c c
    • a b c c c
    etc.
  • The content model expression “a, b, (c{1, unbounded} | (d){2, 4})” matches any of the sequences above, as well as any of:
    • a b d d
    • a b d d d
    • a b d d d d
  • The content model expression “a & b{0, 1} & c” matches any of the sequences
    • a c
    • c a
    • a b c
    • a c b
    • b a c
    • b c a
    • c a b
    • c b a
  • The content model expression “ε” matches the empty sequence.
  • The content model expression “” matches no sequences at all.

Note that while in XSD all of the operators | and , and & are n-ary (i.e. they take an arbitrary number of arguments), we can treat them without loss of generality as binary operators, all of which obey the associative law. That is, for any content model expressions E, F, and G, we have

(E | (F | G)) = ((E | F) | G) = (E | F | G)

and

(E, (F, G)) = ((E, F), G) = (E, F, G)

and

(E & (F & G)) = ((E & F) & G) = (E & F & G)

The last set of equations requires some discussion. They will seem foreign to those used to the SGML & operator, which is not associative, but natural to those accustomed to working with the interleave operator of Relax NG and some recent discussions of content models by computer scientists. In SGML, the expression (a & (b & c)) is not at all the same as ((a & b) & c): the former requires an a to appear at the beginning or end of the sequence, while the latter requires c to appear at the beginning or end, so that a c b is accepted by the first expression, but not the second. In RelaxNG, the two expressions are equivalent. In XSD, the issue does not arise, since XSD does not allow all-expressions to nest within all-expressions and so does not need to specify either the associative or the non-associative interpretation.

The & operators of SGML and RelaxNG also differ in another way: as suggested by the term interleave, RelaxNG allows sub-parts of the left- and right-hand expressions to be interleaved. In RelaxNG, for example, the expression (a, b) & (d, e) accepts the sequence a d b e, as well as any other in which both a occurs before b and c occurs before d, and nothing else occurs. In SGML, the expression accepts only the two sequences “a b d e” and “d e a b”. Here, too, XSD restricts the operator in such a way that the XSD operator can be interpreted either way: in XSD, all arguments of the & operator must be simple content-model expressions, so that the question of interleaving sub-parts of the matching strings does not arise.

Let it suffice for now to observe that the validation semantics assigned to all-expressions below have been formulated in such a way that the binary &-operator to which we translate XSD all-expressions does obey the associative law.

Nullability

It is useful to know, for an expression E, whether ε is in L(E) (i.e. whether the language accepted by E includes the empty string). Following Brüggeman-Klein [Brüggemann-Klein 1993a], I will say that if ε is in L(E), then E is nullable, and define the predicate ν(E) which is true if and only if E is nullable.

The definition of nullable for content-model expressions is:

  • ν(ε) = true
  • ν(∅) = false
  • ν(e) = false for any expanded name e
  • ν(wc(KW, NS)) = false for any value of KW and NS
  • ν(F | G) = ν(F) ∨ ν(G)
  • ν(F, G) = ν(F) ∧ ν(G)
  • ν(F{n, m}) = (n = 0) ∨ ν(F)
  • ν(F & G) = ν(F) ∧ ν(G)

Some algebraic manipulations will become simpler if we define a related function, which following Brzozowski I call δ. For all regular expressions E,

  • δ(E) = ε if and only if L(E) includes ε.
  • Otherwise, δ(E) = ∅
That is, if ν(E) then δ(E) = ε else δ(E) = ∅.

Construction of derivatives

First we define the derivative of an expression E with respect to a single expanded name x. This we specify to be identical by definition to the derivative with respect to a sequence consisting just of one element whose expanded name is x.

  • Dx ε = ∅
  • Dx ∅ = ∅
  • Dx e = ε if x = e
  • Dx e = ∅ if e is an expanded name, and (xe)
  • Dx wc(KW, NS) = ε if x is in namespace NS, else ∅
  • Dx wc(KW, not(NS)) = ∅ if x is in namespace NS, else ε
  • Dx (F | G) = (Dx F) | (Dx G)
  • Dx (F, G) = ((Dx F), G) | (Dx G) if F is nullable, else (Dx F), G4
  • Dx F{n,m} = (Dx F), F{--n, --m} if F is not nullable or if n = 0 and m = unbounded, else (Dx F), F{--n, --m} | (Dx F{--n, --m}). (See definition of -- operator below.)
  • Dx (F & G) = ((Dx F) & G) | (F & (Dx G))

For positive integers, the -- operator decrements, but it respects special cases: --k = k - 1 for integer k > 0, but --0 = 0, and --unbounded = unbounded.5

To calculate the derivative of E with respect to a sequence x1, x2, x3, ..., xn, we first take the derivative with respect to x1, then the derivative of that expression with respect to x2, and so on until we are at the end of the sequence. Algebraically,

Ds E = D<x1, x2, ... xn> E = Dxn (... (Dx3 (Dx2 (Dx1 E))) ...)

Simplification of expressions

The following identities are useful. For any expressions E, F, and G,

  • E | E = E
  • E | F = F | E
  • (E | F) | G = E | (F | G)
  • ε, E = E, ε = E
  • ε & E = E & ε = E
  • ∅ | E = E | ∅ = E
  • ∅, E = E, ∅ = ∅
  • ∅ & E = E & ∅ = ∅

In what follows, we'll (silently) make use of these identities to simplify expressions generated by taking derivatives.

Characteristic derivatives

Brzozowski proves that every regular expression is guaranteed to have a finite number of distinct (i.e. non-equivalent) derivatives. The set of distinct derivatives of E with respect to the strings in Σ* (for some input alphabet Σ) is called the characteristic derivatives of E.

Brzozowski also defines a test of similarity for regular expressions: expressions are similar if they can be transformed into each other using only the first three identities listed in section 3-4. Since in practice the detection of equivalence may be expensive, it is encouraging to note that Brzozowski also proves that every regular expression is guaranteed to have a finite number of dissimilar derivatives. The consequence is that even if we fail to detect all equivalences among the derivatives of an expression, we can still avoid having the number of characteristic derivatives grow without bound.

One complication should be addressed here: like all standard work on regular languages, Brzozowski assumes that the input alphabet Σ is finite. Content models, however, are regular expressions over an infinite set, namely XML element types (or element type names). The following discussion silently makes the usual adjustment: for content models without wildcards, it suffices to assume an alphabet including all the names used in the content model, plus one name which matches nothing in the content model. Since in the absence of wildcards all non-matching names behave the same way, the single non-matching name successfully represents the infinite set of possible inputs which match nothing in the content model. When wildcards are present, it suffices to include one otherwise unknown name in each namespace mentioned in any wildcard, plus one name in a namespace unmentioned in any wildcard. In this way we can work with finite input alphabets which correctly stand in for the infinite input alphabet provided by XML.

Applications to XML Schema processing

This section identifies some important tasks in XSD processing and shows how Brzozowski derivatives provide convenient and clear ways to formulate and perform the task.

Validation

We can use derivatives to validate without ever constructing anything recognizable as an automaton: for content model E and sequence s, E accepts s if and only if ν(Ds E).6

For strongly deterministic expressions, the time it takes to calculate the derivative of E with respect to a single input symbol is at most linear in the size of E, and the derivative is never larger than E.7 This means that overall, validation of a sequence will be at worst linear in the size of the sequence. (Note that formal proofs of these claims remain a task for future work.)

For example, let us consider the strongly deterministic content model E =

( (script | style | meta)*,
  ( (title, (script | style | meta)*, (base, (script | style | meta)*)?)
    |
    (base, (script | style | meta)*, (title, (script | style | meta)*))
  )
)
which is a slight simplification of the definition of head in XHTML. If we seek to validate the sequence <meta, title, style> against this content model, we proceed as follows:
  1. Find Dmeta E.
    This proves to be E again (which is intuitively plausible: any number of meta elements are allowed at the beginning of the sequence).8
  2. Now find Dtitle E.
    This proves to be ((script | style | meta)*, (base, (script | style | meta)*)?).9
  3. Now find Dstyle ((script | style | meta)*, (base, (script | style | meta)*)?).
    This is (base, (script | style | meta)*)?10
    Since ν((base, (script | style | meta)*)?) is true, the sequence is valid against the content model.

For non-strongly-deterministic expressions, the derivative may be larger than the original expression and will take the form of a disjunction. If the expression is weakly deterministic, then the number of disjuncts will correspond to the number of states the (non-strongly-deterministic) Glushkov automaton could be in at the given state in the parse.

For example, consider the content model E = (e{1,5}, b{0,2}){1,5}. A straightforward calculation of the derivatives shows how the second and later occurrences of “e” in an input string cause more and more growth in the size of the derivative:

E =
(e{1,5}, b{0,1}){1,5}
De E =
e{0,4}, b?, (e{1,5}, b?){0,4}
Dee E =
  e{0,3}, b?, (e{1,5}, b?){0,4} 
| e{0,4}, b?, (e{1,5}, b?){0,3}
Deee E =
  e{0,2}, b?, (e{1,5}, b?){0,4}
| e{0,4}, b?, (e{1,5}, b?){0,3}
| e{0,3}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2}
Deeeb E =
  (e{1,5}, b?){0,4} 
| (e{1,5}, b?){0,3} 
| (e{1,5}, b?){0,3} 
| (e{1,5}, b?){0,2}'
Deeee E =
  e?, b?, (e{1,5}, b?){0,4} 
| e{0,4}, b?, (e{1,5}, b?){0,3} 
| e{0,3}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2} 
| e{0,2}, b?, (e{1,5}, b?){0,3} 
| e{0,4}, b?, (e{1,5}, b?){0,2} 
| e{0,3}, b?, (e{1,5}, b?){0,2} 
| e{0,4}, b?, (e{1,5}, b?)?

Unique Particle Attribution

SGML DTDs, XML 1.0 DTDs, and XSD 1.0 all share a rule which restricts the forms of regular expressions legal in content models, in order to ensure that they do not require unbounded lookahead. ISO 8879 requires that content models be “unambiguous”. The sense assigned to ambiguity is somewhat broader than the sense usual in formal language study: the expression a?, a is ambiguous in the SGML sense, because when a parser is confronted with a sequence beginning with an a, lookahead is required to know whether it matches the first a in the expression, or the second. It is not ambiguous in the usual formal sense: for any word in the language, there is only one parse tree. In part to avoid the confusion caused by this unusual usage of the term ambiguity, the XML 1.0 spec requires instead that content models be deterministic. Determinism, however, may be defined in either of two ways, weak or strong. To avoid the confusion caused by the two flavors of determinism, XSD 1.0, in turn, imposes what it calls the Unique Particle Attribution constraint. All three of these names denote essentially the same requirement: that it be possible, without lookahead in the document instance, to know which token (or particle) in a content model is matched by a token in the document instance, and that there be at most one such token. This is what Brüggemann-Klein calls weak determinism, as opposed to strong determinism, which additionally requires that every repetition of a token be licensed by at most one occurrence indicator in the expression.11

Previous formulations of this constraint have either provided an algorithm translating expressions into automata, required that this translation be performed, and appealed to properties of the resulting automata (so Brüggemann-Klein and Wood in [Brüggemann-Klein 1993a] and [Brüggemann-Klein/Wood 1998]), or else they have refrained from requiring such a translation, but then found themselves unable to specify precisely what the constraint is, relying instead on nudges and winks and hopes that the reader understands what is meant (so ISO 8879 [ISO 1986] and the XML specification [W3C 2004]).12 Using Brzozowski derivatives, however, the UPA constraint of XSD 1.0 can be specified constructively and precisely, without reference to the corresponding automata.13 Two approaches offer themselves.

Testing initial determinism

Definition

The first idea is this: we define a property which holds of some content-model expressions: determinism with respect to initial x, where x is an input signal (for us: an expanded name). The UPA constraint can now be expressed thus:

  • The original content model E must be x-deterministic for all x.
  • For any sequence of input signals s, the derivative of E with respect to s must be x-deterministic for all x in the input alphabet.
Since taking the derivative of E with respect to every sequence s of input signals will eventually generate all of the characteristic derivatives of E, the second item can also be stated thus:
  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, F must be deterministic with respect to initial x.

We can now define the function idet(x, E), which is true if and only if E is deterministic with respect to initial x.

  • idet(x, ε) = true
  • idet(x, ∅) = true
  • idet(x, e) = true
  • idet(x, wc(KW,NS)) = true for any value of KW and NS
  • idet(x, wc(KW,not(NS))) = true for any value of KW and NS
  • idet(x, (F | G)) = ((DxF = ∅ ∧ idet(x,G)) ∨ (DxG = ∅ ∧ idet(x,F)))
  • idet(x, (F, G)) = if ν(F), then ((DxF = ∅ ∧ idet(x, G)) ∨ (DxG = ∅ ∧ idet(x, F))), else (F not nullable) idet(x,F)
  • idet(x, (F & G)) = ((DxF = ∅ ∧ idet(x, G)) ∨ (DxG = ∅ ∧ idet(x, F)))
  • idet(x, F{n,m}) = idet(x, F)
It is useful to note that for all expressions E, if DxE = ∅, then idet(x,E) = true.

This approach was anticipated in part by Sam Wilmott [Wilmott 1991], but the treatment of repetition and of sequences differs there.14

Examples

A few simple examples may illustrate the method. First, let E be a{2,4}, a. If we test the first few derivatives of E for initial determinism with respect to a, we have:

  • DεE = (a{2,4}, a)
    • idet(a, (a{2,4},a)) = idet(a, a{2,4})15 = true.
  • DaE = (a{1,3}, a)
    • idet(a, (a{1,3},a)) = idet(a, a{1,3}) = true
  • DaE = (a{0,2}, a)
    • idet(a,a{0,2}, a) = ((Dxa{0,2} = ∅ ∧ idet(x, a)) ∨ (Dxa = ∅ ∧ idet(x, a{0,2}))) = ((false ∧ true) ∨ (false ∧ true)) = false
At this point it is unnecessary to consider other derivatives of E: since idet(DaE) is false, we know that E does not obey the UPA constraint. This is a simple instance of the general rule: if a choice between F and G would be non-deterministic, then the sequence (F{n,m}, G) is non-deterministic for n < m.

When n = m, however, the situation is different, as shown by a second example. Let E be a{2,2}, a. There are five characteristic derivatives of E (for any alphabet containing a and at least one other symbol, here represented by b). We have:

  • DεE = a{2,2}, a
    • idet(a, (a{2,2},a)) = idet(a, a{2,2}) = true.
    • idet(b, (a{2,2},a)) = idet(b, a{2,2}) = true.
  • DaE = a{1,1}, a
    • idet(a, (a{1,1},a)) = idet(a, a{1,1}) = true
    • idet(b, (a{1,1},a)) = idet(b, a{1,1}) = true
  • DaaE = a
    • idet(a, a) = true
    • idet(b, a) = true
  • DaaaE = ε
    • idet(a, ε) = true
    • idet(b, ε) = true
  • DbE = DabE = DaabE = ∅
    • idet(a, ∅) = true
    • idet(b, ∅) = true
Since idet(x,F) is true for all derivatives F of E and all symbols x in the alphabet, we can conclude that E obeys the UPA constraint.

As final example, let E = (a, a?){2,4}. We have:

  • DεE = (a, a?){2,4}
    • idet(a, (a{2,2},a)) = idet(a, a{2,2}) = true.
    • idet(b, (a{2,2},a)) = idet(b, a{2,2}) = true.
  • DaE = a?, (a, a?){1,3}
    • idet(a, a?, (a, a?){1,3}) = ((Daa? = ∅ ∧ idet(a, (a, a?){1,3})) ∨ (Da(a, a?){1,3} = ∅ ∧ idet(a, a?))) = ((false ∧ true) ∨ (false ∧ true)) = false
No further examination is necessary, to know that E is non-deterministic and thus violates the UPA constraint.

Reducing the cost of the calculation

In some cases, the work needed to test an expression for initial determinism can be reduced. This can be important for large maximum numbers of repetitions, especially when expressions with repetition operators nest. The expression (e{0,1000}){0,1000}, for example, has 1,000,002 (1000 × 1000 + 2) distinct derivatives.16

The key to reducing the cost of the calculation is noticing that for 1 < n < m and any expression E, the expression E{n, m} is deterministic for initial x if and only if E{--n, --m} is deterministic for initial x. This means that n can be reduced to 1, and m to 2, without changing the result of the determinism calculation (but saving some arbitrarily large number of derivatives in the process). When the minimum and maximum numbers of repetitions are the same, then they can be reduced to 2 but not to 1. That is, if n = m > 2, then E{n, m} is deterministic for initial x if and only if E{--n, --m} is deterministic for initial x.

More formally: every subexpression E of the form F{n,m} can be replaced by another expression E′, constructed so that E obeys the UPA constraint if and only if E′ does, but E′ has fewer characteristic derivatives. The value of E′ is:

  • If E = ε or ∅ or an expanded name or wc(KW,NS) or wc(KW,not(NS)), then E′ = E.
  • If E = F | G, then E′ = F′ | G′.
  • If E = F, G, then E′ = F′, G′.
  • If E = F{n, m}, then E′ = F′{j,k}, where
    • F′ is the recursive simplification of F.
    • If n = m > 1, then j = k = 2.
    • Else j = min(1,n), k = min(2,m), where for all natural numbers n, min(n, unbounded) = n

It's intuitively clear that for any expression E we can construct the E′ described. Informally, it's not hard to see that E is weakly deterministic if and only if E′ is: in expressions of the form F{n,m}, there are two possible sources of non-determinism other than non-determinism within F itself, each unaffected by the reduction of n to 1 or 2 and m to 2:

  • If n < m, then some derivative of E will have the form F{0,j} (with j = m - n). At that point, there may be weak non-determinism between the start set of F and the start set of whatever expressions can follow E. The magnitude of the difference between n and m is immaterial, so m can be reduced to n + 1 without affecting the determinism of the expression.
  • If m > 1, then F{n, m} is equivalent, for purposes of language recognition and determinism, to F, F{--n,--m} and there may be weak non-determinism between the start set of F and optional tokens at the end of F.17 It does not matter whether F can be repeated a hundred times or only repeated once: if it can be repeated at all, this non-determinism must be checked for, and m can be reduced from 100, or from unbounded, to 2 without affecting the determinism of the expression. At the same time, n must be reduced by a commensurate amount: if n = m, both are reduced to 2, while if n < m, n can be reduced to 1.

It would be desirable to have a more formal proof that the reduction in exponents (a) reduces the number of distinct derivatives and (b) preserves strong and weak determinism / non-determinism. Such a formal proof remains a task for future work.

Counting initial matches

A second approach is to define a function c which counts the number of tokens in the expression which can match an initial symbol x:

  • c(x, ε) = 0
  • c(x, ∅) = 0
  • c(x, e) = 1 if x = e, else 0
  • c(x, wc(KW,NS)) = 1 if x matches NS, 0 otherwise
  • c(x, wc(KW,not(NS))) = 0 if x matches NS, 1 otherwise
  • c(x, (F | G)) = c(x,F) + c(x,G)
  • c(x, (F, G)) = c(x,F) + c(x,G) if F is nullable, else c(x,F)
  • c(x, (F & G)) = c(x,F) + c(x,G)
  • c(x, F{n,m}) = c(x,F))

The XSD Unique Particle Attribute constraint, the XML 1.0 determinism rule, and the SGML ambiguity rule then all amount to this:

  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, c(x, F) ≤ 1.

To illustrate, let us repeat the first example from the previous section: let E be a{2,4}, a. For the first few derivatives of E, we have:

  • DεE = (a{2,4}, a)
    • c(a, (a{2,4},a)) = c(a, a{2,4}) = 1.
  • DaE = (a{1,3}, a)
    • c(a, (a{1,3},a)) = c(a, a{1,3}) = 1
  • DaE = (a{0,2}, a)
    • c(a,a{0,2}, a) = c(x, a{0,2}) + c(x, a) = 1 + 1 = 2
As before, the derivative of E with respect to a decides the question: c(DaE) > 1, and so E violates the UPA constraint.

Weakened wildcards

In discussions of XSD and the Unique Particle Attribution constraint, various alternative validation behaviors have occasionally been proposed; one of the most popular ideas is ‘weakened wildcards’. This is a weakened form of the Unique Particle Attribution constraint, which allows content models in which both an extended name and a wildcard can match an input element, with the proviso that in cases of such competition, it is the extended name, not the wildcard, which is used to match the input element. Competition between element tokens or between wildcards is still prohibited.

Either of the two approaches to the UPA constraint described above can be extended to handle weakened wildcards.

The idet predicate can be specialized in the obvious way into two predicates, edet (which is true just in case there is at most one element particle in the content model which can match the input symbol) and wdet (true if there is at most one matching wildcard). The weakened form of the constraint is then:

  • For all symbols x in the input alphabet, and for all characteristic derivatives F of the original content model, edet(x,F) and wdet(x,F) must both be true.

The c function can similarly be specialized to make functions that count just element tokens (ecount) or just wildcard tokens (wcount) which can match an initial x. The weak wildcard rule then says that for every member F of the characteristic derivatives of E and every symbol x in the input alphabet, it must be the case both that ecount(x,F) ≤ 1 and that wcount(x,F) ≤ 1.

Checking subsumption of content models

This topic has already been treated, using different formalisms, by Henry Thompson and Matthew Fuchs [Thompson/Tobin 2003], [Fuchs/Brown 2003]. Brzozowski derivatives provide a different way of considering the problem, which seems in some respects more convenient and concise.

Section 4-4-1 below extends the notation described above in section 3-1, adding (as Brzozowski does) the operators 'and' and 'not' to the expression language and showing how to take their derivatives. They are redundant and do not extend the expressive power of the language, but they are convenient to have when considering subsumption.

Section 4-4-2 expresses the subsumption test in several ways using derivatives, notably:

B subsumes R if and only if for all strings s, δ(Ds (R and ~B)) = ∅

(N.B. this definition of subsumption corresponds to shallow local validation, not to deep validation.)

Section 4-4-3 describes a relatively simple way to test subsumption: find the set of characteristic derivatives of (R and ~B) and test each one for nullability. Some speculations about complexity are also included.

Section 4-4-4 works in detail through some examples.

Section 4-4-5 is a series of proofs showing that the various formulations of subsumption offered in section 4-4-2 are in fact equivalent.

Extending the expression notation

It will be convenient, before we proceed, to extend the expression language defined earlier. In addition to the expressions defined above, the following are also legal expressions:

  • F and G, where F and G are each content-model expressions or “all”-expressions
  • ~F, where F is either a content-model expression or an “all”-expression
We interpret these expressions as follows:
  • F and G denotes the set of sequences which are both in F and in G. Note that this is not the same as F & G — this is a straightforward set intersection operation.
  • ~F denotes the set of sequences not in F

We need to define how to take derivatives with respect to expressions of these shapes. Brzozowski proves that for all strings s,

Ds (F and G) = (Ds F) and (Ds G)
Ds (~F) = ~(Ds F)
We'll use these below.

Expressing the subsumption test

It is a well known result that for two regular expressions B and R, it is readily decidable whether B subsumes R, i.e. whether the language accepted by R is or is not a subset of that accepted by B. Since content models are so similar to regular expressions, it has seemed plausible that the same relation is decidable for content models. In the worst case, it is always possible to translate the two content models first into regular expressions and then into finite state automata exploit the standard algorithms. The presence of wildcards, the all connector, and repetition operators other than the Kleene star, however, has complicated the story somewhat. Uncertainty about the possible effect of these complications contributed to the decision to define complex type restriction not in terms of the subset relation between languages accepted by two types, but instead in terms of a complex set of constructive rules for checking content models of substantially similar structure. It is thus particularly significant that Brzozowski derivatives offer a way to define and check the subsumption relation between two content models.

To test whether expression B subsumes expression R, we want to test any one of the following propositions:

  1. for all strings s, s is in L(R) only if s is in L(B)
  2. for all strings s, δ(Ds R) = ε only if δ(Ds B) = ε
  3. the expression (R and ~B) denotes the empty set
  4. for all strings s, δ(Ds (R and ~B)) = ∅

A proof that these expressions are equivalent is given in section 4-4-5 below.

Calculating subsumption

The simplest way I can find to define the subsumption test using derivatives is this. We wish to establish that for all strings s over the input alphabet, δ(Ds (R and ~B)) = ∅.

Brzozowski proves that any expression E has "a finite number of types of derivatives", where two expressions E and F are of the same type if and only if L(E) = L(F), i.e. the two expressions are equivalent. The n different types (equivalence classes) of derivatives of E are called the characteristic derivatives of E. Brzozowski defines a straightforward way to generate all the characteristic derivatives of E, which involves examining increasingly long strings over the input alphabet until a fixed point is reached. The algorithm runs something like this:

We'll use variables n (current string length), X (the set of characteristic derivatives found so far), and f (a Boolean flag).

  1. Start with n = 0, X = ∅, f = false.
  2. For each s in the set of all strings of length n,
    1. Let F = Ds E.
    2. If F is equivalent to any member of X, do nothing.
    3. Otherwise, let X = X ∪ {F} and let f = true.
  3. If f = true, then increment n and go to step 2.
  4. Otherwise, f = false, and X contains the set of characteristic derivatives.

The finiteness of the set of characteristic derivatives gives us one simple way to test proposition (4):

  1. Find the set of characteristic derivatives of (R and ~B).
  2. For each characteristic derivative, test whether it is nullable or not.

If no characteristic derivative of (R and ~B) is nullable, then B subsumes R.

There may be some way to do this by analysing R and B separately, and performing some calculation on their two sets of characteristic derivatives, but in the general case this doesn't seem to simplify things; in the worst case, R has n characteristic derivatives and B has m, and we have to check n × m combinations. This is exactly what taking the characteristic derivatives of (R and ~B) does: since for any string s we have

Ds (R and ~B) = (Ds R) and (Ds ~B) = (Ds R) and ~(Ds B)

we end up calculating the derivatives of R in parallel to those of B, with the exception that we don't necessarily generate all n × m pairs of characteristic derivatives.

It seems clear that calculating the set of characteristic derivatives of (R and ~B) takes about as much work as (and is isomorphic to) calculating the MR and MB (the Glushkov automata of R and B) and then calculating, for all strings over the alphabet, whether the string is in B and not R.

Which in turn is about the same amount of work it takes to calculate the two Glushkov automata, negate the automaton for B, and calculate the automaton which is the intersection of the MR and M~B.

It may be possible to reduce the cost of translation by translating content models not into standard finite state automata but into what might be called ‘counter-extended automata’ (described in an unpublished paper [Sperberg-McQueen 2004]). In the worst case, however, counter-extended automata explode in size in much the same way as a naive expansion of content models into standard regular expressions. I still don't know a good theory explaining when and how using extended automata will be cheaper than expanding to standard expressions and using standard automata, and by how much. This is another topic for further work.

Examples

Shorter all-group

Consider first the pair of expressions

B = (a & b & c)
R = (a & b)
Does B subsume R?

We test by seeing whether among the characteristic derivatives of E = (R and ~B) there is any which is nullable. The characteristic derivatives are found thus:

For n = 0:

  • s = ε, Ds E = (a & b) and ~(a & b & c). This is not nullable.

For n = 1:

  • s = “a”: Ds E = (b and ~(b & c)). Not nullable.
  • s = “b”: Ds E = (a and ~(a & c)). Not nullable.
  • s = “c”: Ds E = ∅. Not nullable.

For n = 2:

  • s = “aa”: Ds E = ∅. Not nullable.
  • s = “ab”: Ds E = (ε and ~(c)). Nullable.
  • s = “ac”: Ds E = (∅). Not nullable .
  • s = “ba”: Ds E = (ε and ~(c)). Nullable.
  • s = “bb”: Ds E = ∅. Not nullable .
  • s = “bc”: Ds E = ∅. Not nullable .
  • s = “ca”: Ds E = ∅. Not nullable .
  • s = “cb”: Ds E = ∅. Not nullable .
  • s = “cc”: Ds E = ∅. Not nullable.

For n = 3, all strings produce Ds E = ∅, which is not nullable.

Note that (as expected) the sequences “ab” and “ba” constitute counter-examples: this derivation is not legal because they are locally valid against R but not against B.

Another shorter all-group

If we rewrite the base type as B′ = (a? & b? & c?) then we get these results:

For n = 0, 1, 2:

  • s = ε: Ds E = (a & b) and ~(a & b & (c?)). Nullable = false.
  • s = “a”: Ds E = b and ~(b & (c?)). Nullable = false.
  • s = “b”: Ds E = a and ~(a & (c?)). Nullable = false.
  • s = “c”: Ds E = ∅. Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(c?). Nullable = false.
  • s = “ac”: Ds E = ∅. Nullable = false.
  • s = “ba”: Ds E = ε and ~(c?). Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.
  • s = “bc”: Ds E = ∅. Nullable = false.
  • s = “ca”: Ds E = ∅. Nullable = false.
  • s = “cb”: Ds E = ∅. Nullable = false.
  • s = “cc”: Ds E = ∅. Nullable = false.

For n = 3, all strings get Ds E = ∅. Nullable = false.

As may be seen, the characteristic derivatives of (R and ~B′) are:

  • (a & b) and ~(a & b & (c?))
  • b and ~(b & (c?))
  • a and ~(a & (c?))
  • ε and ~(c?)
and none of them is nullable. Since for every string s, Ds E is equivalent to one of these five derivatives, we have established that for every string s, Ds E is not nullable, which means ε is not accepted by Ds E, which means E did not accept s. The absence of any string accepted by E is precisely what we mean when we say L(E) = ∅. So B′ does subsume R, Q.E.D.

Reduction of all-group to sequence
B = (a & b)
R = (a, b)
E = (R and ~B) = ((a, b) and ~(a & b))

We obtain the characteristic derivatives of E thus:

For n = 0, 1, 2:

  • s = ε: Ds E = (a, b) and ~(a & b). Nullable = false.
  • s = “a”: Ds E = b and ~(b). Nullable = false.
  • s = “b”: Ds E = ∅. Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(ε). Nullable = false.
  • s = “ba”: Ds E = ∅. Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.

For n > 3, for all strings s, Ds E = ∅. So the characteristic derivatives are those given above, none of them nullable. So B subsumes R. Q.E.D.

Translation of all-group to equivalent choice of sequences
B = (a & b)
R = ((a, b) | (b, a))
E = ((a, b) | (b, a)) and ~(a & b)

The characteristic derivatives are all reached by strings of length < 3:

  • s = ε: Ds E = ((a, b) | (b, a)) and ~(a & b). Nullable = false.
  • s = “a”: Ds E = b and ~(b). Nullable = false.
  • s = “b”: Ds E = a and ~(a). Nullable = false.
  • s = “aa”: Ds E = ∅. Nullable = false.
  • s = “ab”: Ds E = ε and ~(ε). Nullable = false.
  • s = “ba”: Ds E = ε and ~(ε). Nullable = false.
  • s = “bb”: Ds E = ∅. Nullable = false.

B subsumes R. Q.E.D.

A more complex example

It may be worth working an example with larger content models, just to give an idea of how this algorithm behaves when the expressions are not quite so simple.

Let's take

B = (a, ((b, c, d){0,5}, e{0,1}){0,4}, f)
R = (a, b, (c, d, b){2,3}, c, d, e, f)
E = (R and ~B) = ((a, b, (c, d, b){2,3}, c, d, e, f) and ~(a, ((b, c, d){0,5}, e{0,1}){0,4}, f))

B does subsume R (at least that was the intent when they were written), but it's not easy to see this at a glance. I won't go through the algorithm in full detail, but here are some notes.

Expression E has, if my calculations are correct, 18 characteristic derivatives, one each calculable as Ds E for s in the 13 prefixes of the string “abcdbcdbcdef”, four more for the last five prefixes of “abcdbcdbcdbcdef” (one of those prefixes has the same derivative as a known one), and one (the empty set) which applies to all other strings. Some of these are given below; there seemed little point in listing all of them.

The derivatives steadily get more and more complex as the string gets longer and exercises more and more heavily the non-determinism in the exponents of the base type:

For n = 0,

  • s = ε: Ds E = (a, b, ((c, d, b){2,3}), c, d, e, f) and ~(a, ((((b, c, d){0,5}), (e?)){0,4}), f)

For n = 1:

  • s = “a”: Ds E = (b, ((c, d, b){2,3}), c, d, e, f) and ~(((((b, c, d){0,5}), (e?)){0,4}), f).
  • s = b (etc.): Ds E = ∅

For n = 2:

  • s = “ab”: Ds E =
    (((c, d, b){2,3}), c, d, e, f) 
            and
            ~( ( (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3})) 
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               ),
               f)
All other sequences of length 2 yield the derivative ∅.

For n = 3:

  • s = “abc”: Ds E =
    (d, b, ((c, d, b){1,2}), c, d, e, f)
            and 
            ~( ( (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3}))
               | (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (d, ((b, c, d){0,4}), (e?))
               ), 
               f
             )

When we reach n = 11, the derivative has reached alarming proportions:

  • s = “abcdbcdbcdb”: Ds E =
    (c, d, e, f)
            and
            ~( ( (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,3}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?)){0,2}))
               | (c, d, ((b, c, d){0,4}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?), 
                 ((((b, c, d){0,5}), (e?))?))
               | (c, d, ((b, c, d){0,4}), (e?))
               | (c, d, ((b, c, d){0,3}), (e?))
               | (c, d, ((b, c, d){0,2}), (e?))
               | (c, d, ((b, c, d)?), (e?))
              ), f)

Alert eyes will note that the rudimentary expression simplifier used to produce the derivative just shown has not fully simplified some of these expressions: a smarter simplifier eliminates some of the subterms in the negated expression, producing

(c, d, e, f) 
        and 
        ~( ( (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?)){0,3}))
           | (c, d, ((b, c, d){0,4}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,4}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,4}), (e?))
           | (c, d, ((b, c, d){0,3}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,3}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,3}), (e?))
           | (c, d, ((b, c, d){0,2}), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d){0,2}), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d){0,2}), (e?))
           | (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?)){0,2}))
           | (c, d, ((b, c, d)?), (e?), 
             ((((b, c, d){0,5}), (e?))?))
           | (c, d, ((b, c, d)?), (e?))
         ), f)
which has 13 disjuncts in the inner expression, rather than 35. It may be worth noting that Brzozowski shows explicitly that even if only relatively rudimentary simplification is done on the characteristic derivatives, there is still only a finite number of them. More formally, he shows that not only is there a finite number of characteristic derivatives such than none is equal to the others, there is also a finite number, albeit a larger number, such that none is similar to the others, where similar expressions can be transformed into each other using only the identities
E | E = E
E | F = F | E
(E | F) | G = E | (F | G)
Any expression simplifier which exploits at least these identities will suffice for the task; further or more aggressive simplifications may reduce the number of distinct expressions found.

For both n = 11,