Recent work on markup has emphasized the importance of overlapping structures, but little progress has been made toward validation of such structures: the state of the art in validation of overlap remains where it was in 1986 with the CONCUR feature of SGML.
Like CONCUR, rabbit/duck grammars manage validation by means of several context-free grammars, but rabbit/duck grammars offer some improvements over CONCUR: within a document grammar written for one hierarchical subset of the markup vocabulary, the start- and end-tags of other hierarchies may be made visible; rabbit/duck grammars thus make it possible to constrain the interactions of different hierarchies in ways not possible with CONCUR. Rabbit/duck grammars also provide a more principled account of self-overlap (i.e. overlap of two elements with the same generic identifier) than CONCUR.
Documents can be validated against rabbit/duck grammars either by validating individually against each grammar or by a single pass through the document which validates against all grammars simultaneously. An implementation of this single-pass approach using Brzozowski derivatives is sketched out.
Keywords: Validating; Modeling; Concurrent Markup/Overlap
| XML Source | PDF (for print) | Author Package | Typeset PDF |
This document introduces rabbit/duck grammars, a method of validating overlapping structures.1
One of the first things taught to students in courses on SGML or XML is that elements nest within other elements, thus forming a hierarchical structure or tree. One of the first questions asked by alert students is: but what about non-hierarchical structures in the information? What about overlapping structures?
It is thus no surprise that the development and spread of SGML and XML have been accompanied from the beginning by a vigorous discussion of the limits of tree structures and the best ways of dealing with non-hierarchical information; see, for example, [Barnard et al. 1988], [Renear et al. 1993], [Barnard et al. 1995], [Huitfeldt 1995], [Murata 1995], [Durand et al. 1996]. Out-of-line or standoff markup [Dybkjær et al. 1998], fragmentation of elements, milestone elements, virtual elements (e.g. [Barnard et al. 1988], [ACH/ACL/ALLC 1994], [Barnard et al. 1995]), concurrent structures [ISO 1986] [Sperberg-McQueen / Huitfeldt 1999], MECS [Huitfeldt 1999], parallel encoding (e.g. [Witt 2004], [Hilbert et al. 2005], [Dekhtyar / Iacob 2005]), bottom-up virtual hierarchies [Durusau / O'Donnell 2002a], layered markup and annotation (LMNL) [Tennison / Piez 2002], range algebra [Nicol 2002], just-in-time trees [Durusau / O'Donnell 2002b], multi-colored trees [Jagadish et al. 2004], tables [Durusau / O'Donnell 2004], TexMecs [Huitfeldt / Sperberg-McQueen 2001], Goddag structures [Sperberg-McQueen / Huitfeldt 2000], Trojan horses [DeRose 2004] — the world has not been starved of proposals for markup systems that handle overlap more or less gracefully (for some sense of the word handle). The discussion has been vigorous both with respect to the serial form to be assumed by the data and with respect to the abstractions or data structures around which access to and operations on the data are to be organized.
On the topic of validation, however, the discussion has been remarkably quiet. The state of the art appears to be today just what it was twenty years ago, when ISO 8879 defined the validity conditions for markup using the SGML feature CONCUR. Concurrent markup defines multiple trees over the same frontier2, each obeying its own DTD; ISO 8879 does not go into details on the subject, but the document may conveniently be regarded as valid if and only if it is valid against each of its DTDs.
In general, discussions of overlap say nothing at all about validation, but those mechanisms (e.g. multi-colored trees, XCONCUR) which like CONCUR treat overlap as the interplay of multiple trees (possibly over the same frontier, as in CONCUR, and possibly over different frontiers, as in multi-colored trees) can in general be validated by validating each of their trees in isolation against an appropriate document grammar.
Tree-by-tree validation provides no mechanism for constraining the interactions of the trees: each tree is validated in majestic and lonely isolation. Three forms of interaction in particular are worth noting.
In the search for a method of validating overlapping structures, several goals may be identified:
The method outlined in this paper for writing document grammars and validating documents against them, called rabbit/duck grammars, is designed to address these issues.
Some precursors of the idea are worth noting; they may help make it clearer.
The first idea is language intersection. Formal language theory treats languages as sets of strings; the intersection of two languages L 1 and L 2 is just the set of strings which appear both in L 1 and in L 2. Other set operations (e.g. union and complementation) are also possible.
Discussions of formal language theory routinely refer to the important results which establish that regular languages are closed under the operations of union, intersection, and complementation, and that the set of context-free languages is not closed under intersection. An example often cited is this: consider the languages L 1 and L 2, where L 1 is
This example can be (and, I think, normally is) taken as an indication that set operations are ‘safe’ when performed on regular languages but unsafe on context-free languages. Safe, since they cannot have a result which is not itself a regular language; unsafe, since they clearly increase the expressive power of the formalism.
A few years ago, however, Lars Johnsen (a collaborator on the MLCD project) suggested in a workshop that the facts here could be given a more positive interpretation. While it is in general difficult to tell in reasonably bounded time whether a given string is a member of an arbitrary context-sensitive language, the example just given demonstrates that there are at least some context-sensitive languages for which the test can be performed in the same time required to parse context-free languages, just by parsing the string twice against different context-free grammars; if both accept it, then the context-sensitive language also accepts it. Rabbit/duck grammars apply this idea to document grammars: while in general languages like TexMecs require more than context-free power, they can be factored into a finite set (often a small one) of mostly conventional document grammars.3
A second related idea is the SGML feature CONCUR, which rabbit/duck grammars resemble in some ways. In SGML using CONCUR,4 overlap is handled by grouping elements into trees. Each tag is marked with the identifiers of the trees it belongs to; tags belonging to other trees are ignored.5
Rabbit/duck grammars resemble CONCUR in analysing overlap for the most part as involving distinct trees which share frontiers and some internal structures, and in using a set of several context-free grammars as the basis of validation. They differ from CONCUR in allowing each grammar to refer to tags for elements which are not part of the tree being validated, and in providing for elements which overlap other elements with the same generic identifier.
The closest analogue I am aware of to rabbit/duck grammars, however, is an idea formulated in February 1992 by David Megginson on the Netnews discussion list comp.text.sgml [Megginson 1992].
Megginson's idea was to use end-tag omission, together with the fact that in SGML empty elements look the same as start-tags, to allow the same SGML data stream to be parsed in either of two ways. He provides two DTDs. In one, the elements recto, verso, line, and clause are empty while the elements poem, fit, verse, averse, and bverse have content. In the other DTD, the roles are reversed. Whichever grammar is used for parsing and processing, one hierarchical view is dominant and the other recessive, but Megginson's technique allows information about (the start of) elements in the recessive view to be visible. Whether the tag recto denotes an empty milestone or the beginning of an element with content depends on the grammar; changing grammars performs a sort of Gestalt switch.
Rabbit/duck grammars use a similar sort of Gestalt switch (although the technical details are different), and are named for the well known image used in many discussions of Gestalt perception.6

A figure which could be the head of a rabbit facing right or of a duck facing left.
Rabbit/duck grammars resemble Megginson grammars in using multiple document grammars and treating the same tag as a start-tag in some and a milestone in others; they differ in some ways made necessary by syntactic differences between the SGML reference concrete syntax and XML. Megginson also did not contemplate restricting the interactions of the two grammars; he used inclusion exceptions to allow milestones for recessive elements to appear anywhere.
The basic idea of rabbit/duck grammars is simple: define a set of distinct document grammars whose intersection describes the language to be accepted. In each grammar, each element type is assigned to one of four classes.7
The remainder of the paper is organized as follows. Section 2 provides a more complete definition of rabbit/duck grammars, defines a simple notation for writing them, and shows a simple example, concentrating on the simple case of languages without self-overlapping elements (i.e. languages in which any two overlapping elements are required to have different generic identifiers). Section 3 extends the discussion to languages with self-overlap. Section 4 addresses implementation issues. Section 5 describes some simple applications of rabbit/duck grammars.
I first consider rabbit/duck grammars for languages like Mecs [Huitfeldt 1999], which allow elements with different generic identifiers to overlap each other but prohibit overlap between elements with the same generic identifier.
Informally, one can describe the application of rabbit/duck grammars to such vocabularies thus. First, form groups of elements by assigning each element in the vocabulary to one or more groups, such that no two elements in the same group should ever overlap in a valid document instance. For example, if the vocabulary includes the elements doc (document), ch (chapter), sec (section), h (heading, title of chapter or section), p (paragraph), hi (highlighted phrase), vol (volume), page, and tl (typographic line), with the obvious meanings, then vol, page, and tl might go into one group, and the other elements into a second.
Second, define a document grammar for each group. For purposes of this paper, I will use an extension of DTD notation in which content models can contain primitive tokens of the following forms (where g is a generic identifier):
In addition, by analogy with XML Schema, I allow occurrence indicators to take the form of integer ranges, so that for any expression E one can write E { n , m }, where n is a non-negative integer and m is either a positive integer or the character “∞”8 The usual occurrence indicators of DTDs are also used.
Further extensions allow second-, third-, and fourth-class elements to be defined: instead of content models, they have the keywords MILESTONE-TAGS, IGNORE-TAGS, and IGNORE, respectively. In the examples given here, I adopt the convention that any element g mentioned in a primitive token of the form #stag( g ), #etag( g ), or #tag( g ) is a second-class element and need not be declared explicitly. Elements neither declared nor mentioned in a content model will be assumed to be third-class elements.
For our simple illustration, the page-oriented grammar might be
<!GRAMMAR vol [ <!ELEMENT vol (#stag(doc)?, page+, #etag(doc)?) > <!ELEMENT page (tl+) > <!ELEMENT tl (#PCDATA) > ]>
The logical-structure grammar might be
<!GRAMMAR doc [
<!ELEMENT doc (#stag(vol)?,
(p, #tag(page)*)+,
(ch, #tag(page)+)+,
#etag(vol)?) >
<!ELEMENT ch (#tag(page)*,
h,
(((p, #tag(page)*)+,
(sec, #tag(page)*)*)
| (sec, #tag(page)*)+)) >
<!ELEMENT sec (#tag(page)*,
h,
(((p, #tag(page)*)+,
(sec, #tag(page)*)*)
| (sec, #tag(page)*)+)) >
<!ELEMENT h (#PCDATA | hi | #tag(tl))* >
<!ELEMENT p (#PCDATA | hi | #tag(tl)
| #tag(page))* >
<!ELEMENT hi (#PCDATA | #tag(tl)
| #tag(page))* >
]>Formally, a rabbit/duck grammar can be viewed as a triple (G, D, S), where
content-model ::= group | counted
group ::= seq | or | all | conj
counted ::= group counter
expr ::= token | group | counted
seq ::= '(' expr (',' expr)* ')'
or ::= '(' expr ('|' expr)+ ')'
all ::= '(' expr ('&' expr)+ ')'
and ::= '(' expr ('∧' expr)+ ')'
counter ::= '+' // '+' = {1,∞}
| '*' // '*' = {0,∞}
| '?' // '?' = {0,1}
| '{' min ',' max '}'
| group '{' count '}'
min ::= '0' | count
count ::= POSITIVE_INTEGER
max ::= count | '∞'
token ::= basic-token
| counted-token
counted-token ::= basic-token counter
token ::= first_class
| second_class
| '#PCDATA'
first_class ::= GI
second_class ::= '#stag(' GI ')'
| '#etag(' GI ')'
| '#tag(' GI ')'In a clean rabbit/duck grammar, every element type is declared once (explicitly or implicitly), every generic identifier used in a right-hand side as a first-class token is that of a first-class element, and every generic identifier used in a second-class token is that of a second-class element (except in overlap-groups, as specified below in section 3).
Undefined element types, unused element types, non-productive declarations, and so on can be defined for rabbit/duck grammars in the same way as for context-free grammars in general.11 In what follows, I will without loss of generality restrict my attention to clean grammars in which none of these deficiencies occur.
In a full set of rabbit/duck grammars, each grammar is defined over the same vocabulary G, and each generic identifier in G is first-class in at least one grammar.
Some technical commentary is in order, on the four ways a grammar can treat elements and their start- and end-tags.
An example may help clarify things. Consider the opening lines of Peer Gynt used as a running example in [Sperberg-McQueen / Huitfeldt 2000]:
Åse: Peer, du lyver!
Peer Gynt : [uten å stanse] Nei, jeg gjør ei!
Åse: Nå, så bann på det er sant!
Peer Gynt: Hvorfor banne?
Åse: Tvi, du tør ei! Alt i hop er tøv og tant!
In English:
AASE: Peer, you're lying!
PEER GYNT : [without stopping] No, I'm not!
AASE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
AASE: See, you dare not! Every word of it's a lie.
In the dramatic hierarchy, we want a hierarchy of play / act / scene, with scenes containing speeches and stage directions (speech, stage).
In the metrical hierarchy, we want play / act / scene, with scenes containing verse lines (L).
The dramatic hierarchy can be defined thus:
<!--* Grammar D *-->
<!GRAMMAR play [
<!ELEMENT play (act+) >
<!ELEMENT act (scene+)
<!ELEMENT scene (sp | stage | #tag(L))* >
<!ELEMENT sp (#PCDATA | stage | #tag(L))* >
<!ATTLIST sp
who CDATA #REQUIRED >
<!ELEMENT stage (#PCDATA) >
]>
The verse structure can be defined this way:
<!--* Grammar V *--> <!GRAMMAR play [ <!ELEMENT play (act+) > <!ELEMENT act (scene+) <!ELEMENT scene (L | stage | #tag(sp) )* > <!ELEMENT L (#PCDATA | stage | #tag(sp) )* > <!ELEMENT stage (#PCDATA) > ]>
In this view, a scene of a verse drama is a sequence of verse lines interspersed with stage directions. If we wish to ignore the stage directions entirely, we can do that, too:
<!ELEMENT scene (L | #tag(sp) )* > <!ELEMENT stage IGNORE >
In TexMecs notation, this fragment can be marked up this way:13
<play| <* Peer Gynt *>
<act n="1"|
<scene n="i"|
<sp who="Aase"|<l|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage>
No, I'm not!|l>|sp>
<sp who="Aase"|<l| Well then, swear to me it's true! |l>|sp>
<sp who="Peer"|<l| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |l>
<l| Every word of it's a lie! |l> |sp>
<* ... more dialogue ... *>
|scene>
<* ... more scenes ... *>
|act>
<* ... more acts ... *>
|play>
If we associate a view of the document (in the database sense) with each grammar, the difference between the two variants of the verse grammar just given is the difference between a view in which stage directions are present:
<l| Peer, you're lying! <stage|without stopping|stage> No, I'm not! |l> <l| Well then, swear to me it's true! |l> <l| Swear? Why should I? See, you dare not! |l> <l| Every word of it's a lie! |l>
<l| Peer, you're lying! No, I'm not! |l> <l| Well then, swear to me it's true! |l> <l| Swear? Why should I? See, you dare not! |l> <l| Every word of it's a lie! |l>
For metrical studies, the second view may be more convenient than the first. It's easy enough to transform the first into the second, but some vocabulary designers may also find it easier to write grammars like the second, in which they can completely ignore elements in the other hierarchies. (When all elements in a grammar are first- or third-class, rabbit/duck grammars are very similar to CONCUR.)
Two important conditions are established and maintained by the rules given above for all documents valid against a clean set of rabbit/duck grammars.
For the cases covered by the current rules, it seems plausible to believe (1) that any document for which the two general conditions are true should count as valid for a given set of rabbit/duck grammars, and (2) that no document for which the general conditions do not hold should count as valid. The conditions, that is, seem to be both sufficient and necessary conditions for validity against a full set of clean rabbit/duck grammars.
The rules defined so far handle all overlap as the overlaying of the tree defined by one grammar with the tree defined by another. They cannot handle self-overlap, because in self-overlap both element instances are first-class elements in the same grammars. But the rules do suffice to cover most cases of concurrent hierarchies in which overlap stems from the tagging of multiple analytic perspectives, and when applied to such cases, the rules given so far appear to produce intuitively plausible results. So the conditions may serve to guide us in attempting to extend the mechanism to cover cases of self-overlap: what is needed is a way of handling self-overlap that satisfies both of them.
If within a given language, instances of some particular element type b can overlap other instances of element type b, then we wish to write one or more of the grammars in a set of rabbit/duck grammars in such a way as to ensure that
For example, consider the following XML document, written with Trojan Horse milestones for b (later examples will use a more compact notation):
<doc> aaa <b sID="b1"/> bbb <b sID="b2"/> ccc <b eID="b1"/> ddd <b eID="b2"/> eee </doc>
<doc> aaa <b> bbb <b sID="b2"/> ccc </b> ddd <b eID="b2"/> eee </doc>
<doc> aaa <b sID="b1"/> bbb <b> ccc <b eID="b1"/> ddd </b> eee </doc>
It is desirable to have the same control over overlap as we have for other items in a content model, so we wish to be able to control where it can and cannot occur. The following sections outline a way to accomplish this.
To handle languages in which elements with the same generic identifier can overlap each other, a few modest changes suffice. We need (1) a syntactic mechanism for matching start- and end-tags, (2) a new kind of token which we will call an overlap-token, and (3) a new kind of group for use in content models, the overlap-group.
First, since elements don't necessarily nest, some syntactic mechanism is needed for matching start- to end-tags. The generic identifier alone no longer suffices. The examples shown here will use the co-indexing syntax of TexMecs: a tilde and an arbitrary string of characters will decorate the generic identifier, and decorated start- and end-tags match only other tags with the same decoration. So in the document
<doc|<s~1|aaa<s~2|bbb|s~1>ccc|s~2>|doc>
<doc| aaa <b~b1| bbb <b~b2| ccc |b~b1> ddd |b~b2> eee |doc>
Second, a new kind of token is required, namely an overlap-token:
overlap-token ::= '~' GI '~'
Third, a new type of content-model expression is required, an overlap group:
group ::= '#overlap(' expr ')'In any normal overlap-group, there will be at least one overlap-token (~ g ~) and at least one #tag( g ) (or at least one #stag( g ) and at least one #etag( g )) token for the same generic identifier. Together, these show the two ways in which every occurrence of g within the scope of the overlap-group must be read. Every occurrence of g must be a first-class element in (at least) one reading, and a pair of milestones (i.e. a second-class element) in all other readings.14
If #overlap(α) is an overlap-group with an overlap-token for g, then we can use α1 to denote the reading of the expression in which the first g is first-class and any others are second-class, α2 to denote the reading in which the second g is first-class, and so on. Then if there is just one g in the relevant part of the document instance, the meaning of the overlap-group is (α1). If there are two overlapping g elements, the meaning of the overlap-group is (α1 ∧ α2), where the connector “∧” denotes conjunction. When F and G are expressions, the content-model expression (F ∧ G) denotes the set of sequences which are members both of F and of G. When F and G are readings, the expression (F ∧ G) denotes the set of sequences which can be read both as F and as G. So (α1 ∧ α2) denotes the set of expressions which can be read both as α1 and as α2 For three overlapping g elements, the meaning is (α1 ∧ α2 ∧ α3). And so on. The meaning of an overlap-group, formally, is thus the infinite disjunction ((α1) ∨ (α1 ∧ α2) ∨ (α1 ∧ α2 ∧ α3) ∨ ...)
The two general conditions identified above (section 2-3) are preserved by this interpretation of overlap-groups: the grammar rules for the parent apply separately for each overlapping element matching the overlap-token, as do those for the overlapping element itself. And each overlapping instance is covered by a declaration and produces a reading in which it is a first-class element.
A simple example of overlap might involve pages (tagged p) deletions (d), and other elements (a, b). Deletions can overlap other deletions (i.e. element d may self-overlap), but no other overlaps occur in this group of element types. We can define the grammar thus:
<!GRAMMAR p [
<!ELEMENT p #overlap( (
(#PCDATA | a | #tag(d))*,
~d~,
(#PCDATA | b | #tag(d))* )? ) >
<!ELEMENT d (#PCDATA | a | b | #tag(d))*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
]>Given a document instance
<p| <a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b> |p>
A second example illustrates slightly tighter constraints: d elements contain not an unordered mix of a and b elements, but an ordered sequence:
<!GRAMMAR p [
<!ELEMENT p (h*,
#overlap((a | #stag(d))*,
~d~,
(c | #etag(d))*),
f*) >
<!ELEMENT d ((a | #tag(d) | ~d~)*,
b,
(b | #tag(d) | ~d~)+,
(c | #tag(d) | ~d~)*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT f (#PCDATA) >
<!ELEMENT h (#PCDATA) >
]><p| <a><d~1|<a><d~2|<b>|d~1><c>|d~2><c> |p>
In the preceding sections I have endeavored to give a purely declarative account of rabbit/duck grammars and their meaning, without any appeal to particular parsing techniques. But of course a grammatical formalism is more likely to be useful in practice if the constraints it expresses can actually be enforced at reasonable implementation and runtime cost.
One implementation technique is straightforward for languages without self-overlap:
Alternatively, one can dispense with the filtering step by building a parser for each grammar, in which the lexical scanner takes care of distinguishing among first- and second-class tags, suppressing third-class tags, and suppressing fourth-class elements. The document can be validated against a set of rabbit/duck grammars by parsing it once for each grammar.
It is also possible to validate an input document against a set of rabbit/duck grammars in a single pass; one way to do that is described below. It involves some unconventional extensions of Brzozowski derivatives; these are introduced piecemeal, in an attempt to ease comprehension. The exposition begins with a description of one way to check the well-formedness and validity of XML (section 4-1) using Brzozowski derivatives. These are relatively simple and straightforward illustrations of the mechanism. Section 4-2 discusses validation of TexMecs languages without self-overlap, and finally section 4-3 describes validation of languages which allow self-overlap.
Brzozowski derivatives were introduced in 1964 by Janusz Brzozowski [Brzozowski 1964]. The derivative D of a set L with respect to some input string s, written D s L, is the set of strings which can follow s in words of the language L. Brzozowski showed, given a regular expression E denoting L, how to construct a regular expression for D s E, which makes possible an algebraic approach to recognizing members of the language. Less formally, given a regular expression and a (prefix) string, the derivative (of the expression, with respect to the string) tells us what can come next. In what follows, it will be important to recall that if one of the things which can come next is ε, the empty string, then the string s is itself a member of the language defined by E. An expression E whose language includes ε is said to be nullable, and the function ν(E) is true or false accordingly. In what follows, successful parses will normally be those which end with a derivative which is nullable.17
Brzozowski derivatives are conventionally used to recognize or reason about regular languages. But with a couple of extensions, they can also be used to parse well-formed XML.
The first extension is to define an expression language slightly different from regular expressions as defined either conventionally or by Brzozowski. I shall call these language expressions rather that regular expressions, because the languages they denote are not necessarily regular. Each expression denotes a set of strings, and may be said to match any string in that language. The language expressions of interest here may take any of the following forms:
The second extension is to define an infinite input alphabet. Assume that the input stream is a sequence of symbols each of which is one of the following:
As usual, the derivative of an expression E with respect to a sequence of symbols x 1, x 2, x 3, ..., x n ,is found by finding the derivative of E with respect to x 1, then the derivative of D x 1 E with respect to x 2, and so on. Formally,
D s E = D <x 1, x 2, ... x n > E = D x n (... (D x 3 (D x 2 (D x 1 E))) ...)
A sequence of input symbols d is well-formed XML if and only if D d X is nullable, when the rules for derivatives are as described below.
The third extension is the the derivative of a primitive language expression with respect to an input symbol may be something other than ε or ∅. Derivatives of compound expressions (concatenation, disjunction, conjunction) are given in [Brzozowski 1964] and need not be repeated here. [Sperberg-McQueen 2005] defines derivatives for interleave.
When x = stag( h ), then
When x = etag( h ), then
When x = pcdata(), then
Examples of using these rules to check XML well-formedness are given in appendix 7-1.
The reader will note that the derivative, which represents the expected sequence of tokens to come, is being used as a stack: when we encounter a start-tag in the input which matches an elem( g ) in the expression being matched, we push (W etag( g )) onto the stack by placing it at the front of the derivative.
This method can be applied to validation as well, by changing just one rule. When stag( g ) in the input matches the expression elem( g ), the rules just given make the derivative be (W etag( g )): in well-formed XML, what follows the start-tag of any element is well-formed XML followed by the end-tag of that element. To validate against a DTD, all we need to do is to replace W in the derivative with the content model of the element. If we write the expression “cm( g )” to denote the content model of element type g, and “[[cm( g )]]” to denote the value of that expression (i.e. to indicate that the expression should be replaced by its value), the rule becomes:
For example, consider a vocabulary for serializing binary trees whose definition (in DTD syntax) is:
<!ELEMENT tree ((tree, tree) | leaf) > <!ELEMENT leaf (#PCDATA) >
<tree><leaf>42</leaf></tree>
We start with the expression elem(tree) and validate the input stream “stag(tree), stag(leaf), pcdata(), etag(leaf), etag(tree)”:
Further examples of validation using Brzozowski derivatives are given in appendix 7-2.
The technique just described can be used without major changes to validate documents with overlap against individual rabbit/duck grammars which have no overlap-groups. The only change is that the implementation must take care to treat fourth-class elements and tags for third-class elements appropriately.
We can validate a document against a whole set of rabbit/duck grammars rather than just individual grammars with parsing mechanism which keeps track of the current derivative in each grammar and for each token calculates the next derivative for each grammar. Perhaps the simplest way to acquire such a mechanism is to connect the derivatives for the different languages using logical conjunction. Thus if the current validation state of grammar 1 is denoted E 1, and the state of grammar 2 is E 2, etc., then for n grammars we will have a single expression of the form
E 1 ∧ E 2 ∧ ... ∧ E n
It is necessary, however, to keep track of which expressions come from which grammars, so as to treat third- and fourth-class elements appropriately. One way to do this is to assign a number to each grammar and extend the expression language so as to handle the necessary bookkeeping. The foundation of the expression language are the content model expressions defined above in section 2-1; for purposes of validation we extend that expression language as follows.
First, we subscript or mark each reference to a first-class element in a grammar with the number of that grammar. In written form, this can be represented as a subscript to the generic identifier, so that instead of being written (a, (b | #tag(c)), d), a content model in grammar 1 would be written (a 1, (b 1 | #tag(c)), d 1), and similarly for grammar 2, 3, and so on. The rule for start-tags matching elements can then be rewritten
Second, we add two new types of group expressions. Neither of these will normally occur in user-written content models; they will be used only as the result of taking a derivative from some other expression.
group ::= g-tag | spinner
g-tag ::= '#gr(' POSITIVE_INTEGER ')' content-model
spinner ::= '#spin(' POSITIVE_INTEGER ')'
'(' list ')' content-model
list ::= empty | GI ' / ' list
empty ::= ε // nothing, the empty sequence
token ::= 'ε'
| '∅'
An expression of the form #gr( n )( E ) labels expression E as having come from grammar number n. The rules for taking the derivative D x #gr( n )( E ) are these: