Rabbit/duck grammars: a validation method for overlapping structures

C. M. Sperberg-McQueen

Abstract

Recent work on markup has emphasized the importance of overlapping structures, but little progress has been made toward validation of such structures: the state of the art in validation of overlap remains where it was in 1986 with the CONCUR feature of SGML.

Like CONCUR, rabbit/duck grammars manage validation by means of several context-free grammars, but rabbit/duck grammars offer some improvements over CONCUR: within a document grammar written for one hierarchical subset of the markup vocabulary, the start- and end-tags of other hierarchies may be made visible; rabbit/duck grammars thus make it possible to constrain the interactions of different hierarchies in ways not possible with CONCUR. Rabbit/duck grammars also provide a more principled account of self-overlap (i.e. overlap of two elements with the same generic identifier) than CONCUR.

Documents can be validated against rabbit/duck grammars either by validating individually against each grammar or by a single pass through the document which validates against all grammars simultaneously. An implementation of this single-pass approach using Brzozowski derivatives is sketched out.

Keywords: Validating; Modeling; Concurrent Markup/Overlap

C. M. Sperberg-McQueen

C.M. Sperberg-McQueen is a member of the technical staff at the World Wide Web Consortium; he serves on the W3C XML Schema Working Group and chairs the XML Coordination Group.

Rabbit/duck grammars: a validation method for overlapping structures

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and Artificial Intelligence Laboratory]

Extreme Markup Languages 2006® (Montréal, Québec)

Copyright © 2006 C. M. Sperberg-McQueen. Reproduced with permission.

This document introduces rabbit/duck grammars, a method of validating overlapping structures.1

Introduction

Overlap

One of the first things taught to students in courses on SGML or XML is that elements nest within other elements, thus forming a hierarchical structure or tree. One of the first questions asked by alert students is: but what about non-hierarchical structures in the information? What about overlapping structures?

It is thus no surprise that the development and spread of SGML and XML have been accompanied from the beginning by a vigorous discussion of the limits of tree structures and the best ways of dealing with non-hierarchical information; see, for example, [Barnard et al. 1988], [Renear et al. 1993], [Barnard et al. 1995], [Huitfeldt 1995], [Murata 1995], [Durand et al. 1996]. Out-of-line or standoff markup [Dybkjær et al. 1998], fragmentation of elements, milestone elements, virtual elements (e.g. [Barnard et al. 1988], [ACH/ACL/ALLC 1994], [Barnard et al. 1995]), concurrent structures [ISO 1986] [Sperberg-McQueen / Huitfeldt 1999], MECS [Huitfeldt 1999], parallel encoding (e.g. [Witt 2004], [Hilbert et al. 2005], [Dekhtyar / Iacob 2005]), bottom-up virtual hierarchies [Durusau / O'Donnell 2002a], layered markup and annotation (LMNL) [Tennison / Piez 2002], range algebra [Nicol 2002], just-in-time trees [Durusau / O'Donnell 2002b], multi-colored trees [Jagadish et al. 2004], tables [Durusau / O'Donnell 2004], TexMecs [Huitfeldt / Sperberg-McQueen 2001], Goddag structures [Sperberg-McQueen / Huitfeldt 2000], Trojan horses [DeRose 2004] — the world has not been starved of proposals for markup systems that handle overlap more or less gracefully (for some sense of the word handle). The discussion has been vigorous both with respect to the serial form to be assumed by the data and with respect to the abstractions or data structures around which access to and operations on the data are to be organized.

On the topic of validation, however, the discussion has been remarkably quiet. The state of the art appears to be today just what it was twenty years ago, when ISO 8879 defined the validity conditions for markup using the SGML feature CONCUR. Concurrent markup defines multiple trees over the same frontier2, each obeying its own DTD; ISO 8879 does not go into details on the subject, but the document may conveniently be regarded as valid if and only if it is valid against each of its DTDs.

In general, discussions of overlap say nothing at all about validation, but those mechanisms (e.g. multi-colored trees, XCONCUR) which like CONCUR treat overlap as the interplay of multiple trees (possibly over the same frontier, as in CONCUR, and possibly over different frontiers, as in multi-colored trees) can in general be validated by validating each of their trees in isolation against an appropriate document grammar.

Some problems with tree-by-tree validation

Tree-by-tree validation provides no mechanism for constraining the interactions of the trees: each tree is validated in majestic and lonely isolation. Three forms of interaction in particular are worth noting.

  • First, the vocabulary designer may wish to require that certain nodes in one tree also be nodes in another tree. In an encoding of verse drama, for example, one will normally wish to stipulate that the act and scene elements of the dramatic hierarchy must in all their instances be identical to instances of the act and scene elements of the verse hierarchy.
  • Second, the designer may wish to restrict the ways in which element boundaries in different trees line up. It is not possible, in any traditional verse drama, for lines of verse to begin or end in the middle of a stage direction or speaker indication, but if the element for verse lines and the elements for stage directions and speaker attributions are assigned to different hierarchies, this constraint is impossible to state. More prosaically, one might wish to perform some simple sanity checks on a data stream with both a logical and a formatted view of the document, such as: there should be a page break immediately before each top-level division; there should be no page breaks in the middle of any heading of any kind; etc.
  • Third, it may be desirable to make some elements in the document effectively invisible to some trees. A hierarchical view of a text in terms of pages, blocks, and other formatting objects has relatively little use for a metadata header full of non-printing information relevant for controlling document work flows. The formatted view can be defined to include such information and assign it the property of being invisible, but the formatted-view grammar is less cluttered if the metadata can be invisible to the grammar as well as being invisible on the page. Similarly, users attempting to write a document grammar to describe verse drama are sometimes unpleasantly surprised to discover that they cannot simply ignore the stage directions and speaker attributions, which play a crucial role in the dramatic hierarchy but which seem to have no business in the verse hierarchy. When using CONCUR, however, if stage directions are omitted from the verse hierarchy, their tags are invisible to the verse tree, but their content is visible, so that a line of verse may appear to include extraneous prose material inserted at what look like arbitrary points. The tagging for stage directions must be visible in the verse hierarchy in order that verse-oriented applications can successfully ignore the stage directions.

In the search for a method of validating overlapping structures, several goals may be identified:

  • The method should be powerful enough to do useful work.
  • It need not, however, be universal or Turing-complete. In the interests of being able to reason about the mechanism, Turing completeness should be avoided if possible.
  • It should provide a clear comprehensible method of expressing constraints on overlapping structures. This is not easy to judge at first exposure to any formalism; the ease with which an idea is first grasped is not the same as the clarity given by that idea when it is understood.
  • It should be be built around simple core ideas; the algorithms involved should be simple; the notation should be simple. Again, this is not easy to judge without more experience.
  • Ideally, it should be possible to validate both data streams and graph data structures representing the overlapping structures of interest. DTDs and most XML schema languages have this property: they can be used to check both an XML document in serialized form and a tree derived from such an XML document.

The method outlined in this paper for writing document grammars and validating documents against them, called rabbit/duck grammars, is designed to address these issues.

Some related ideas

Some precursors of the idea are worth noting; they may help make it clearer.

The first idea is language intersection. Formal language theory treats languages as sets of strings; the intersection of two languages L 1 and L 2 is just the set of strings which appear both in L 1 and in L 2. Other set operations (e.g. union and complementation) are also possible.

Discussions of formal language theory routinely refer to the important results which establish that regular languages are closed under the operations of union, intersection, and complementation, and that the set of context-free languages is not closed under intersection. An example often cited is this: consider the languages L 1 and L 2, where L 1 is

S → B | a S c
B → ε | b B
and L 2 is
S → a S | B
B → ε | b B c
If we write a n to denote n occurrences of a, then L 1 is the language a n b * c n , which has some occurrences of a, then of b and c, with the same number of a as of c. L 2 is a * b n c n , which has the same number of b as of c. The intersection is the language a n b n c n , which has the same number of a, b, and c and is one of the best known of non-context-free languages.

This example can be (and, I think, normally is) taken as an indication that set operations are ‘safe’ when performed on regular languages but unsafe on context-free languages. Safe, since they cannot have a result which is not itself a regular language; unsafe, since they clearly increase the expressive power of the formalism.

A few years ago, however, Lars Johnsen (a collaborator on the MLCD project) suggested in a workshop that the facts here could be given a more positive interpretation. While it is in general difficult to tell in reasonably bounded time whether a given string is a member of an arbitrary context-sensitive language, the example just given demonstrates that there are at least some context-sensitive languages for which the test can be performed in the same time required to parse context-free languages, just by parsing the string twice against different context-free grammars; if both accept it, then the context-sensitive language also accepts it. Rabbit/duck grammars apply this idea to document grammars: while in general languages like TexMecs require more than context-free power, they can be factored into a finite set (often a small one) of mostly conventional document grammars.3

A second related idea is the SGML feature CONCUR, which rabbit/duck grammars resemble in some ways. In SGML using CONCUR,4 overlap is handled by grouping elements into trees. Each tag is marked with the identifiers of the trees it belongs to; tags belonging to other trees are ignored.5

Rabbit/duck grammars resemble CONCUR in analysing overlap for the most part as involving distinct trees which share frontiers and some internal structures, and in using a set of several context-free grammars as the basis of validation. They differ from CONCUR in allowing each grammar to refer to tags for elements which are not part of the tree being validated, and in providing for elements which overlap other elements with the same generic identifier.

The closest analogue I am aware of to rabbit/duck grammars, however, is an idea formulated in February 1992 by David Megginson on the Netnews discussion list comp.text.sgml [Megginson 1992].

Megginson's idea was to use end-tag omission, together with the fact that in SGML empty elements look the same as start-tags, to allow the same SGML data stream to be parsed in either of two ways. He provides two DTDs. In one, the elements recto, verso, line, and clause are empty while the elements poem, fit, verse, averse, and bverse have content. In the other DTD, the roles are reversed. Whichever grammar is used for parsing and processing, one hierarchical view is dominant and the other recessive, but Megginson's technique allows information about (the start of) elements in the recessive view to be visible. Whether the tag recto denotes an empty milestone or the beginning of an element with content depends on the grammar; changing grammars performs a sort of Gestalt switch.

Rabbit/duck grammars use a similar sort of Gestalt switch (although the technical details are different), and are named for the well known image used in many discussions of Gestalt perception.6

Figure 1: Figure 1: Rabbit / Duck
[Link to open this graphic in a separate page]

A figure which could be the head of a rabbit facing right or of a duck facing left.

Rabbit/duck grammars resemble Megginson grammars in using multiple document grammars and treating the same tag as a start-tag in some and a milestone in others; they differ in some ways made necessary by syntactic differences between the SGML reference concrete syntax and XML. Megginson also did not contemplate restricting the interactions of the two grammars; he used inclusion exceptions to allow milestones for recessive elements to appear anywhere.

Rabbit/duck grammars

The basic idea of rabbit/duck grammars is simple: define a set of distinct document grammars whose intersection describes the language to be accepted. In each grammar, each element type is assigned to one of four classes.7

  • First-class elements (‘normal’ elements) are treated as non-terminals of the grammar, just as in DTDs and conventional schema languages.
  • Second-class elements (‘milestones’) are also visible in the grammar, in a limited way. When second-class elements are encountered, their start- and end-tags are treated for purposes of parsing as if they were empty milestone elements; thus the elements themselves are not visible, but their boundaries are.
  • Start- and end-tags of third-class (‘transparent’) elements are ignored completely (like the tags of other document types in CONCUR). The content of third-class elements, however, is visible.
  • Fourth-class (‘opaque’) elements are ignored completely; neither their tags nor their content is checked against the grammar.
In a complete set of rabbit/duck grammars, each element type in the vocabulary is a first-class element in at least one grammar. A document is valid against a set of rabbit/duck grammars if and only if it is valid against each grammar in the set; the language defined by the set of grammars (viewed as a set of data streams) is thus the intersection of the languages defined by the members of the set.

The remainder of the paper is organized as follows. Section 2 provides a more complete definition of rabbit/duck grammars, defines a simple notation for writing them, and shows a simple example, concentrating on the simple case of languages without self-overlapping elements (i.e. languages in which any two overlapping elements are required to have different generic identifiers). Section 3 extends the discussion to languages with self-overlap. Section 4 addresses implementation issues. Section 5 describes some simple applications of rabbit/duck grammars.

Vocabularies without self-overlap

I first consider rabbit/duck grammars for languages like Mecs [Huitfeldt 1999], which allow elements with different generic identifiers to overlap each other but prohibit overlap between elements with the same generic identifier.

Description

Informally, one can describe the application of rabbit/duck grammars to such vocabularies thus. First, form groups of elements by assigning each element in the vocabulary to one or more groups, such that no two elements in the same group should ever overlap in a valid document instance. For example, if the vocabulary includes the elements doc (document), ch (chapter), sec (section), h (heading, title of chapter or section), p (paragraph), hi (highlighted phrase), vol (volume), page, and tl (typographic line), with the obvious meanings, then vol, page, and tl might go into one group, and the other elements into a second.

Second, define a document grammar for each group. For purposes of this paper, I will use an extension of DTD notation in which content models can contain primitive tokens of the following forms (where g is a generic identifier):

  • #stag( g ) matches a start-tag for g
  • #etag( g ) matches an end-tag for g
  • #tag( g ) matches either a start- or an end-tag for g
These are all used for constraining the locations where the start- and end-tags of second-class elements may appear in valid documents.

In addition, by analogy with XML Schema, I allow occurrence indicators to take the form of integer ranges, so that for any expression E one can write E { n , m }, where n is a non-negative integer and m is either a positive integer or the character “∞”8 The usual occurrence indicators of DTDs are also used.

Further extensions allow second-, third-, and fourth-class elements to be defined: instead of content models, they have the keywords MILESTONE-TAGS, IGNORE-TAGS, and IGNORE, respectively. In the examples given here, I adopt the convention that any element g mentioned in a primitive token of the form #stag( g ), #etag( g ), or #tag( g ) is a second-class element and need not be declared explicitly. Elements neither declared nor mentioned in a content model will be assumed to be third-class elements.

For our simple illustration, the page-oriented grammar might be

<!GRAMMAR vol [
<!ELEMENT vol  (#stag(doc)?, page+, #etag(doc)?) >
<!ELEMENT page (tl+) >
<!ELEMENT tl   (#PCDATA) >
]>
The content model for vol allows the start-tag for the doc element to appear at the beginning of the vol element and the end-tag at the end, but nowhere else. Other tags from the logical structure markup (ch, sec, h, p, hi) are all ignored, because they are implicitly declared third-class.

The logical-structure grammar might be

<!GRAMMAR doc [
<!ELEMENT doc  (#stag(vol)?, 
               (p, #tag(page)*)+, 
               (ch, #tag(page)+)+, 
               #etag(vol)?) >
<!ELEMENT ch   (#tag(page)*, 
               h, 
               (((p, #tag(page)*)+, 
                 (sec, #tag(page)*)*) 
                | (sec, #tag(page)*)+)) >
<!ELEMENT sec  (#tag(page)*, 
               h, 
               (((p, #tag(page)*)+, 
                 (sec, #tag(page)*)*) 
                | (sec, #tag(page)*)+)) >
<!ELEMENT h    (#PCDATA | hi | #tag(tl))* >
<!ELEMENT p    (#PCDATA | hi | #tag(tl) 
               | #tag(page))* >
<!ELEMENT hi   (#PCDATA | #tag(tl) 
               | #tag(page))* >
]>
Here, the interactions between the trees are more tightly constrained. The content model for doc ensures that the volume can begin and end only at the beginning and end of the document, and that page breaks, but not line breaks, may appear between paragraphs and chapters. The expression (ch, #tag(page)+)+ requires that at least one page tag appear after each ch element, thus enforcing the rule that each chapter begins on a new page.9 The identical declarations for ch and sec forbid page breaks between the chapter heading and the first paragraph, but allow page breaks between pages and sections. The declarations for h and p show different degrees of tolerance for typographic boundaries: in headings, only line breaks are allowed, while in paragraphs, line breaks and page breaks may occur.

Formally, a rabbit/duck grammar can be viewed as a triple (G, D, S), where

  • G is a set of generic identifiers (e.g. QNames)
  • SG is a start-symbol, the generic identifier of the document element
  • D is a set of declarations of the abstract form gE, where g is a generic identifier and E is either a content keyword (one of EMPTY, MILESTONE-TAGS, IGNORE-TAGS, or IGNORE) or a content-model expression. Content-model expressions may take the usual forms:10
    content-model ::= group | counted
    group   ::= seq | or | all | conj
    counted ::= group counter
    expr    ::= token | group | counted
    seq     ::= '(' expr (',' expr)* ')'
    or      ::= '(' expr ('|' expr)+ ')'
    all     ::= '(' expr ('&' expr)+ ')'
    and     ::= '(' expr ('∧' expr)+ ')'
    counter ::= '+'   // '+' = {1,∞}
              | '*'   // '*' = {0,∞}
              | '?'   // '?' = {0,1}
              | '{' min ',' max '}' 
              | group '{' count '}'
    min     ::= '0' | count
    count   ::= POSITIVE_INTEGER
    max     ::= count | '∞'
    token   ::= basic-token 
              | counted-token
    counted-token ::= basic-token counter
    
    The primitive content tokens include generic identifiers, the keyword #PCDATA, and the tag-expressions noted above:
    token        ::= first_class
                   | second_class
                   | '#PCDATA' 
    first_class  ::= GI
    second_class ::= '#stag(' GI ')'
                   | '#etag(' GI ')'
                   | '#tag(' GI ')'
The terminal symbol GI denotes anything that can be interpreted as a generic identifier; in practice this will typically be a QName.

In a clean rabbit/duck grammar, every element type is declared once (explicitly or implicitly), every generic identifier used in a right-hand side as a first-class token is that of a first-class element, and every generic identifier used in a second-class token is that of a second-class element (except in overlap-groups, as specified below in section 3).

Undefined element types, unused element types, non-productive declarations, and so on can be defined for rabbit/duck grammars in the same way as for context-free grammars in general.11 In what follows, I will without loss of generality restrict my attention to clean grammars in which none of these deficiencies occur.

In a full set of rabbit/duck grammars, each grammar is defined over the same vocabulary G, and each generic identifier in G is first-class in at least one grammar.

Some technical commentary is in order, on the four ways a grammar can treat elements and their start- and end-tags.

  • The treatment of first-class elements ensures that conventional DTDs and schemas are a simple subcase of the rabbit/duck mechanism: they are rabbit-duck grammars in which all elements are first-class.
  • Second-class elements make it possible to constrain the interrelation of grammars: the grammar author can ensure that lines of verse can begin within or between speeches in a verse drama, and can cross speech boundaries, but lines cannot cross act or scene boundaries, or begin in the middle of a stage direction.
  • Third-class elements make it convenient to say that the start and end points of a particular element type are completely unconstrained by a given grammar, and that the contents of the element should be visible to the grammar. The same effect could be achieved by declaring the element second-class and allowing its start- and end-tags absolutely everywhere in the grammar, but that is error-prone and rather boring. Note that the treatment of recessive tags in CONCUR is the same as the treatment of third-class elements in rabbit-duck grammars.
  • Fourth-class elements make a similar convenience possible: the location of a particular element type is unconstrained by the grammar, and the entire element is invisible to the grammar.

Example: Peer Gynt

An example may help clarify things. Consider the opening lines of Peer Gynt used as a running example in [Sperberg-McQueen / Huitfeldt 2000]:

Åse: Peer, du lyver!

Peer Gynt : [uten å stanse] Nei, jeg gjør ei!

Åse: Nå, så bann på det er sant!

Peer Gynt: Hvorfor banne?

Åse: Tvi, du tør ei! Alt i hop er tøv og tant!

In English:

AASE: Peer, you're lying!

PEER GYNT : [without stopping] No, I'm not!

AASE: Well then, swear to me it's true.

PEER GYNT: Swear? why should I?

AASE: See, you dare not! Every word of it's a lie.

In the dramatic hierarchy, we want a hierarchy of play / act / scene, with scenes containing speeches and stage directions (speech, stage).

In the metrical hierarchy, we want play / act / scene, with scenes containing verse lines (L).

The dramatic hierarchy can be defined thus:

<!--* Grammar D *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+)
<!ELEMENT scene (sp | stage | #tag(L))* >
<!ELEMENT sp    (#PCDATA | stage | #tag(L))* >
<!ATTLIST sp    
          who CDATA #REQUIRED >
<!ELEMENT stage (#PCDATA) >
]>
This explicitly ensures that line breaks only occur as children of speeches or scene, and never within stage directions or between scenes or acts.12 If line breaks should occur only within speeches and not between them, the #tag(L) can be dropped from the declaration for scene.

The verse structure can be defined this way:

<!--* Grammar V *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+)
<!ELEMENT scene (L | stage | #tag(sp) )* >
<!ELEMENT L     (#PCDATA | stage | #tag(sp) )* >
<!ELEMENT stage (#PCDATA) >
]>

In this view, a scene of a verse drama is a sequence of verse lines interspersed with stage directions. If we wish to ignore the stage directions entirely, we can do that, too:

<!ELEMENT scene (L | #tag(sp) )* >
<!ELEMENT stage IGNORE >

In TexMecs notation, this fragment can be marked up this way:13

<play| <* Peer Gynt *>
<act n="1"|
<scene n="i"|
<sp who="Aase"|<l|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping|stage> 
    No, I'm not!|l>|sp>
<sp who="Aase"|<l| Well then, swear to me it's true! |l>|sp>
<sp who="Peer"|<l| Swear? Why should I? |sp>
<sp who="Aase"| See, you dare not! |l>
<l| Every word of it's a lie! |l> |sp>

<* ... more dialogue ... *>
|scene>
<* ... more scenes ... *>
|act>
<* ... more acts ... *>
|play>

If we associate a view of the document (in the database sense) with each grammar, the difference between the two variants of the verse grammar just given is the difference between a view in which stage directions are present:

<l| Peer, you're lying! 
  <stage|without stopping|stage> No, I'm not! |l>
<l| Well then, swear to me it's true! |l>
<l| Swear? Why should I? See, you dare not! |l>
<l| Every word of it's a lie! |l> 
and one in which they are absent entirely:
<l| Peer, you're lying! No, I'm not! |l>
<l| Well then, swear to me it's true! |l>
<l| Swear? Why should I? See, you dare not! |l>
<l| Every word of it's a lie! |l> 
(These transcriptions are a bit casual about white space, for the sake of legibility).

For metrical studies, the second view may be more convenient than the first. It's easy enough to transform the first into the second, but some vocabulary designers may also find it easier to write grammars like the second, in which they can completely ignore elements in the other hierarchies. (When all elements in a grammar are first- or third-class, rabbit/duck grammars are very similar to CONCUR.)

Two general conditions

Two important conditions are established and maintained by the rules given above for all documents valid against a clean set of rabbit/duck grammars.

  1. Every grammar rule is satisfied each time it applies. That is, every declaration in every grammar is satisfied by every element in the document instance to which it applies.
  2. Every element in the document is covered by a declaration in at least one grammar, and (unless it is the outermost element in some grammar) matches a first-class token on the right-hand side of at least one declaration in at least one grammar (namely, the declaration applicable to its parent in that grammar).

For the cases covered by the current rules, it seems plausible to believe (1) that any document for which the two general conditions are true should count as valid for a given set of rabbit/duck grammars, and (2) that no document for which the general conditions do not hold should count as valid. The conditions, that is, seem to be both sufficient and necessary conditions for validity against a full set of clean rabbit/duck grammars.

The rules defined so far handle all overlap as the overlaying of the tree defined by one grammar with the tree defined by another. They cannot handle self-overlap, because in self-overlap both element instances are first-class elements in the same grammars. But the rules do suffice to cover most cases of concurrent hierarchies in which overlap stems from the tagging of multiple analytic perspectives, and when applied to such cases, the rules given so far appear to produce intuitively plausible results. So the conditions may serve to guide us in attempting to extend the mechanism to cover cases of self-overlap: what is needed is a way of handling self-overlap that satisfies both of them.

Vocabularies with self-overlap

If within a given language, instances of some particular element type b can overlap other instances of element type b, then we wish to write one or more of the grammars in a set of rabbit/duck grammars in such a way as to ensure that

  • When an instance of element type b occurs and does not overlap any other b, it is matched by a token in a grammar rule in the usual way.
  • When two b elements overlap each other, the grammar is ambiguous and provides two ways to parse the overlap, one parse in which the first b element is a first-class element and the second is a second-class element, and another parse in which the second b element is first-class and the first b is second-class.
  • When three b elements overlap each other, three parses are generated, one with the first b element as first-class, one with the second, and one with the third; in each parse the other two instances of b are treated as second-class elements.
  • And so on: for n overlapping b elements, the grammar will generate n parses. In each parse a different instance of b will be first-class.

For example, consider the following XML document, written with Trojan Horse milestones for b (later examples will use a more compact notation):

<doc>
 aaa
 <b sID="b1"/>
 bbb
 <b sID="b2"/>
 ccc
 <b eID="b1"/>
 ddd
 <b eID="b2"/>
 eee
</doc>
We wish to validate both element b1 and element b2. That is, we wish this document to be validated both as if it were written
<doc>
 aaa
 <b>
 bbb
 <b sID="b2"/>
 ccc
 </b>
 ddd
 <b eID="b2"/>
 eee
</doc>
and also as if it were written
<doc>
 aaa
 <b sID="b1"/>
 bbb
 <b>
 ccc
 <b eID="b1"/>
 ddd
 </b>
 eee
</doc>

It is desirable to have the same control over overlap as we have for other items in a content model, so we wish to be able to control where it can and cannot occur. The following sections outline a way to accomplish this.

Description

To handle languages in which elements with the same generic identifier can overlap each other, a few modest changes suffice. We need (1) a syntactic mechanism for matching start- and end-tags, (2) a new kind of token which we will call an overlap-token, and (3) a new kind of group for use in content models, the overlap-group.

First, since elements don't necessarily nest, some syntactic mechanism is needed for matching start- to end-tags. The generic identifier alone no longer suffices. The examples shown here will use the co-indexing syntax of TexMecs: a tilde and an arbitrary string of characters will decorate the generic identifier, and decorated start- and end-tags match only other tags with the same decoration. So in the document

<doc|<s~1|aaa<s~2|bbb|s~1>ccc|s~2>|doc>
the first s element (decorated with “~1”) contains the string “aaabbb” and the second contains “bbbccc”. As shown above, in an XML-based system Trojan Horse elements [DeRose 2004] could be used instead. The example shown above in Trojan Horse notation looks like this in TexMecs notation:
<doc|
 aaa
 <b~b1|
 bbb
 <b~b2|
 ccc
 |b~b1>
 ddd
 |b~b2>
 eee
|doc>

Second, a new kind of token is required, namely an overlap-token:

overlap-token ::= '~' GI '~'
Within an overlap-group, an overlap-token indicates a location where an instance of the given element type which is allowed to overlap with another instance of the same element type. When there is no overlap, an overlap-token has the same meaning as a first-class token with the same generic identifier: it matches one element instance of the type named. When there is overlap in the document instance, the overlap-group containing the overlap-token (see next paragraph) is matched against the document instance once for each overlapping element, each time yielding one reading or parse of the relevant portion of the document instance. In each reading, the overlap-token matches a different one of the overlapping element instances; the other overlapping element instances are treated as second-class elements and must match second-class tokens.

Third, a new type of content-model expression is required, an overlap group:

group ::= '#overlap(' expr ')'
If there are no instances of overlap in the document, the overlap group will have the same meaning as the expression it contains. If there are instances of (self-)overlap in the document, then the overlap group will be matched against the relevant part of the document instance one time for each overlapping element. The expr of an overlap expression differs from other expressions in two ways: first, it may contain overlap-tokens (for a single generic identifier), and second, it may (and always will, since otherwise it's unsatisfiable) contain second-class tokens (#tag( g ) and so on) for the first-class element named in the overlap-tokens.

In any normal overlap-group, there will be at least one overlap-token (~ g ~) and at least one #tag( g ) (or at least one #stag( g ) and at least one #etag( g )) token for the same generic identifier. Together, these show the two ways in which every occurrence of g within the scope of the overlap-group must be read. Every occurrence of g must be a first-class element in (at least) one reading, and a pair of milestones (i.e. a second-class element) in all other readings.14

If #overlap(α) is an overlap-group with an overlap-token for g, then we can use α1 to denote the reading of the expression in which the first g is first-class and any others are second-class, α2 to denote the reading in which the second g is first-class, and so on. Then if there is just one g in the relevant part of the document instance, the meaning of the overlap-group is (α1). If there are two overlapping g elements, the meaning of the overlap-group is (α1 ∧ α2), where the connector “∧” denotes conjunction. When F and G are expressions, the content-model expression (FG) denotes the set of sequences which are members both of F and of G. When F and G are readings, the expression (FG) denotes the set of sequences which can be read both as F and as G. So (α1 ∧ α2) denotes the set of expressions which can be read both as α1 and as α2 For three overlapping g elements, the meaning is (α1 ∧ α2 ∧ α3). And so on. The meaning of an overlap-group, formally, is thus the infinite disjunction ((α1) ∨ (α1 ∧ α2) ∨ (α1 ∧ α2 ∧ α3) ∨ ...)

The two general conditions identified above (section 2-3) are preserved by this interpretation of overlap-groups: the grammar rules for the parent apply separately for each overlapping element matching the overlap-token, as do those for the overlapping element itself. And each overlapping instance is covered by a declaration and produces a reading in which it is a first-class element.

Examples of self-overlap

A simple example of overlap might involve pages (tagged p) deletions (d), and other elements (a, b). Deletions can overlap other deletions (i.e. element d may self-overlap), but no other overlaps occur in this group of element types. We can define the grammar thus:

<!GRAMMAR p [
<!ELEMENT p #overlap( (
              (#PCDATA | a | #tag(d))*, 
              ~d~, 
              (#PCDATA | b | #tag(d))* )? ) >
<!ELEMENT d (#PCDATA | a | b | #tag(d))*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
]>

Given a document instance

<p|
 <a><d~1|<a><d~2|<a><b><a>|d~1><b>|d~2><b>
|p>
The rules for interpreting overlap-groups produce two readings of the document. In one of these, element d~1 is the first-class element, and d~2 is second-class:
  • p
    • a
    • d~1
      • a
      • stag(d~2)
      • a
      • b
      • a
    • b
    • etag(d~2)
    • b
In the other, the roles are reversed:
  • p
    • a
    • stag(d~1)
    • a
    • d~2
      • a
      • b
      • a
      • etag(d~1)
      • b
    • b
Note that the content model for p must contain both the overlap-token ~d~ and the second-class tokens #tag(d). Without the latter, neither reading would be valid, since there would be unmatched milestones. The content model for d must similarly contain locations where start- and end-tags for other instances of d may occur (if they did not occur within d, the elements would not be overlapping).15

A second example illustrates slightly tighter constraints: d elements contain not an unordered mix of a and b elements, but an ordered sequence:

<!GRAMMAR p [
<!ELEMENT p (h*, 
            #overlap((a | #stag(d))*, 
                     ~d~, 
                     (c | #etag(d))*),
             f*) >
<!ELEMENT d ((a | #tag(d) | ~d~)*, 
             b,
             (b | #tag(d) | ~d~)+, 
             (c | #tag(d) | ~d~)*) >
<!ELEMENT a EMPTY >
<!ELEMENT b EMPTY >
<!ELEMENT c EMPTY >
<!ELEMENT f (#PCDATA) >
<!ELEMENT h (#PCDATA) >
]>
The instance given above is not valid according to this grammar, but the following is:
<p|
 <a><d~1|<a><d~2|<b>|d~1><c>|d~2><c>
|p>

Implementation issues

In the preceding sections I have endeavored to give a purely declarative account of rabbit/duck grammars and their meaning, without any appeal to particular parsing techniques. But of course a grammatical formalism is more likely to be useful in practice if the constraints it expresses can actually be enforced at reasonable implementation and runtime cost.

One implementation technique is straightforward for languages without self-overlap:

  1. Use a simple tokenizer to read the input document and produce a stream of tags and data-content chunks.
  2. For each grammar, create a filter which passes through all tags for first-class elements, translated to XML form. Second-class tags are translated into XML milestone elements. Third-class tags are suppressed entirely; fourth-class elements are suppressed entirely (both tags and content).16
  3. Translate each grammar in the set into any conventional grammar-based schema language.
  4. Validate the resulting XML document against the schema created in the previous step.
Repeat this process once for each grammar in the full set of rabbit/duck grammars. If no grammatical errors are found, the input document is valid.

Alternatively, one can dispense with the filtering step by building a parser for each grammar, in which the lexical scanner takes care of distinguishing among first- and second-class tags, suppressing third-class tags, and suppressing fourth-class elements. The document can be validated against a set of rabbit/duck grammars by parsing it once for each grammar.

It is also possible to validate an input document against a set of rabbit/duck grammars in a single pass; one way to do that is described below. It involves some unconventional extensions of Brzozowski derivatives; these are introduced piecemeal, in an attempt to ease comprehension. The exposition begins with a description of one way to check the well-formedness and validity of XML (section 4-1) using Brzozowski derivatives. These are relatively simple and straightforward illustrations of the mechanism. Section 4-2 discusses validation of TexMecs languages without self-overlap, and finally section 4-3 describes validation of languages which allow self-overlap.

Checking XML using Brzozowski derivatives

Brzozowski derivatives were introduced in 1964 by Janusz Brzozowski [Brzozowski 1964]. The derivative D of a set L with respect to some input string s, written D s L, is the set of strings which can follow s in words of the language L. Brzozowski showed, given a regular expression E denoting L, how to construct a regular expression for D s E, which makes possible an algebraic approach to recognizing members of the language. Less formally, given a regular expression and a (prefix) string, the derivative (of the expression, with respect to the string) tells us what can come next. In what follows, it will be important to recall that if one of the things which can come next is ε, the empty string, then the string s is itself a member of the language defined by E. An expression E whose language includes ε is said to be nullable, and the function ν(E) is true or false accordingly. In what follows, successful parses will normally be those which end with a derivative which is nullable.17

Well-formedness checking

Brzozowski derivatives are conventionally used to recognize or reason about regular languages. But with a couple of extensions, they can also be used to parse well-formed XML.

The first extension is to define an expression language slightly different from regular expressions as defined either conventionally or by Brzozowski. I shall call these language expressions rather that regular expressions, because the languages they denote are not necessarily regular. Each expression denotes a set of strings, and may be said to match any string in that language. The language expressions of interest here may take any of the following forms:

  • elem( g ) matches an element with generic identifier g.
  • stag( g ) matches a start-tag for an element with generic identifier g.
  • etag( g ) matches an end-tag for an element with generic identifier g.
  • #PCDATA matches a sequence of zero or more characters. It is intrinsically nullable.
  • W denotes a string of well-balanced XML; i.e., it is (#PCDATA | elem( g ))*, where g need not be the same for all elements matched. When parsing well-formed XML, W can be used to represent the expected content of any element.
  • X matches an arbitrary XML document; i.e., it is elem( g ) for any generic identifier g
  • ε matches the empty string.
  • ∅ matches nothing (it denotes the empty set).
  • If F and G are expressions, then (F G) is an expression denoting the set of strings formed by concatenating a string in F with one in G.
  • If F and G are expressions, then (F | G) is an expression denoting the union of F and G.
  • If F and G are expressions, then (F & G) is an expression denoting the set of strings formed by interleaving a string in F with a string in G.18
  • If F and G are expressions, then (FG) is an expression denoting the intersection of F and G.

The second extension is to define an infinite input alphabet. Assume that the input stream is a sequence of symbols each of which is one of the following:

  • sole( g ) denotes a sole-tag for an empty element with generic identifier g. For purposes of calculating derivatives, this is treated as if the input were the two-symbol string “stag( g ) etag( g )”.
  • stag( g ) denotes a start-tag for an element with generic identifier g.
  • etag( g ) denotes an end-tag for an element with generic identifier g.
  • pcdata() denotes a sequence of one or more characters.
Since there are an infinite number of legal generic identifiers, there are an infinite number of possible input symbols. Note that for simplicity I ignore comments, processing instructions, and insignificant white space, imagining that they are handled at another level of the processor. If it is desired, they can be handled explicitly by adding the rule that if x is a comment, processing instruction, or insignificant white space, D x E = E for all E.

As usual, the derivative of an expression E with respect to a sequence of symbols x 1, x 2, x 3, ..., x n ,is found by finding the derivative of E with respect to x 1, then the derivative of D x 1 E with respect to x 2, and so on. Formally,

D s E = D <x 1, x 2, ... x n > E = D x n (... (D x 3 (D x 2 (D x 1 E))) ...)

A sequence of input symbols d is well-formed XML if and only if D d X is nullable, when the rules for derivatives are as described below.

The third extension is the the derivative of a primitive language expression with respect to an input symbol may be something other than ε or ∅. Derivatives of compound expressions (concatenation, disjunction, conjunction) are given in [Brzozowski 1964] and need not be repeated here. [Sperberg-McQueen 2005] defines derivatives for interleave.

When x = stag( h ), then

  • D x elem( g ) = (W etag( g )) if h = g, otherwise ∅.
  • D x stag( g ) = ε if h = g, otherwise ∅.
  • D x etag( g ) = D x #PCDATA = ∅.
  • D x W = (W etag( g ) W).
  • D x X = (W etag( g )).
  • D x ε = D x ∅ = ∅.

When x = etag( h ), then

  • D x elem( g ) = D x stag( g ) = ∅.
  • D x etag( g ) = ε if h = g, otherwise ∅.
  • D x #PCDATA = D x W = D x X = D x ε = D x ∅ = ∅.

When x = pcdata(), then

  • D x elem( g ) = D x stag( g ) = D x etag( g ) = ∅.
  • D x #PCDATA = ε.
  • D x X = D x W = D x ε = D x ∅ = ∅.

Examples of using these rules to check XML well-formedness are given in appendix 7-1.

The reader will note that the derivative, which represents the expected sequence of tokens to come, is being used as a stack: when we encounter a start-tag in the input which matches an elem( g ) in the expression being matched, we push (W etag( g )) onto the stack by placing it at the front of the derivative.

Validation

This method can be applied to validation as well, by changing just one rule. When stag( g ) in the input matches the expression elem( g ), the rules just given make the derivative be (W etag( g )): in well-formed XML, what follows the start-tag of any element is well-formed XML followed by the end-tag of that element. To validate against a DTD, all we need to do is to replace W in the derivative with the content model of the element. If we write the expression “cm( g )” to denote the content model of element type g, and “[[cm( g )]]” to denote the value of that expression (i.e. to indicate that the expression should be replaced by its value), the rule becomes:

  • D stag(h) elem( g ) = ([[cm( g )]] etag( g )) if h = g, otherwise ∅.

For example, consider a vocabulary for serializing binary trees whose definition (in DTD syntax) is:

<!ELEMENT tree ((tree, tree) | leaf) >
<!ELEMENT leaf (#PCDATA) >
Let us validate the document
<tree><leaf>42</leaf></tree>

We start with the expression elem(tree) and validate the input stream “stag(tree), stag(leaf), pcdata(), etag(leaf), etag(tree)”:

E 0 = elem(tree)
E 1 = D stag(tree) E 0 = [[cm(tree)]], etag(tree) = ((elem(tree), elem(tree)) | elem(leaf)), etag(tree)
E 2 = D stag(leaf) E 1 = ((∅, elem(tree)) | ([[cm(leaf)]], etag(leaf))), etag(tree) = ((∅) | ([[cm(leaf)]], etag(leaf))), etag(tree) = [[cm(leaf)]], etag(leaf), etag(tree) = #PCDATA, etag(leaf), etag(tree)
E 3 = D pcdata() E 2 = ε, etag(leaf), etag(tree) = etag(leaf), etag(tree)
E 4 = D etag(leaf) E 3 = ε, etag(tree) = etag(tree)
E 5 = D etag(tree) E 4 = ε
The final expression is nullable, and we can therefore conclude that the input was valid against the grammar.

Further examples of validation using Brzozowski derivatives are given in appendix 7-2.

Parsing TexMecs without self-overlap

The technique just described can be used without major changes to validate documents with overlap against individual rabbit/duck grammars which have no overlap-groups. The only change is that the implementation must take care to treat fourth-class elements and tags for third-class elements appropriately.

We can validate a document against a whole set of rabbit/duck grammars rather than just individual grammars with parsing mechanism which keeps track of the current derivative in each grammar and for each token calculates the next derivative for each grammar. Perhaps the simplest way to acquire such a mechanism is to connect the derivatives for the different languages using logical conjunction. Thus if the current validation state of grammar 1 is denoted E 1, and the state of grammar 2 is E 2, etc., then for n grammars we will have a single expression of the form

E 1E 2 ∧ ... ∧ E n

It is necessary, however, to keep track of which expressions come from which grammars, so as to treat third- and fourth-class elements appropriately. One way to do this is to assign a number to each grammar and extend the expression language so as to handle the necessary bookkeeping. The foundation of the expression language are the content model expressions defined above in section 2-1; for purposes of validation we extend that expression language as follows.

First, we subscript or mark each reference to a first-class element in a grammar with the number of that grammar. In written form, this can be represented as a subscript to the generic identifier, so that instead of being written (a, (b | #tag(c)), d), a content model in grammar 1 would be written (a 1, (b 1 | #tag(c)), d 1), and similarly for grammar 2, 3, and so on. The rule for start-tags matching elements can then be rewritten

  • D stag(h) elem( g , n ) = ([[cm( g , n )]] etag( g )) if h = g, otherwise ∅.
where “cm(g,n)” denotes the content model of element g in grammar n.

Second, we add two new types of group expressions. Neither of these will normally occur in user-written content models; they will be used only as the result of taking a derivative from some other expression.

group   ::= g-tag | spinner
g-tag   ::= '#gr(' POSITIVE_INTEGER ')' content-model
spinner ::= '#spin(' POSITIVE_INTEGER ')' 
            '(' list ')' content-model
list    ::= empty | GI ' / ' list
empty   ::= ε // nothing, the empty sequence
It's also convenient to be able to use symbols for the empty string and the empty set:
token ::= 'ε'
        | '∅'

An expression of the form #gr( n )( E ) labels expression E as having come from grammar number n. The rules for taking the derivative D x #gr( n )( E ) are these:

  • If x is stag( g ), etag( g ), or sole( g ), and g is a third-class element in grammar n, then D x #gr( n )( E ) = #gr( n )( E ). Informally: third-class tags are ignored.
  • If x is stag( g