Fixing broken XML is a problem interesting in its own right, and fixing broken (X)HTML is a particularly timely topic. This paper introduces a declarative approach to specifying the repair of ill-formed XML, based on the TagSoup work of John Cowan. A prototype implementation, called PYXup, is described, which operates on the PYX output of the scanner phase of TagSoup to fix-up all well-formedness and some structural problems.
| XML Source | PDF (for print) | Author Package | Typeset PDF |
The historical and social dimensions of the current situation and future trajectory of the HTML family of languages are so complex as to defy easy analysis (see for example [TAG07]), but one issue at least stands out: should 'the next HTML' be an XML language, and should it have an explicit schema? One major constituency has specifically rejected the use of any form of schema, and is describing their candidate for 'the next HTML' only using (somewhat stylised) natural language [Hickson07].
The primary motivation for the 'no schema' approach is the perceived necessity of specifying the language in a way which accommodates the brutal reality of current practice: most HTML on the Web today is invalid, but browsers process it pretty consistently none-the-less. The differences between HTML and XHTML are at best poorly understood, and rarely respected. The complex interplay of the relevant technologies and human propensities are judged likely to mean that this situation will persist indefinitely. In [TAG07] the W3C's Technical Architecture Group asks
"Is the indefinite persistence of 'tag soup' HTML consistent with a sound architecture for the Web? If so, (and the going-in assumption is that it is so), what changes, if any, to fundamental Web technologies are necessary to integrate 'tag soup' with SGML-valid HTML and well-formed XML?
. . .
"Can we [treat tag soup] 'as if' it was the serialization of the DOM produced by (some formalization of) common browser error recovery strategies?"
This paper explores one possible route to implementing the 'as if' suggestion, particularly the "some formalization of" parenthetical. There is one outstanding precedent for this, namely John Cowan's TagSoup [Cowan04]. Although to a certain extent TagSoup's behaviour is driven by declarative specifications, there are significant aspects which cannot be affected easily, or at all, without modifying the code. Preliminary exploration of matching the behaviour of TagSoup to the error-correcting behaviour of current browsers ran into difficulties for this reason. The PYXup system described here is an attempt to reconstruct part of TagSoup in a way which exposes more behavioural options to external declarative control.
PYXup is closely based on [Cowan04], which has two distinct components, both implemented in Java:
The first component, the scanner, which parses input into a stream of low-level tokens, is already pretty much completely customisable via the external table which controls it. This table is expressed in XML.
The second component, the 'rectifier', which deals with, for example, missing end tags, is controlled in part by an external schema, again expressed in XML, but the degree of customisation this provides for is fairly limited.
Accordingly, PYXup uses the scanner component of [Cowan04] unchanged, but implements its own 'rectifier', operating on the PYX [McGrath00] output of the scanner to produce a fixed-up stream of PYX events.
The fixup component of PYXup is driven by two declarative inputs:
The simplified grammar is expressed in XML, and can be derived automatically from a W3C XML Schema schema (and thus from other grammar-orientated schema languages such as DTD and Relax NG). It consists entirely of immediate dominance rules, and is interpreted as specifying a very regular push-down automaton. The fixup rules are likewise expressed in XML, and are interpreted as a kind of production system, with patterns expressed in terms of problem type and parsing state and actions expressed in terms of output events and operations on parser state.
For example, the following broken XML document:
<c>
(c -\n
<!DOCTYPE a [ <!ELEMENT a (b,d)> <!ELEMENT b (#PCDATA|c)*> <!ELEMENT c EMPTY> ]>
(a (b (c )c -\n )b )a
[Cowan04] defines a simple schema language, TSSL [Tag Soup Schema Language], which represents primarily immediate dominance information, with no sequencing, disjunction or cardinality information. PYXup uses a similar approach, with even less detail. A PYXup schema is a set of rules, each of which specifies only
| nonTerm |
An element name |
| children |
A set of element names |
| mixed |
A boolean |
| maybeRoot |
A boolean |
| preferredParent |
(optional) An element name |
A PYXup schema also may specify a preferred parent for otherwise not-allowed text.
A PYXup schema is interpreted as a push-down automaton, in which each rule is corresponds to a finite state automaton, whose name is given by its nonTerm, as follows (for an element named nt):

The )) transition is to accommodate the PYX event generated for empty tags and the anachronistic </> markup.
Furthermore, for every child a transition from and to state 2 is added, labelled with the name of the corresponding FSA. If the element allows text content, a transation labelled -* is added likewise. For example, for the b element as defined above, that is <!ELEMENT b (#PCDATA|c)*>, the result would be

Such an automaton will accept the beginning and end PYX events for a b element, with any number of balanced c element begin/end pairs, mixed with any number of text events, in between.
Finally, the collection of automata is transformed to simplify processing and improve efficiency, by as it were promoting the opening transition from each embedded automaton, so that for example the automaton for the a element as defined above, that is <!ELEMENT a (b,d)> as actually used looks like this:

This change implies that there is must also be a root automaton which dispatches on the open event for all possible document elements to the appropriate named automaton.
The automata act as recognisers and identity transducers, that is, as each event is accepted it is echoed.
What can go wrong, and how can it be fixed? At one level, there are only four things that can go wrong:
Case (2) is currently not detected by PYXup. This means that strictly speaking PYXup's output is a well-formed XML entity, that is, it may contain one or more elements. This case is discussed further in “Further work”.
Case (4) covers a wide range of well-formedness and basic validity problems. PYXup currently handles the following subcases, as well as (1) and (3), based on the type of event, the current state and the contents of the pushdown stack (examples are based on the 'abcd' doctype above):
| upEnd |
The event is a close event, and there is an automaton higher up the pushdown stack which could accept it, that is, a matching open even has occurred, as in (a (b )a. The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it. |
| badEnd |
The event is a close event, and there is no automaton higher up the pushdown stack which could accept it, as in (a )d. The default fixup is to discard the event. |
| upChild |
The event is an open event, and there is an automaton higher up the pushdown stack which could accept it, that is, an open event for an element which could accept this as a child has occurred, as in (a (b (d. The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it. |
| badChild |
The event is an open event, there is no automaton higher up the pushdown stack which could accept it, but the event's tag is known to be allowed as the child of at least one other element, as in (a (c. The default fixup is to postpone the current event and process an open event for the preferred parent element (if specified in the grammar) or an arbitrary allowed parent element (otherwise) instead (in the example, this would be a (b). |
| badOrphan |
The event is an open event but no known element allows it as a child, as in (a (b (a. The default fixup is to force acceptance, that is, to act as if the current automaton were constructed from an element which did have the relevant element as a child. |
| upText |
The event is a text event, and there is an automaton higher up the pushdown stack which could accept it, that is, an open event for an element which allows text has occurred, as in (a (b (c -text. The default fixup is to output a close event for the current automaton (in the example, a )c) and return from it. |
| orphanText |
The event is a text event, there is no automaton higher up the pushdown stack which could accept it, but at least one element is known to accept text, as in (a -text. The default fixup is to postpone the current event and process an open event for the preferred parent element for text (if specified in the grammar) or an arbitrary text-allowing parent element (otherwise) instead (in the example, this would be a (b). |
| badText |
The event is a text event, but there is no automaton higher up the pushdown stack which could accept it, and no text-allowing element in the grammar at all. The default fixup is to force acceptance, that is, to act as if the current automaton were constructed from an element which did allow text. |
| overrun |
The token stream ends, but the pushdown stack is not empty, as in the case where (a (b were the entire input. The default fixup is to output a close event for the current automaton (in the example, a )b) and return from it. |
| unknown |
The start tag for a tag with no definition is seen, as in (a (x. The automatic fixup is to synthesise an automaton for the tag which allows all known tags as children (but not text). |
The abstract syntax for the fixup language represents the defaults given above as follows:
upEnd: end("%o") & pop
badEnd: ignore
upChild: end("%o") & pop
badChild: splice("%p") & retry
badOrphan: force
upText: end("%o") & pop
orphanText: splice("%t") & retry
badText: force
overrun: end("%o") & pop
PYXup's default fixup as described above comes very close to reconstructing TagSoup's built-in 'rectification'. The major differences I'm aware of are as follows:
At the moment, when a given problem (e.g. non-allowed start tag) may need differing fixups depending on context, PYXup handles this by dispatching on a finite number of different keys which encode the context differences (e.g. upChild, badChild, badOrphan). It might be possible to address the 'restartable', 'unclosable' and 'cdata' issues described in the list above by further articulation of such keys, but it probably makes more sense to revise the rule language to allow for richer patterns which could test for (boolean combinations of) a range of properties of the current event and the events on the pushdown stack.
From a much more theoretical perspective, there is a body of work in formal language theory about error correction, see e.g. [Fischer77], which deserves to be explored for possible relevance.
For the time being PYXup discards comment, processing instruction and attribute events -- obviously this needs to be fixed! A systematic approach to namespaces is also needed.
The most important next step is to look at the kinds of fixup described in the prose of [Hickson07] to see which ones are covered by the default rules given above, which ones could be covered with new rules in the existing language or the extensions proposed above in section “Comparison with TagSoup”, which ones would require other changes to the language and, of course, which ones cannot be accommodated by the PYXup paradigm at all.
The XML languages for specifying the schema and the fixup rules are both very simple. Only two elements are involved for the schema:
<idgram
textParent? = IDREF
namespace? = anyURI>
(element)*
</idgram>
<element
name = ID
mixed? = true|false
maybeRoot? = true|false
parent? = IDREF
children? = IDREFS/>The fixup rules require a bit more:
<rules>
rule*
</rules>
<rule
match = string>
(pop|ignore|force|retry|splice|end|error)*
</rule>
<pop/>
<ignore/>
<force/>
<retry/>
<splice>
string
</splice>
<end>
string
</end>
<error>
string
</error>The current status of PYXup, including material for download, is available from http://www.ltg.ed.ac.uk/PYXup/.
Needless-to-say, I owe a substantial debt of gratitude to John Cowan, not only for the design of TagSoup, but for making its implementation available. He is not, of course, responsible for any of the changes and additions reported here.
[Cowan04] Cowan, J., TagSoup: A SAX parser in Java for nasty, ugly HTML, 2004. Available online at http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf, see also the TagSoup project home page http://home.ccil.org/~cowan/XML/tagsoup/.
[Fischer77] Fischer, C. N., D. R. Milton, S. B. Quiring, "An efficient insertion-only error-corrector for LL(1) parsers", Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 97--103, ACM, Los Angeles, 1977.
[Hickson07] Hickson, I. ed., Web Applications 1.0, 2007. Available online as http://www.whatwg.org/specs/web-apps/current-work/.
[McGrath00] McGrath, S., XML Processing with Python, Prentice Hall, 0-13-021119-2, 2000. See also online http://www.xml.com/pub/2000/03/15/feature/index.html.
[TAG07] W3C Technical Architecture Group, Draft description of new TAG issue TagSoupIntegration-54, W3C, 2007. Available online at http://lists.w3.org/Archives/Public/www-tag/2006Oct/0062.html.