XML Europe 2003

What can you do with half a parser?

Abstract

While most developers are happy parsing their XML with off-the-shelf parsers and working with fully-cooked results, there are times when developers need a little more control over their document processing. While XML is text, applying text processing tools directly to XML has some serious drawbacks. This presentation will explore the possibilities offered by a combination of XML parsing for context with text processing to manipulate that content.

Table of Contents

1. The Ripper Parser
1.1. Challenges in Writing Ripper
1.2. Ripper licensing and availability
2. Applications
2.1. Custom Characters and Context
2.2. Entity processing
2.3. Valuable Sugar
2.4. Custom parsing
3. Conclusions
Acknowledgements
Bibliography
Glossary
Biography

Most developers have spent the past five years of XML growing ever more deeply trapped in a view of XML documents as node trees. While that view is useful in many circumstances, it is also severely limited by the many differences between the original text of XML documents and the results reported by the XML 1.0 [XML] processor, or worse, the XML 1.0+namespaces [XMLNS] processor. These differences are the result of hard work on the part of the processor (more commonly called the parser), but they have severely limited the applicability of text-based tools to XML and made it difficult to create transformations which change as little as possible of the original document. The tight integration of DTD processing with XML 1.0 parsing has both driven proposals for more cleanly layered models and made it difficult or impossible to update DTD processing for new features like namespaces.

In response to these problems, many developers have retreated more and more deeply into the node-based view, commonly describing their work as "Infoset manipulation" and paying little or no attention to the markup syntaxes which provide the foundation for the XML Infoset [Infoset]. While the infoset-oriented community seems to believe that their approach provides better interoperability, infoset-oriented tools also remove much of XML's flexibility - particularly since DTDs play no direct role in the infoset, which represents a processed view of the XML document.

So far as I can determine, XML parsers have been consistently built around what is effectively an Infoset model. The Simple API for XML (SAX) [SAX2] and the Infoset are very similar, and while there are certainly differences between the Document Object Model (DOM) [DOM] and XPath models and the Infoset, all of these models are tightly bound to notions of nodes in a tree structure. Also, XML 1.0 is fairly specific about the processing required to be a "conforming XML 1.0 processor", and performing that processing excises a lot of information from the original document. Wisely, most developers followed the path of least resistance and created tools which report fully-processed Infosets to the application.

Because of my various frustrations with this approach as well as an occasional need to make automated but minimal transformations, I have been working for a while in this area, starting with an article [Layered] suggesting that XML parsing would benefit from a refactoring into separate components for syntax parsing, well-formedness checking, entity resolution, attribute defaulting, namespace processing, structural validation and finally presentation to the application. All of these pieces of the XML puzzle are useful in isolation as well as in combination.

This perspective has also led me to create tools which support a richer view of XML documents than is possible through the lens of conformant XML 1.0 processing or subsequent specifications. Markup Object Events (MOE) [MOE] provides an object structure which can preserve entity information and other aspects of lexical XML. Ents [Ents] provides an alternate mechanism for working with character references and entities composed only of characters, while Gorille [Gorille] defines a mechanism for testing acceptable characters in markup that could handle the shift from XML 1.0 to XML 1.1. All of these have been foiled by the locked box of the XML 1.0 parser, as they provide functionality that requires tight integration with the parser. As a result, I wrote "Ripper", a new part of the Gorille package which parses XML documents but leaves much of the processing to external layers.

1. The Ripper Parser

The Ripper parser performs some but not all of the functions of an XML 1.0 processor. Its primary function is to break documents down into components conforming to the markup grammar used by XML 1.0, hence its name. It performs some error reporting, primarily in cases where the markup itself violates the basic grammatical rules laid out in XML 1.0. Ripper keeps track of the element, attribute, and namespace contexts, and reports all of the content of the document to a handler, including tidbits like attribute quoting style and whitespace inside of tags.

Perhaps more important than what Ripper does is what Ripper leaves to the application. Ripper performs no DOCTYPE processing, entity processing, attribute defaulting, character checking, or normalization on the textual information it passes to the application. The application is responsible for performing any of these tasks as it deems appropriate, or it can ignore them and just process the raw information that is handed to it.

Communication between the application and Ripper is managed through two key interfaces. Ripper uses the ContextI interface to communicate information about the document - ranging from the origin URI to currently-scoped namespace declarations to a brief element tree - to the application. The application can also modify this context, either to communicate with the parser or to communicate with other applications in a chain of processors. The context object also provides a small foundation of initial information, notably XML 1.0's built-in entities and namespace URIs for the xml and xmlns prefixes.

The DocProcI interface provides a means for Ripper to communicate the actual textual content of the document, angle brackets, whitespace, and all, to the receiving application. Because Ripper is so focused on text, the API is almost completely text-oriented, using StringBuffer objects to represent everything. This tactic is unusual and would probably horrify most proper Java programmers, but is appropriate to the kind of information Ripper provides. The API is also quite close to the markup, as the following excerpt demonstrates:

public StringBuffer XMLDecl (StringBuffer content) throws GorilleException;
public StringBuffer DOCTYPE (StringBuffer content) throws GorilleException;
public StringBuffer startElementOTag (StringBuffer content) throws GorilleException;
public StringBuffer startElementCTag (StringBuffer content) throws GorilleException;
public StringBuffer elementName (StringBuffer content) throws GorilleException;
public StringBuffer tagSpace (StringBuffer content) throws GorilleException;
public StringBuffer attName (StringBuffer content) throws GorilleException;
public StringBuffer attEquals (StringBuffer content) throws GorilleException;
public StringBuffer attStartQuote (StringBuffer content) throws GorilleException;
public StringBuffer attEndQuote (StringBuffer content) throws GorilleException;
public StringBuffer endElementOTag (StringBuffer content) throws GorilleException;
public StringBuffer endElementETag (StringBuffer content) throws GorilleException;
public StringBuffer endElementCTag (StringBuffer content) throws GorilleException;
public StringBuffer chars (StringBuffer content) throws GorilleException;
public StringBuffer decCharRef (StringBuffer content) throws GorilleException;
public StringBuffer hexCharRef (StringBuffer content) throws GorilleException;
public StringBuffer entRef (StringBuffer content) throws GorilleException;
public StringBuffer commentStart (StringBuffer content) throws GorilleException;
public StringBuffer commentContent (StringBuffer content) throws GorilleException;
public StringBuffer commentEnd (StringBuffer content) throws GorilleException;
public StringBuffer PIStart (StringBuffer content) throws GorilleException;
public StringBuffer PITarget (StringBuffer content) throws GorilleException;
public StringBuffer PISpace (StringBuffer content) throws GorilleException;
public StringBuffer PIData (StringBuffer content) throws GorilleException;
public StringBuffer PIEnd (StringBuffer content) throws GorilleException;
public StringBuffer CDATAStart (StringBuffer content) throws GorilleException;
public StringBuffer CDATAEnd (StringBuffer content) throws GorilleException;

It's not exactly lovely code, but it does make it possible, even easy, to process lexical content and return lexical content. The original use-case for Ripper was as a pre-processor to another parser, making modifications in the text before passing the document to the parser. Some applications may just build themselves on top of Ripper, which is fine - the StringBuffer return value can be treated as void if needed. Applications can also ignore events that don't interest them, using helper classes that come with the package. Most interestingly, of course, they can change or suppress content.
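
The change-or-suppress pattern might look something like the following minimal sketch. It is written against a hypothetical four-method stand-in for DocProcI, not the real interface, which declares every method in the excerpt and throws GorilleException. It also assumes that the returned StringBuffer is the text handed downstream and that an empty buffer suppresses the event; the excerpt does not spell out either convention. The names CommentStripperDemo, Events, and Stripper, and the hand-fed event sequence, are illustrative only.

```java
class CommentStripperDemo {

    /** Hypothetical slice of the event API (an assumption, not Gorille's DocProcI). */
    interface Events {
        StringBuffer chars(StringBuffer content);
        StringBuffer commentStart(StringBuffer content);
        StringBuffer commentContent(StringBuffer content);
        StringBuffer commentEnd(StringBuffer content);
    }

    /** Passes character data through untouched and suppresses every part of a comment. */
    static class Stripper implements Events {
        public StringBuffer chars(StringBuffer content) { return content; }
        public StringBuffer commentStart(StringBuffer content) { return new StringBuffer(); }
        public StringBuffer commentContent(StringBuffer content) { return new StringBuffer(); }
        public StringBuffer commentEnd(StringBuffer content) { return new StringBuffer(); }
    }

    /** Feeds events by hand, in the order a parser would presumably report "a<!--b-->c". */
    static String run(Events handler) {
        StringBuffer out = new StringBuffer();
        out.append(handler.chars(new StringBuffer("a")));
        out.append(handler.commentStart(new StringBuffer("<!--")));
        out.append(handler.commentContent(new StringBuffer("b")));
        out.append(handler.commentEnd(new StringBuffer("-->")));
        out.append(handler.chars(new StringBuffer("c")));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(new Stripper()));
    }
}
```

A handler that returned each buffer unchanged would reproduce the input text exactly, which is the behaviour a pre-processor wants for every event it does not care about.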

1.1. Challenges in Writing Ripper

Ripper is not all that complicated a program, though its character-by-character parsing logic isn't particularly delightful. For the most part, it trudges through the document, keeps track of both lexical and structural context, and reports what it finds. There are a few cases where markup is so badly wrong that it can't be processed even at Ripper's relatively simple level, and these are reported as errors.

The only particularly difficult part of writing Ripper was created by Namespaces in XML. Prior to namespaces, an XML document could be parsed directly in sequence. Everything a parser needed to know to parse a given part of a document, if anything, came from earlier parts of the document. Because of namespaces, however, the parser frequently needs to read to the end of the start tag to interpret the element name at the beginning of the tag. Namespaced attributes often have the same problem, with namespace declarations that come after the prefix has already been used.

Solving this problem requires parsing the start tag twice. The first parse is used to set the context, including namespace context, and the second parse is used to report the text to the application. Using this approach, the application will have the namespace context it needs to interpret element and attribute names as they arrive.
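
The two-pass idea can be sketched as follows. This is not Ripper's code: it uses a regular expression for brevity where Ripper parses character by character, it does not resolve entity or character references inside declaration values, and the class and method names (TwoPassStartTag, collectDeclarations, expandElementName) are invented for illustration. The first pass gathers namespace declarations from anywhere in the tag; only then is the element name at the front of the tag interpreted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassStartTag {

    // Matches name="value" or name='value' pairs inside a start tag.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([^\\s=<>/]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Pass one: collect xmlns declarations, wherever they appear in the tag. */
    static Map<String, String> collectDeclarations(String startTag,
                                                   Map<String, String> inherited) {
        Map<String, String> scope = new HashMap<>(inherited);
        Matcher m = ATTRIBUTE.matcher(startTag);
        while (m.find()) {
            String name = m.group(1);
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            if (name.equals("xmlns")) {
                scope.put("", value);                      // default namespace
            } else if (name.startsWith("xmlns:")) {
                scope.put(name.substring("xmlns:".length()), value);
            }
        }
        return scope;
    }

    /** Pass two: read the element name with the namespace scope already in place. */
    static String expandElementName(String startTag, Map<String, String> inherited) {
        Map<String, String> scope = collectDeclarations(startTag, inherited);
        int end = 1;                                       // skip the leading '<'
        while (end < startTag.length()
                && " \t\r\n/>".indexOf(startTag.charAt(end)) < 0) {
            end++;
        }
        String qname = startTag.substring(1, end);
        int colon = qname.indexOf(':');
        String prefix = colon < 0 ? "" : qname.substring(0, colon);
        String local = qname.substring(colon + 1);
        String uri = scope.get(prefix);
        return "{" + (uri == null ? "" : uri) + "}" + local;
    }

    public static void main(String[] args) {
        // The declaration for prefix "a" comes after both uses of the prefix.
        System.out.println(expandElementName(
            "<a:doc a:id='1' xmlns:a=\"urn:x\">", new HashMap<>()));
    }
}
```

In the example tag the prefix is used twice before it is declared, which is exactly the case a single left-to-right pass cannot interpret.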

Unfortunately this double-parse has created some duplicate code, both for the double-parse itself and for ampersand and entity handling. Attribute values may of course include entities and character references, even attribute values which happen to be used by namespace declarations. For purposes of context, Ripper resolves these entities, but it then reports the unresolved entities separately during the reporting phase.

1.2. Ripper licensing and availability

Ripper is distributed as part of the Gorille project, all of which is licensed under the Mozilla Public License (MPL). Gorille includes Ripper, rules-based character testing code, and some common code used to create shorthand descriptions of document structures. Ripper (and all of Gorille) is written in Java, and requires at least Java 1.2.

2. Applications

There are a number of cases where such an impractical-looking and not particularly efficient API may be useful. Minimal transformations where you don't want to change surrounding context are relatively easy in this context, especially cases where you want to preserve the original DOCTYPE and XML declaration. Some developers also need to perform processing where the entity references in a document should be left alone, preserved but unopened. Ripper is also useful for processing where entity or namespace declarations need to be specified outside of the document itself, and for processing where custom character rules are necessary - typically for the transition from XML 1.0 to 1.1, but possibly also for other environments that need to perform such filtering.

2.1. Custom Characters and Context

While most of the arguments about XML 1.1 focus on the NEL character and whether or not change of any kind is a good thing for the core of XML, XML 1.0 certainly helped create its own versioning problem. The list of characters included in XML 1.0 was illuminating and useful, but it was also built so deeply into parsers that changing it now is difficult.

Ripper can't fix all of the old parsers, but it does offer an approach that may be useful in the future. Ripper builds only its expectations for the markup characters themselves into the parsing logic, and leaves determination of whitespace and other acceptable characters to the application. As Gorille integration with Ripper proceeds, this information will become available through the Context object and applicable through a dedicated layer of processing.
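
As a rough illustration of what pluggable character rules could look like, the sketch below expresses the Char productions of XML 1.0 and XML 1.1 as interchangeable rule objects. The CharRule interface and the firstRejected helper are hypothetical and are not Gorille's rules format, which this paper does not show. The sketch checks only the Char production; XML 1.1's further requirement that most control characters appear only as character references is left out.

```java
class CharRuleDemo {

    /** Hypothetical rule interface (an assumption, not part of Gorille). */
    interface CharRule {
        boolean accepts(int codePoint);
    }

    /** Char production of XML 1.0: tab, LF, CR, and the ranges from #x20 upward. */
    static final CharRule XML_1_0 = cp ->
        cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);

    /** Char production of XML 1.1: everything from #x1 except surrogates, #xFFFE, #xFFFF. */
    static final CharRule XML_1_1 = cp ->
        (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);

    /** Returns the index of the first character the rule rejects, or -1 if all pass. */
    static int firstRejected(CharSequence text, CharRule rule) {
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (!rule.accepts(cp)) {
                return i;
            }
            i += Character.charCount(cp);   // step over surrogate pairs as one unit
        }
        return -1;
    }

    public static void main(String[] args) {
        StringBuffer text = new StringBuffer("ok").append((char) 1);
        System.out.println("XML 1.0: " + firstRejected(text, XML_1_0));
        System.out.println("XML 1.1: " + firstRejected(text, XML_1_1));
    }
}
```

Because the rule is an object handed to the checking layer rather than logic compiled into the parser, moving from one version's character set to another becomes a configuration choice.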

This separation of context from the parsing logic also means that it is possible to configure context and then process document fragments within that context. This greatly simplifies the processing of external entities which themselves contain entity references, to give one common XML 1.0 example. It may also help in the processing of document fragments which are missing their namespace declarations.

2.2. Entity processing

The most recently prominent use case involves the current set of issues surrounding character entities, where DTDs, particularly the internal subset, are used to provide entity declarations in otherwise schema-centric (or completely unvalidated) environments. Some developers would prefer not to deal with DTDs at all, and there are now a fair number of environments (BEEP's core vocabulary, SOAP messages) where the DOCTYPE declaration is prohibited. This creates problems for some developers, notably those using MathML with its many frequently-used entities.

Because Ripper reserves entity processing to the application, applications can solve problems like these with entity resolvers focused on their particular needs rather than the expectations of a given parser. An application could even resolve entities based on their current namespace or element scope, making it possible to create entity vocabularies which are associated with particular structural vocabularies rather than with a single document. This could potentially reduce name collisions between the entities used by different vocabularies, a problem avoided today by copying and coordination.
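
One way an application might scope entities to a namespace is sketched below. Everything here is an assumption made for illustration: the ScopedEntityDemo class, its declare and entRef methods, the idea that the handler receives the whole reference text (for example "&alpha;"), and the idea that the current element's namespace is passed in directly rather than read from a context object. References the resolver does not recognise, including the built-in ones, are returned untouched.

```java
import java.util.HashMap;
import java.util.Map;

class ScopedEntityDemo {

    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    // namespace URI -> (entity name -> replacement text)
    private final Map<String, Map<String, String>> byNamespace = new HashMap<>();

    /** Registers an entity that is only visible inside elements of the given namespace. */
    void declare(String namespace, String name, String replacement) {
        byNamespace.computeIfAbsent(namespace, k -> new HashMap<>())
                   .put(name, replacement);
    }

    /**
     * Resolves a reference such as "&alpha;" against the entities declared for
     * the current namespace. Unknown or malformed references pass through as-is.
     */
    StringBuffer entRef(StringBuffer content, String currentNamespace) {
        int len = content.length();
        if (len < 3 || content.charAt(0) != '&' || content.charAt(len - 1) != ';') {
            return content;
        }
        String name = content.substring(1, len - 1);
        Map<String, String> scoped = byNamespace.get(currentNamespace);
        String value = scoped == null ? null : scoped.get(name);
        return value == null ? content : new StringBuffer(value);
    }

    public static void main(String[] args) {
        ScopedEntityDemo resolver = new ScopedEntityDemo();
        resolver.declare(MATHML_NS, "alpha", "\u03B1");
        // Resolved inside MathML, left alone elsewhere.
        System.out.println(resolver.entRef(new StringBuffer("&alpha;"), MATHML_NS));
        System.out.println(resolver.entRef(new StringBuffer("&alpha;"), "urn:other"));
    }
}
```

Two vocabularies could then declare an entity with the same name without colliding, since each lookup is keyed by the namespace in scope.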

While Ents only provides support for character entities, Ripper's foundations are flexible enough that applications can summon a new instance of Ripper to parse an external entity and integrate it with the existing document, if desired. If the application takes care to preserve context objects, those can be combined to provide support for complex cases like nested entities which rely on namespace declarations from the parent.

2.3. Valuable Sugar

Another use case involves situations where otherwise unimportant details of an XML document are used to mark content which needs special treatment - a transformation that only applies to attributes with single quotes, for example. While such work doesn't necessarily accord with an Infoset view of XML, and has often "flown under the radar", it can still be a practical means of combining and massaging information from different sources.

Unfortunately for the "Desperate Perl Hacker", XML is not exactly conducive to simple manipulation with regular expressions. Entity references and namespace prefixes both serve as abbreviations for information declared elsewhere, and default attributes can also make such processing difficult. Also, while transformations are a critical part of XML processing, such transformations may throw away information, notably comments, processing instructions, and whitespace - which are actually useful to developers manipulating documents as text.

Ripper doesn't solve these problems automatically, but it provides a framework within which developers can combine textual processing and an understanding of the markup context.
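
The single-quote example mentioned above could be approached along these lines. This is a sketch under several assumptions: the Handler class is a hypothetical slice of the event API rather than an implementation of DocProcI, attribute value text is assumed to arrive through chars between attStartQuote and attEndQuote, and upper-casing stands in for whatever transformation the application actually needs.

```java
import java.util.Locale;

class QuoteSugarDemo {

    /** Hypothetical handler: transforms only the values of single-quoted attributes. */
    static class Handler {
        private boolean inSingleQuoted = false;

        StringBuffer attStartQuote(StringBuffer content) {
            // Remember which quoting style opened this attribute value.
            inSingleQuoted = content.length() == 1 && content.charAt(0) == '\'';
            return content;
        }

        StringBuffer chars(StringBuffer content) {
            if (!inSingleQuoted) {
                return content;             // leave everything else alone
            }
            return new StringBuffer(content.toString().toUpperCase(Locale.ROOT));
        }

        StringBuffer attEndQuote(StringBuffer content) {
            inSingleQuoted = false;
            return content;
        }
    }

    /** Feeds one single-quoted and one double-quoted value through the handler. */
    static String run() {
        Handler h = new Handler();
        StringBuffer out = new StringBuffer();
        out.append(h.attStartQuote(new StringBuffer("'")));
        out.append(h.chars(new StringBuffer("draft")));
        out.append(h.attEndQuote(new StringBuffer("'")));
        out.append(h.attStartQuote(new StringBuffer("\"")));
        out.append(h.chars(new StringBuffer("final")));
        out.append(h.attEndQuote(new StringBuffer("\"")));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The quoting style itself is passed through unchanged, so the marker survives for any later stage that relies on it.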

2.4. Custom parsing

Developers can also build their own parsing or storage logic on top of this API. Because the entire document will be reported verbatim, it's possible to use this information to report an XML document to an application while preserving its original form more precisely than is possible with approaches like SAX or DOM. (Encoding issues may still keep it from being byte-for-byte identical, but character-for-character is plausible.)

There are a number of uses for this kind of processing. Environments like MOE, which support more features than are provided by SAX or DOM, have always been limited by the kinds of parsers available to them. A Ripper application could easily create MOE events, effectively building a MOE parser. Lexical analysis programs might also use Ripper as a front end to their own work.

3. Conclusions

While this approach may not appeal to every developer, I hope that it will find a useful place in many developers' toolkits, helping to solve problems that require knowledge of the document both as text and as marked-up structure and content.

Acknowledgements

Thanks to Walter Perry, Rick Jelliffe, Gavin Thomas Nicol, John Cowan, Paul Prescod, and the xml-dev list generally for various sparks. Additional thanks to the xmlhack editors for providing an informal support group for various demented XML adventures.

Bibliography

[DOM] W3C DOM Working Group. Document Object Model http://www.w3.org/DOM/DOMTR

[Ents] St.Laurent, Simon. Ents http://simonstl.com/projects/ents/

[Gorille] St.Laurent, Simon. Gorille http://simonstl.com/projects/gorille/

[Infoset] Cowan, John and Tobin, Richard. XML Information Set http://www.w3.org/TR/xml-infoset/

[Layered] St.Laurent, Simon. Toward A Layered Model for XML http://simonstl.com/articles/layering/layered.htm

[MOE] St.Laurent, Simon. Markup Object Events (MOE) http://simonstl.com/projects/moe/

[SAX2] Megginson, David, Brownell, David, et al. The Simple API for XML (SAX) http://saxproject.org

[XML] Bray, Tim, et al. Extensible Markup Language 1.0 (Second Edition). http://www.w3.org/TR/REC-xml

[XMLNS] Bray, Tim, et al. Namespaces in XML. http://www.w3.org/TR/REC-xml-names

Glossary

MOE

Markup Object Events

Biography

Simon St. Laurent is an Associate Editor with O'Reilly and Associates. Prior to that, he'd been a web developer, network administrator, computer book author, and XML troublemaker. He lives in Ithaca, NY. His books include XML: A Primer, XML Elements of Style, and the upcoming Office 2003 XML Essentials. He is a contributing editor to xmlhack and an occasional contributor to XML.com.