Abstract
While most developers are happy parsing their XML with off-the-shelf parsers and working with fully-cooked results, there are times when developers need a little more control over their document processing. XML is text, but applying text processing tools directly to XML has some serious drawbacks. This presentation explores the possibilities offered by combining XML parsing, which supplies context, with text processing, which manipulates the content within that context.
The Ripper parser performs some but not all of the functions of an XML 1.0 processor. Its primary function is to break documents down into components conforming to the markup grammar used by XML 1.0, hence its name. It performs some error reporting, primarily in cases where the markup itself violates the basic grammatical rules laid out in XML 1.0. Ripper keeps track of the element, attribute, and namespace contexts, and reports all of the content of the document to a handler, including tidbits like attribute quoting style and whitespace inside of tags.
Perhaps more important than what Ripper does is what Ripper leaves to the application. Ripper performs no DOCTYPE processing, entity processing, attribute defaulting, character checking, or normalization on the textual information it passes to the application. The application is responsible for performing any of these tasks as it deems appropriate, or it can ignore them and just process the raw information that is handed to it.
Communication between the application and Ripper is managed through two key interfaces. Ripper uses the ContextI interface to communicate information about the document - ranging from the origin URI to currently-scoped namespace declarations to a brief element tree - to the application. The application can also modify this context, either to communicate with the parser or to communicate with other applications in a chain of processors. The context object also provides a small foundation of initial information, notably XML 1.0's built-in entities and namespace URIs for the xml and xmlns prefixes.
The DocProcI interface provides a means for Ripper to communicate the actual textual content of the document, angle brackets, whitespace, and all, with the receiving application. Because Ripper is so focused on text, the API is almost completely text-oriented, using StringBuffer objects to represent everything. This tactic is unusual and would probably horrify most proper Java programmers, but is appropriate to the kind of information Ripper provides. The API is also quite close to the markup, as the following excerpt demonstrates:
public StringBuffer XMLDecl (StringBuffer content) throws GorilleException;
public StringBuffer DOCTYPE (StringBuffer content) throws GorilleException;
public StringBuffer startElementOTag (StringBuffer content) throws GorilleException;
public StringBuffer startElementCTag (StringBuffer content) throws GorilleException;
public StringBuffer elementName (StringBuffer content) throws GorilleException;
public StringBuffer tagSpace (StringBuffer content) throws GorilleException;
public StringBuffer attName (StringBuffer content) throws GorilleException;
public StringBuffer attEquals (StringBuffer content) throws GorilleException;
public StringBuffer attStartQuote (StringBuffer content) throws GorilleException;
public StringBuffer attEndQuote (StringBuffer content) throws GorilleException;
public StringBuffer endElementOTag (StringBuffer content) throws GorilleException;
public StringBuffer endElementETag (StringBuffer content) throws GorilleException;
public StringBuffer endElementCTag (StringBuffer content) throws GorilleException;
public StringBuffer chars (StringBuffer content) throws GorilleException;
public StringBuffer decCharRef (StringBuffer content) throws GorilleException;
public StringBuffer hexCharRef (StringBuffer content) throws GorilleException;
public StringBuffer entRef (StringBuffer content) throws GorilleException;
public StringBuffer commentStart (StringBuffer content) throws GorilleException;
public StringBuffer commentContent (StringBuffer content) throws GorilleException;
public StringBuffer commentEnd (StringBuffer content) throws GorilleException;
public StringBuffer PIStart (StringBuffer content) throws GorilleException;
public StringBuffer PITarget (StringBuffer content) throws GorilleException;
public StringBuffer PISpace (StringBuffer content) throws GorilleException;
public StringBuffer PIData (StringBuffer content) throws GorilleException;
public StringBuffer PIEnd (StringBuffer content) throws GorilleException;
public StringBuffer CDATAStart (StringBuffer content) throws GorilleException;
public StringBuffer CDATAEnd (StringBuffer content) throws GorilleException;
It's not exactly lovely code, but it does make it possible, even easy, to process lexical content and return lexical content. The original use-case for Ripper was as a pre-processor to another parser, making modifications in the text before passing the document to the parser. Some applications may just build themselves on top of Ripper, which is fine - the StringBuffer return value can be treated as void if needed. Applications can also ignore events that don't interest them, using helper classes that come with the package. Most interestingly, of course, they can change or suppress content.
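To make the idea of changing or suppressing content concrete, here is a minimal sketch of a handler. It does not use the real Gorille classes: MiniDocProc is a hypothetical five-method subset of DocProcI, SketchException stands in for GorilleException, and the driver feeds events by hand instead of running Ripper. It also assumes that returning an empty buffer is how an event is suppressed, which the text above does not spell out.

```java
class CommentStripperDemo {
    /** Hand-feeds the events a parser might report for <b>hi<!-- note --></b>. */
    static String run(MiniDocProc h) {
        StringBuffer out = new StringBuffer();
        try {
            // Tag delimiters are appended directly to keep the sketch short.
            out.append('<').append(h.elementName(new StringBuffer("b"))).append('>');
            out.append(h.chars(new StringBuffer("hi")));
            out.append(h.commentStart(new StringBuffer("<!--")));
            out.append(h.commentContent(new StringBuffer(" note ")));
            out.append(h.commentEnd(new StringBuffer("-->")));
            out.append("</").append(h.elementName(new StringBuffer("b"))).append('>');
        } catch (SketchException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(new CommentStripper())); // prints <strong>hi</strong>
    }
}

/** Stand-in for GorilleException, which is not reproduced here. */
class SketchException extends Exception {
    SketchException(String message) { super(message); }
}

/** Hypothetical subset of the DocProcI callbacks. */
interface MiniDocProc {
    StringBuffer elementName(StringBuffer content) throws SketchException;
    StringBuffer chars(StringBuffer content) throws SketchException;
    StringBuffer commentStart(StringBuffer content) throws SketchException;
    StringBuffer commentContent(StringBuffer content) throws SketchException;
    StringBuffer commentEnd(StringBuffer content) throws SketchException;
}

class CommentStripper implements MiniDocProc {
    // Change content: rename b to strong, pass every other name through.
    public StringBuffer elementName(StringBuffer content) {
        return content.toString().equals("b") ? new StringBuffer("strong") : content;
    }

    // Pass character data through untouched.
    public StringBuffer chars(StringBuffer content) { return content; }

    // Suppress content: all three comment events return an empty buffer.
    public StringBuffer commentStart(StringBuffer content) { return new StringBuffer(); }
    public StringBuffer commentContent(StringBuffer content) { return new StringBuffer(); }
    public StringBuffer commentEnd(StringBuffer content) { return new StringBuffer(); }
}
```

Because every callback both receives and returns text, the driver can simply concatenate the return values to produce the modified document.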
Ripper is not all that complicated a program, though its character-by-character parsing logic isn't particularly delightful. For the most part, it trudges through the document, keeps track of both lexical and structural context, and reports what it finds. There are a few cases where markup is so badly wrong that it can't be processed even at Ripper's relatively simple level, and these are reported as errors.
The only particularly difficult part of writing Ripper was created by Namespaces in XML. Prior to namespaces, an XML document could be parsed directly in sequence. Everything a parser needed to know to parse a given part of a document, if anything, came from earlier parts of the document. Because of namespaces, however, the parser frequently needs to read to the end of the start tag to interpret the element name at the beginning of the tag. Namespaced attributes often have the same problem, with namespace declarations that come after the prefix has already been used.
Solving this problem requires parsing the start tag twice. The first parse is used to set the context, including namespace context, and the second parse is used to report the text to the application. Using this approach, the application will have the namespace context it needs to interpret element and attribute names as they arrive.
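The two passes can be sketched roughly as follows. This is an illustration, not Ripper's code: the class and method names are hypothetical, a regular expression stands in for Ripper's character-by-character scanning, and entity references inside namespace URIs are ignored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassStartTag {
    // Matches name="value" or name='value'. A regex is used only for brevity.
    private static final Pattern ATTR =
        Pattern.compile("([^\\s=<>/]+)\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    /** Pass 1: copy the parent scope and add this tag's namespace declarations. */
    static Map<String, String> declare(String tag, Map<String, String> parentScope) {
        Map<String, String> scope = new HashMap<String, String>(parentScope);
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            String name = m.group(1);
            String quoted = m.group(2);
            String value = quoted.substring(1, quoted.length() - 1);
            if (name.equals("xmlns")) {
                scope.put("", value);                  // default namespace
            } else if (name.startsWith("xmlns:")) {
                scope.put(name.substring(6), value);   // prefixed namespace
            }
        }
        return scope;
    }

    /** Pass 2: report the element name, now that its prefix can be resolved. */
    static String expandedElementName(String tag, Map<String, String> scope) {
        // The name runs from just after '<' to the first whitespace, '/' or '>'.
        int end = 1;
        while (end < tag.length() && " \t\r\n/>".indexOf(tag.charAt(end)) < 0) {
            end++;
        }
        String qname = tag.substring(1, end);
        int colon = qname.indexOf(':');
        String prefix = colon < 0 ? "" : qname.substring(0, colon);
        String local = qname.substring(colon + 1);
        String uri = scope.get(prefix);
        if (uri == null) {
            if (!prefix.isEmpty()) {
                throw new IllegalArgumentException("undeclared prefix: " + prefix);
            }
            return local;                              // no default namespace in scope
        }
        return "{" + uri + "}" + local;
    }

    public static void main(String[] args) {
        // The declaration for x appears after the prefix has already been used.
        String tag = "<x:doc id='1' xmlns:x=\"urn:example:x\">";
        Map<String, String> scope = declare(tag, new HashMap<String, String>());
        System.out.println(expandedElementName(tag, scope)); // prints {urn:example:x}doc
    }
}
```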
Unfortunately this double-parse has created some duplicate code, both for the double-parse itself and for ampersand and entity handling. Attribute values may of course include entities and character references, even attribute values which happen to be used by namespace declarations. For purposes of context, Ripper resolves these entities, but it then reports the unresolved entities separately during the reporting phase.
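The split between a resolved value for context and a raw value for reporting might look something like this. The ContextValue class is hypothetical, and it covers only XML 1.0's five built-in entities and numeric character references. Anything else is left as written.

```java
class ContextValue {
    /** Returns the value as the context should see it; the raw text is untouched. */
    static String resolve(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            int semi = (c == '&') ? raw.indexOf(';', i) : -1;
            if (semi < 0) {            // ordinary character, or '&' with no ';'
                out.append(c);
                i++;
                continue;
            }
            String text = expand(raw.substring(i + 1, semi));
            if (text == null) {
                out.append(raw, i, semi + 1);   // unknown reference: keep as written
            } else {
                out.append(text);
            }
            i = semi + 1;
        }
        return out.toString();
    }

    /** Expands one reference body such as "amp" or "#x41"; null if unknown. */
    private static String expand(String ref) {
        if (ref.equals("amp")) return "&";
        if (ref.equals("lt")) return "<";
        if (ref.equals("gt")) return ">";
        if (ref.equals("quot")) return "\"";
        if (ref.equals("apos")) return "'";
        try {
            if (ref.startsWith("#x")) {
                return new String(Character.toChars(Integer.parseInt(ref.substring(2), 16)));
            }
            if (ref.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(ref.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return null;               // malformed number or invalid code point
        }
        return null;
    }

    public static void main(String[] args) {
        String raw = "http://example.com/?a=1&amp;b=2";
        System.out.println("context:  " + resolve(raw));
        System.out.println("reported: " + raw);
    }
}
```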
Ripper is distributed as part of the Gorille project, all of which is licensed under the Mozilla Public License (MPL). Gorille includes Ripper, rules-based character testing code, and some common code used to create shorthand descriptions of document structures. Ripper (and all of Gorille) is written in Java, and requires at least Java 1.2.
There are a number of cases where such an impractical-looking and not particularly efficient API may be useful. Minimal transformations where you don't want to change surrounding context are relatively easy in this context, especially cases where you want to preserve the original DOCTYPE and XML declaration. Some developers also need to perform processing where the entity references in a document should be left alone, preserved but unopened. Ripper is also useful for processing where entity or namespace declarations need to be specified outside of the document itself, and for processing where custom character rules are necessary - typically for the transition from XML 1.0 to 1.1, but possibly also for other environments that need to perform such filtering.
While most of the arguments about XML 1.1 focus on the NEL character and whether or not change of any kind is a good thing for the core of XML, XML 1.0 certainly helped create its own versioning problem. The list of characters included in XML 1.0 was illuminating and useful, but it was also built so deeply into parsers that changing it now is difficult.
Ripper can't fix all of the old parsers, but it does offer an approach that may be useful in the future. Ripper builds only its expectations for the markup characters themselves into the parsing logic, and leaves determination of whitespace and other acceptable characters to the application. As Gorille integration with Ripper proceeds, this information will become available through the Context object and applicable through a dedicated layer of processing.
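As a sketch of what application-supplied character rules could look like, the following keeps the rule separate from the code that walks the text. CharRule and XmlCharRules are hypothetical names, not Gorille's actual character-testing classes. The two rules follow the Char productions of XML 1.0 and XML 1.1; XML 1.1's further requirement that some of those characters appear only as character references is ignored here.

```java
class XmlCharRules {
    /** Char production of XML 1.0. */
    static final CharRule XML_1_0 = c ->
        c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);

    /** Char production of XML 1.1. */
    static final CharRule XML_1_1 = c ->
        (c >= 0x1 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);

    /** Returns the index of the first code point the rule rejects, or -1. */
    static int firstViolation(CharSequence text, CharRule rule) {
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            if (!rule.allows(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        StringBuffer text = new StringBuffer("a\u0001b");
        System.out.println(firstViolation(text, XML_1_0)); // prints 1
        System.out.println(firstViolation(text, XML_1_1)); // prints -1
    }
}

/** Hypothetical pluggable rule: the application decides what is acceptable. */
interface CharRule {
    boolean allows(int codePoint);
}
```

Swapping the rule object changes what is accepted without touching the scanning loop, which is the separation the paragraph above describes.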
This separation of context from the parsing logic also means that it is possible to configure context and then process document fragments within that context. This greatly simplifies the processing of external entities which themselves contain entity references, to give one common XML 1.0 example. It may also help in the processing of document fragments which are missing their namespace declarations.
The most recently prominent use case involves the current set of issues surrounding character entities, where DTDs, particularly the internal subset, are used to provide entity declarations in otherwise schema-centric (or completely unvalidated) environments. Some developers would prefer not to deal with DTDs at all, and there are now a fair number of environments (BEEP's core vocabulary, SOAP messages) where the DOCTYPE declaration is prohibited. This creates problems for some developers, notably those using MathML with its many frequently-used entities.
Because Ripper reserves entity processing to the application, applications can solve problems like these with entity resolvers focused on their particular needs rather than the expectations of a given parser. An application could even resolve entities based on their current namespace or element scope, making it possible to create entity vocabularies which are associated with particular structural vocabularies rather than with a single document. This could potentially reduce name collisions between the entities used by different vocabularies, a problem avoided today by copying and coordination.
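One way such a scoped resolver might be organized is a table of entity names per namespace URI. The ScopedEntityResolver class is hypothetical, and the sketch assumes the application receives the reference as the full text "&name;" and already knows the namespace of the enclosing element from the context.

```java
import java.util.HashMap;
import java.util.Map;

class ScopedEntityResolver {
    // namespace URI -> (entity name -> replacement text)
    private final Map<String, Map<String, String>> tables =
        new HashMap<String, Map<String, String>>();

    /** Registers an entity that is only visible inside the given namespace. */
    void define(String namespaceUri, String name, String replacement) {
        Map<String, String> table = tables.get(namespaceUri);
        if (table == null) {
            table = new HashMap<String, String>();
            tables.put(namespaceUri, table);
        }
        table.put(name, replacement);
    }

    /** Returns the replacement, or the reference unchanged if none applies. */
    StringBuffer resolve(StringBuffer reference, String namespaceUri) {
        int len = reference.length();
        if (len < 3 || reference.charAt(0) != '&' || reference.charAt(len - 1) != ';') {
            return reference;          // not of the form &name;
        }
        String name = reference.substring(1, len - 1);
        Map<String, String> table = tables.get(namespaceUri);
        String replacement = (table == null) ? null : table.get(name);
        return (replacement == null) ? reference : new StringBuffer(replacement);
    }

    public static void main(String[] args) {
        String mathml = "http://www.w3.org/1998/Math/MathML";
        ScopedEntityResolver resolver = new ScopedEntityResolver();
        resolver.define(mathml, "alpha", "\u03B1");
        // Resolved inside MathML, left alone everywhere else.
        System.out.println(resolver.resolve(new StringBuffer("&alpha;"), mathml));
        System.out.println(resolver.resolve(new StringBuffer("&alpha;"), "urn:example:other"));
    }
}
```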
While Ents only provides support for character entities, Ripper's foundations are flexible enough that applications can summon a new instance of Ripper to parse an external entity and integrate it with the existing document, if desired. If the application takes care to preserve context objects, those can be combined to provide support for complex cases like nested entities which rely on namespace declarations from the parent.
Another use case involves situations where otherwise unimportant details of an XML document are used to mark content which needs special treatment - a transformation that only applies to attributes with single quotes, for example. While such work doesn't necessarily accord with an Infoset view of XML, and has often "flown under the radar", it can still be a practical means of combining and massaging information from different sources.
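The single-quote example could be sketched as a small stateful handler. The method names follow the DocProcI excerpt, but the class is hypothetical and does not implement the real interface. It assumes that attribute value text arrives through chars between the attStartQuote and attEndQuote events, which the excerpt suggests but does not state, and the events are fed by hand.

```java
import java.util.Locale;

class SingleQuoteMarker {
    private boolean inSingleQuoted = false;

    StringBuffer attStartQuote(StringBuffer content) {
        // Remember the quoting style of the attribute being reported.
        inSingleQuoted = content.toString().equals("'");
        return content;
    }

    StringBuffer chars(StringBuffer content) {
        if (!inSingleQuoted) {
            return content;            // leave everything else alone
        }
        // The transformation itself is arbitrary; upper-casing is easy to see.
        return new StringBuffer(content.toString().toUpperCase(Locale.ROOT));
    }

    StringBuffer attEndQuote(StringBuffer content) {
        inSingleQuoted = false;
        return content;
    }

    /** Hand-feeds the events for two attributes:  a='x' b="y" */
    static String demo() {
        SingleQuoteMarker h = new SingleQuoteMarker();
        StringBuffer out = new StringBuffer();
        String[][] atts = { {"a", "'", "x"}, {"b", "\"", "y"} };
        for (String[] att : atts) {
            out.append(' ').append(att[0]).append('=');
            out.append(h.attStartQuote(new StringBuffer(att[1])));
            out.append(h.chars(new StringBuffer(att[2])));
            out.append(h.attEndQuote(new StringBuffer(att[1])));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());    // prints  a='X' b="y"
    }
}
```

A SAX or DOM application could not make this distinction, because the quoting style is discarded before the attribute value reaches it.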
Unfortunately for the "Desperate Perl Hacker", XML is not exactly conducive to simple manipulation with regular expressions. Entity references and namespace prefixes both serve as abbreviations for information declared elsewhere, and default attributes can also make such processing difficult. Also, while transformations are a critical part of XML processing, such transformations may throw away information, notably comments, processing instructions, and whitespace - which are actually useful to developers manipulating documents as text.
Ripper doesn't solve these problems automatically, but it provides a framework within which developers can combine textual processing and an understanding of the markup context.
Developers can also build their own parsing or storage logic on top of this API. Because the entire document will be reported verbatim, it's possible to use this information to report an XML document to an application while preserving its original form more precisely than is possible with approaches like SAX or DOM. (Encoding issues may still keep it from being byte-for-byte identical, but character-for-character is plausible.)
There are a number of uses for this kind of processing. Environments like MOE, which support more features than are provided by SAX or DOM, have always been limited by the kinds of parsers available to them. A Ripper application could easily create MOE events, effectively building a MOE parser. Lexical analysis programs might also use Ripper as a front end to their own work.
While this approach may not appeal to every developer, I hope that it will find a useful place in many developers' toolkits, helping to solve problems that require both knowledge of the document as text and the document as marked-up structure and content.
Thanks to Walter Perry, Rick Jelliffe, Gavin Thomas Nicol, John Cowan, Paul Prescod, and the xml-dev list generally for various sparks. Additional thanks to the xmlhack editors for providing an informal support group for various demented XML adventures.