Event-based Java parsers, like SAX and Xerces, are extended to a more powerful state-based XML stream parser API. Resource use is not much greater than for SAX, and application development is not much more difficult than with DOM and XSLT. The new API offers a good compromise in those applications where low resource requirements and ease-of-use are each a sufficiently important goal to justify a small sacrifice in achievement of the other goal.
Keywords: DOM; SAX; Parsing; API; Java
| XML Source | PDF (for print) | Author Package | Typeset PDF |
Java developers use two kinds of parsers, DOM parsers that produce a memory-based tree containing the full document, and SAX parsers that report a series of "events" as the document is scanned, but without retaining any information. DOM parsers are easier to work with because they present the data in a way that more closely matches a designer's view of XML, and because they are supported by powerful tools like XSLT. On the other hand, SAX parsers can handle documents of unlimited size, and can parse a document faster than a DOM parser.
This paper presents an XML stream parser API that can be implemented over a SAX parser, making the SAX parser easier to use without greatly reducing the advantages of a SAX parser. The new API brings some of the advantages of the DOM parser, by presenting a data in a way that more closely follows a designer's view of an XML document, and by adding some supporting tools to make design easier.
Java and XML represent reality in different ways. This section illustrates how DOM parsers and SAX parsers support Java representations of XML documents and why the difference between the Java representations and the XML representation sometimes causes problems.
XML markup identifies the components of a document by inserting structural information, like tags, between or around parts of the actual data in the document. Some of the components, like character entities, are "atomic" and can be represented as simple objects that combine markup and data. Other components, like elements or CDATA sections, are not atomic. Those components are represented as complex objects, consisting of boundaries (start and end) and content. The content may contain other components, which can be either simple or complex
XML markup requires a document to be "well-formed", in the sense that the document has a hierarchy of complete components and, for any two components, one component contains the other or the components do not overlap at all. Current SAX and DOM parsers (and the “XML Stream Parser API” presented by this paper) recognize, use, and depend on the well-formed property of XML documents.
DOM parsers load XML components into memory as a tree of Java objects, beginning with org.w3c.xml.Document as the root. Each XML component, whether simple or complex, is mapped to a single Java object as shown in Table 1.
| XML Component | DOM Class |
|---|---|
| attribute | Attr |
| CDATA section | CDATASection |
| comment | Comment |
| document | Document |
| document type name | DocumentType |
| element | Element |
| entity | Entity |
| notation | Notation |
| processing instruction | ProcessingInstruction |
| character data | Text |
The Java representation and the XML representation of the document differ in a significant way. In a book, titles will be handled differently for the book, chapters, and sections. In XML the <title> element will appear as a child of <book>, <chapter>, and <section> elements. The <title> element is contained in its parent element as a part of that parent element. In Java, the parent object identifies its children, but does not contain them. DOM parsers add additional information to the Java objects, so that the parent, child, predecessor, or successor of any element is easily identified.
DOM parsers store the entire document in memory before any of the application code is executed. This is both a strength and a weakness. The memory storage supports random access to all parts of the document, but very large documents may occupy too much memory, and no output can be produced until after reaching the end of the document.
DOM application developers can write code that reflects their understanding of the document and of XML, instead of code that requires them to know the (possibly esoteric) internal details of the tools they use. Application developers do not even need to know exactly how the document tree is maintained in memory, because they can use XPATH to describe the data they want, and to operate on the data they request. Using XPATH, a developer can describe an transformation in an XSLT file and then, in three simple operations, load the input document, load the XSLT file, and apply the XSLT file to the input document to produce the required output.
SAX parsers scan the document once and invoke event methods to as they encounter each atomic XML components and each boundary of a large XML component, as shown in Table 2.
| XML Component | SAX Event |
|---|---|
| attribute | none1 |
| CDATA section | LexicalHandler.startCDATA |
| LexicalHandler.endCDATA | |
| comment | LexicalHandler.comment |
| document | ContentHandler.startDocument |
| ContentHandler.endDocument | |
| document type name | none2 |
| DTD | LexicalHandler.startDTD |
| LexicalHandler.endDTD | |
| element | ContentHandler.startElement |
| ContentHandler.endElement | |
| entity | LexicalHandler.startEntity |
| LexicalHandler.endEntity | |
| notation | DTDHandler.notationDecl |
| processing instruction | ContentHandler.processingInstruction |
| character data | ContentHandler.characters |
Information from the SAX parser is available only during the event that presents the information. Consequently, documents that require simple processing can be of unlimited size, because memory resources depend only on the activity needed to translate one stream of input data to a second stream of output data. Moreover, output can begin as soon as enough of the input has been read to provide data for the beginning of output. However, more complex processing will require extensive work by the application designer, who must work without the range of tools that would be available for use with a DOM parser.
The events that a SAX parser produces from an XML document are not very close to the view that an application designer has of the document. The application designer thinks in terms of whole elements that contain other elements. When the SAX parser encounters a hierarchy of nested elements, it will first report the start of each element and then, in reverse order, the end of each element. The application must construct a view of the elements (and their relationship) from these events. Moreover, the application design is extremely vulnerable to changes in document structure, because the application view of the structure will almost certainly require change when the DTD changes.
DOM parsers and SAX parsers are limited in the sense that each kind lacks some of the strengths held by the other. Table 3 illustrates some of the differences between DOM parser API and the SAX parser API.
| XML Component | DOM Class | SAX Event |
|---|---|---|
| attribute | Attr | none1 |
| CDATA section | CDATASection | LexicalHandler.startCDATA |
| LexicalHandler.endCDATA | ||
| comment | Comment | LexicalHandler.comment |
| document | Document | ContentHandler.startDocument |
| ContentHandler.endDocument | ||
| document type name | DocumentType | none2 |
| DTD | none3 | LexicalHandler.startDTD |
| LexicalHandler.endDTD | ||
| element | Element | ContentHandler.startElement |
| ContentHandler.endElement | ||
| entity | Entity | LexicalHandler.startEntity |
| LexicalHandler.endEntity | ||
| notation | Notation | DTDHandler.notationDecl |
| processing instruction | ProcessingInstruction | ContentHandler.processingInstruction |
| character data | Text | ContentHandler.characters |
Table 3 shows that the DOM parser API is closer than the SAX parser API to representing the application designer's view of an XML document. The DOM parser API represents CDATA sections, elements, entities, and even the entire document as single objects. The SAX parser API reports, as separate events, the entry to, and exit from each of these XML components.
Table 3 does not show some other differences:
Although DOM and SAX parsers have proven to be good tools, they will be found wanting in situations that require a combination of the strengths from both DOM and SAX. Consequently, there is room for a stream parser that offers a compromise between SAX and DOM.
The XML Stream Parser API (Section “XML Stream Parser API”) offers an alternative midway between DOM and SAX in terms of speed, ease of use, and resource requirements. The differences between the three alternatives can be illustrated with a simple application that processes the simple document in Figure 1 to produce the output in Figure 2. The document and the application requirements are designed to create more difficulty in using a SAX parser than in using a DOM parser. The application is required to print the titles of sections without printing the document title or any other text. To do this, the application must recognize the document structure, which is shown in Figure 3.
<?xml version='1.0' encoding='us-ascii'?>
<!DOCTYPE document [
<!ELEMENT document (title, section*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT section (title, para*)>
<!ELEMENT para (#PCDATA)>
<!ATTLIST section>
]>
<document>
<title>Document</title>
<section>
<title>Section One</title>
</section>
<section>
<title>Section Two</title>
<para>Some text</para>
</section>
</document>
document
/ | \
/ | \
title section section
/ / | \
/ / | \
#PCDATA title title para
| | |
| | |
#PCDATA #PCDATA #PCDATA
Figure 4 shows how the document can be parsed in Java using a SAX parser. The document is parsed very quickly, but the logic for recognizing the document structure is scattered over several different event methods. That programming style introduces a kind of complexity that increases very rapidly as the size of the application increases. Moreover, the program is likely to be very sensitive to changes in a DTD because the program is, in effect, duplicating information in the DTD, and the information in both places must be consistent.
Java code:
saxParser.parse(inputSource, this);
writer.close();
public void startElement(String uri, String localName, String qName,
Attributes attributes)
throws SAXException
{
if (qName.equals("section")) {
isOutputEnabled = true;
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException
{
try {
if (qName.equals("section")) {
isOutputEnabled = false;
} else if (qName.equals("title") && isOutputEnabled) {
writer.write(lineSeparator);
isOutputEnabled = false;
}
} catch(IOException ex) {
throw new SAXException(ex);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException
{
try {
if (isOutputEnabled) writer.write(ch, start, length);
} catch(IOException ex) {
throw new SAXException(ex);
}
}
Output:
>java SaxParse ../input.xml
Section One
Section Two
Elapsed time: 7 milliseconds
Figure 5 shows how the document can be parsed in Java using a DOM parser and an XSLT stylesheet. Structural information appears in only one place, as the "section/title" match condition in one template and the structural information is directly related to the application requirement. Other information, like the parentage of an element, is available when needed.
Java code:
StreamSource stylesource
= new StreamSource(new File(arg[0]));
Transformer transformer
= transformerFactory.newTransformer(stylesource);
document = documentBuilder.parse(new File(arg[1]));
transformer.transform(new DOMSource(document),
new StreamResult(System.out));
XSLT stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="*">
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="section/title">
<xsl:apply-templates select="text()"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Output:
>java DomParse stylesheet.xml ../input.xml
Section One
Section Two
Elapsed time (stylesheet+parse+transform): 91 milliseconds
Elapsed time (parse+transform): 11 milliseconds
Figure 6 shows how the document can be parsed in Java using the XML Stream Parser API (Section “XML Stream Parser API”. As with the XSLT stylesheet, structural information appears in only one place, in the element_title method, where output is enabled only for titles that have a section as parent. The logic for handling an element is entirely within a single method that handles that element. Information is neither as transient as with SAX, nor as long-lived as with DOM. For example, the attributes of any element remain available on a stack until the end of that element, not disappearing with the start tag, like it does with SAX, nor remaining available until the end of processing, like it does with DOM.
Java code:
xmlParser.parse(input);
writerStack.close();
public void characters(ParseReader reader)
throws ListenerException, IOException {
reader.writeTo(writerStack);
}
public void doDocument(CurrentDocument document) {
try {
writerStack.push(nullOutputWriter); // disable output
document.parseContent(); // parse
writerStack.pop(); // re-enable output
writerStack.close(); // cleanup
} catch(Exception e) {
e.printStackTrace();
}
}
public void defaultElement(CurrentElement element)
throws ListenerException, IOException {
element.parseContent();
}
public void element_title(CurrentElement element)
throws ListenerException, IOException {
if (element.getParent().getQName().equals(SECTION)) {
Writer writer = writerStack.pop(); // pop nullOutputWriter
try {
element.parseContent(); // parse with output enabled
writerStack.write(lineSeparator);
} finally {
writerStack.push(writer); // push nullOutputWriter
}
} else {
element.parseContent(); // parse with output disabled
}
}
Output:
>java -cp ~/j/lib/gxparse.jar:. GxParse ../input.xml
Section One
Section Two
Elapsed time (including reflection): 12 milliseconds
Elapsed time (excluding reflection): 11 milliseconds
An essential difference between the XML Stream parser API and the SAX and DOM APIs is that the Stream Parser API is flow-oriented and state-based. The flow of data is controlled by changing state as elements are entered and exited. In Figure 6, the characters method simply transfers characters from the input stream to an output stream. The output stream is actually a stack of output streams, and the actual destination is controlled by element methods that push and pop output streams.
The doDocument method, which is invoked before any element method, disables output by pushing a NullOutputWriter on the stack. When a title element is encountered, the element_title method is invoked. Upon entry, the method checks to see if the element is a child of a section element. If so, the NullOutputWriter is popped to enable output, the content of the element is parsed, and the NullOutputWriter is pushed again to disable output. Otherwise, output remains disabled while the element content is parsed.
The difference in the Stream Parser API supports a programming style that can be very convenient, even though the style may initially seem strange to a SAX or DOM developer. That programming style depends on both explicit and implicit use of stacks to save and restore state [Knuth-1973]. The programming style views XML processing as an activity on a serial stream of data, with processing controlled indirectly by changes of state that affect the output process, rather than by allowing the output process itself to make input-based decisions about what to do. Element methods exercise this control by pushing original state and modifying current state when an element is entered, and popping the original state when an element is exited. A change of state can be done, as in Figure 6 by actually pushing and popping objects on a stack, or the change of state can be done by maintaining data in local variables that the method discards upon exit. In this programming style, an element method is a good "object" for representing an element, because
If we can accept the notion of a method as an "object", the XML Stream Parser API can be considered to be an object-oriented approache to processing XML documents.
The author was surprised to find that, at least in some circumstances, DOM is not much slower than SAX, and is just as fast as the Stream Parser API.
The execution times in the examples were obtained by preloading all classes to eliminate time consumed by the Java class loader. Following the preload, the DOM and Stream Parser API examples ran the parser twice, first to get processing times including one-time startup costs and then to get processing times excluding the startup costs. Startup cost for DOM with XLST is compilation of the XSLT stylesheet which only needs to be done once for any number of documents that require the same processing. Startup cost for the Stream Parser is looking up element methods by Java reflection, which only needs to be done the first time the element is encountered.
Table 4 summarizes the differences in processing time for the simple document example (Figure 1. After deducting startup costs, DOM took fifty percent longer than SAX, and took the same time as the XML Stream Parser API:
| API Used | Startup Costs | Repetitive Processing | One-time Processing |
|---|---|---|---|
| DOM (Figure 5) | 80 mSec. | 11 mSec. | 91 mSec. |
| SAX (Figure 4) | 0 mSec. | 7 mSec. | 7 mSec. |
| Stream Parser API (Figure 6) | 1 mSec. | 11 mSec. | 12 mSec. |
In repetitive processing, DOM seems to be as fast as the Stream Parser API, and SAX does not have enough extra speed to justify its greater development cost in many applications. In non-repetitive processing, DOM is significantly slower than SAX or the Stream Parser API, and the Stream Parser API does not lose much of its speed relative to SAX.
Class loading times were not measured precisely, but were about 120 mSec for DOM and about 150 mSec for SAX and the Stream Parser API.
These results indicate that, for repetitive processing, the Stream Parser API would need to be justified on some ground other than speed of processing. The justification is unlikely to be the programming style, because the Stream Parser API, while comparable to XSLT in ease-of-use, does not offer some of the features (like access to sibling elements) provided by XSLT.
The Stream Parser API becomes a good alternative when processing non-repetitive input, or when processing large documents. For non-repetitive processing, the Stream Parser API may offer speed advantages over DOM due to lower startup costs. For large documents, the Stream Parser API offers advantages over DOM because memory requirements are not much greater than for SAX. In both cases the Stream Parser API offers a programming style that, like XSLT, is oriented toward a designer's view of XML documents. In these circumstances, designers can obtain a good compromise between the low resource requirements of SAX and the ease-of-use provided by XSLT.
This section describes a stream parser API that can be implemented on event-based parsers like the SAX parser. The new API is implemented over an existing parser because event-based parsers are already doing part of the essential work of stream parsing. They recognize XML structure, finding and reporting the boundaries of XML components like elements and sequences of character data. All that needs to be added is the reporting of these components as entire objects, retention of important data (like the attributes and the parents of an element), easy access to the retained data, and an easier programming style.
DOM parsers and SAX parsers treat the incoming stream in different ways. DOM parsers view a document as a static object to be reconstructed from a serial stream. SAX parsers view a document as a stream to be translated during reception, where the precise nature of the translation is controlled and dynamically modified by the embedded markup. Because of this, a SAX parser provides no information at all about any part of the document that has already gone by. A SAX parser does not report an XML element as a single object, but reports the start and end of an element as two separate events, leaving it to the application to perceive a relationship between the two events. DOM parsers, on the other hand, reconstruct the document as a tree of static objects, which are represented in Java as class instances. Although the Java objects do not precisely represent corresponding XML objects, they are close enough to support XPATH syntax, which does represent XML objects well. Consequently, it is much easier to develop an XML application with DOm than with SAX, whenever the resource requirements and processing times of DOM are acceptable.
SAX parsers are used when it is more important to minimize the use of resources than it is to work easily with the XML document structure. DOM parsers are used when it is more important to have convenient access to the XML structure. A stream parser will be easier to use if it offers access to XML structure as conveniently as the DOM parser.
A DOM parser, receiving the simple document example (Figure 1), will reconstruct the hierarchy as a tree that provides access to all parts of the document and its structural information (see Figure 3).
A SAX parser, receiving the same document, will produce a sequence of events, interspersed with character data. If the application needs to recognize structural information, such as elements and the nesting of elements, that information must be reconstructed by the application, because the information is not supplied by the SAX parser. Figure 7 shows how an XML application would use SAX events from parsing the simple document example (Figure 1) to recognize the structure of that document (Figure 3).
SAX Event XML Element
startElement(document) --------------------------------+ document
|
startElement(title) ------------+ title |
| |
characters() -- #PCDATA | |
| |
endElement(title) ------------+ |
|
startElement(section) ---------------------+ section |
| |
startElement(title) ------------+ title | |
| | |
characters() -- #PCDATA | | |
| | |
endElement(title) ------------+ | |
| |
endElement(section) ---------------------+ |
|
startElement(section) ---------------------+ section |
| |
startElement(title) ------------+ title | |
| | |
characters() -- #PCDATA | | |
| | |
endElement(title) ------------+ | |
| |
startElement(para) ------------+ para | |
| | |
characters() -- #PCDATA | | |
| | |
endElement(para) ------------+ | |
| |
endElement(section) ---------------------+ |
|
endElement(document) --------------------------------+The XML stream parser API allows each element to be handled by a single "element" method, instead of by two event methods (start and end of element). The element method is invoked when the start tag has been read, and has access to all of the information in the start tag. The element method does any necessary processing at the start of the element, and then invokes a parseContent method to return control to the parser, and to parse the content of the element. The parseContent method returns when the end tag of the element has been reached. At this time, the element method still has access to all of the start tag information, which will only be discarded after the element method returns. While the element content is being parsed, other parts of the application can access the start tag information by getting it from an element stack maintained by the API.
The simple document example (Figure 1) would be parsed by methods like those shown in Figure 6. Figure 8 shows how the methods would be invoked when parsing the simple document example. Each method name is simply the element name prefixed by "element_" (the prefix is chosen by the application developer).
A comparison of Figure 7 and Figure 8 will show that each element start event has been replaced by an element method invocation, and that the corresponding end event has been replaced by a return from the element method.
Stream API Method XML Element
element_document() --------------------------------+ document
|
element_title() ------------+ title |
| |
characters() -- #PCDATA | |
| |
<-----------+ |
|
element_section() ---------------------+ section |
| |
element_title ------------+ title | |
| | |
characters() -- #PCDATA | | |
| | |
<-----------+ | |
| |
<--------------------+ |
|
element_section() ---------------------+ section |
| |
element_title ------------+ title | |
| | |
characters() -- #PCDATA | | |
| | |
<-----------+ | |
| |
element_para() ------------+ para | |
| | |
characters() -- #PCDATA | | |
| | |
<-----------+ | |
| |
<--------------------+ |
|
<-------------------------------+The XML stream parser API represents an element by a Java method, instead of by a Java class. This is appropriate, because a stream parser is producing data for an activity (translating of a flowing stream of characters) instead of producing static data (a document tree in memory) for input to a subsequent phase of processing. Each element method, upon entry, sets up a new processing state for the translation of characters, and upon exit, restores the previous state.
The processing state for an element would include all of the information about the enclosing elements. When method element_section is invoked in Figure 8, method element_document is suspended, but information from the document element is still available from an element stack, and can be obtained requesting the parent of the current element (document). Upon entry to the characters, the element stack contains three elements (document, section, and title) with all of the start tag information for each element. Each of the suspended element methods may have also stored some application-specific information that will be available to the characters method.
Table 5 lists the methods in XML Stream Parser API, and contrasts those methods with the objects produced by a DOM parser and the events produced by a SAX parser. Start and End methods in the SAX API become single methods in the stream parser API, and the new methods correspond to objects produced by a DOM parser. The Stream parser API can be implemented on a SAX parser, giving most of the advantages obtained by direct use of the SAX parser, but also allowing an application to deal explicitly with XML components, just as it could if using a DOM parser.
| XML Component | DOM Class | SAX Event | Stream Parser Method |
|---|---|---|---|
| attribute | Attr | none1 | none2 |
| CDATA section | CDATASection | LexicalHandler.startCDATA | Listener.cdata |
| LexicalHandler.endCDATA | |||
| comment | Comment | LexicalHandler.comment | Listener.comment |
| document | Document | ContentHandler.startDocument | Listener.doDocument |
| ContentHandler.endDocument | |||
| document type name | DocumentType | none3 | none4 |
| DTD | none5 | LexicalHandler.startDTD | Listener.doDtd |
| LexicalHandler.endDTD | |||
| element | Element | ContentHandler.startElement | Listener.doElement |
| ContentHandler.endElement | |||
| entity | Entity | LexicalHandler.startEntity | Listener.doEntity |
| LexicalHandler.endEntity | |||
| notation | Notation | DTDHandler.notationDecl | Listener.notationDecl |
| processing instruction | ProcessingInstruction | ContentHandler.processingInstruction | Listener.processingInstruction |
| character data | Text | ContentHandler.characters | Listener.characters |
The XML Stream Parser API can be implemented in a number of different languages, like python, perl, or C++. However, the author is more familiar with Java than perl or python, and finds Java easier than C++ to use for XML work. Consequently, the interface details are presented as they appear in a Java implementation.
Java package ca.gorman.xml.parse contains a large number of interfaces and classes, but three interfaces and two classes will be sufficient to illustrate the principles of the stream parser API
A Parser provides
public interface Parser
{
public Listener getListener();
public void setListener(Listener listener);
public void parse(Input input)
throws ListenerException, ParseException, IOException;
public Element getCurrentElement();
public Element [] getElementStack();
public boolean isParsing(String prefixedName);
public boolean isParsing(QName qName);
public Location getLocation();
}A Listener
In particular, the doElement method receives a CurrentElement (Figure 11) containing information that remains available from the time when the start tag has been read until the time the end tag has been read. The CurrentElement is at the top of a stack that contains all of the parent elements. Consequently, the application can get at this information at any time, but never needs to explicitly store the information.
The doElement method receives all elements, and uses the ElementMapper (Figure 13) to invoke the element method that should be invoked on any particular element.
In contrast to the corresponding SAX method, the characters method receives only the characters to be processed, and receives them as an input stream (a ParseReader, Figure 11), instead of receiving the characters in an array.
public interface Listener
extends ElementListener, EntityListener, DtdListener
{
public void doDocument(CurrentDocument document)
throws ListenerException, IOException;
public void startPrefixMapping(final String prefix, final String uri)
throws ListenerException, IOException;
public void endPrefixMapping(final String prefix)
throws ListenerException, IOException;
public void characters(final ParseReader reader)
throws ListenerException, IOException;
public void ignorableWhitespace(final ParseReader reader)
throws ListenerException, IOException;
public void processingInstruction(final String target, final String data)
throws ListenerException, IOException;
public void skippedEntity(final String name)
throws ListenerException, IOException;
public Input resolveEntity(final String publicId, final String systemId)
throws ListenerException, IOException;
public void cdata(final CurrentMarkedSection cdata)
throws ListenerException, IOException;
public void comment(final ParseReader reader)
throws ListenerException, IOException;
}
public interface ElementListener
{
public void doElement(CurrentElement element)
throws ListenerException, IOException;
}
public interface EntityListener
{
public void doEntity(CurrentEntity entity)
throws ListenerException, IOException;
}
/* DtdListener is seldom used */
public interface MessageListener
{
public void warning(final String description, final Location location,
final Exception exception)
throws ListenerException, IOException;
public void error(final String description, final Location location,
final Exception exception)
throws ListenerException, IOException;
public void fatalError(final String description, final Location location,
final Exception exception)
throws ListenerException, IOException;
}Each invocation of an element method receives a CurrentElement that contains information about the element, and that provides access to information from the stack of enclosing elements, back to the document root element. Additionally, the CurrentElement contains a parseContent method that returns control to the parser, to parse the content of the element. The parseContent method will throw an exception if invoked more than once. The API will throw an exception if the element method returns without invoking the parseContent once.
The element method is invoked after the start tag has been read. No further input is read until the parseContent method is invoked. The parseContent method returns when the end tag has been read. At that time, the CurrentElement is still valid, and still provides all of the information that was available when the element method was invoked.
The caching of information differs from DOM in two ways. First, the only cached information is the start tags of the elements. In particular, character content is never cached by by the Stream Parser API, although the API does make it easy to cache text when required (see Section “Redirecting Output”). Second, only those elements from the element currently being parsed, back to the document root element are cached. This means that memory requirements are much less than for DOM, but it also means that the API provides less information at any one time than can be obtained from DOM.
public interface CurrentElement extends Element
{
public int getDocumentLevel();
public void parseContent();
}
public interface Element
{
public String getPrefixedName();
public QName getQName();
public Element getParent();
public Attribute[] getAttributes();
public Attribute getAttribute(String prefixedName);
public Attribute getAttribute(QName qName);
public boolean hasParent(String prefixedName);
public boolean hasParent(QName name);
public boolean hasAncestor(String prefixedName);
public boolean hasAncestor(QName name);
public boolean equals(Object obj);
public int hashCode();
}
Each invocation of the characters method in the Listener receives a ParseReader that contains characters to be processed by the application. In contrast to SAX API, the characters are in a Java stream (a subclass of java.io.Reader), instead of in a sub array of an array of characters. Also in contrast to the SAX API, the ParseReader contains only the characters of interest. Nevertheless, the presentation of characters in this way is consistent with the SAX API, because SAX views the input as a stream. The new API merely extends this view to the application side of the API.
ParseReader extends java.io.Reader) by adding some methods for more convenient access to the character data.
The Listener (Figure 10) invokes the same doElement for every element. This would not be very convenient for application developers, who would have to test every incoming element and invoke the appropriate element method. The XML Stream Parser API provides an ElementMapper (Figure 13) to avoid this inconvenience.
The first time a new element is encountered, ElementMapper uses Java reflection to find a method to invoke on that element. The method is stored in a hash table with the element name as key, so that the lookup only needs to be done once, no matter how many times that element is encountered. The default ElementMapper can map an XML name (including one in a namespace) to a Java method even when the XML name is not a valid Java method name.
public abstract class ElementMapper implements ElementListener
{
public static ElementMapper newInstance(String methodNamePrefix,
Object defaultClassInstance,
String defaultMethodName)
throws NoSuchMethodException, IllegalAccessException {}
public void doElement(CurrentElement element)
throws ListenerException, IOException {}
}
Figure 14 shows how ElementMapper would be used in an application that produces HTML from the simple docoument example in Figure 1.
One very strong advantage of the ElementMapper is that an element method can be added or removed without any other change in the program. If the method is present, it will be used. If the method is not present, a default method will be used instead.
// Create the ElementMapper, mapping element names to methods in this
// class, methods names prefixed by "element_", and mapping the element
// to method "defaultElement" when there is no specific method defined
elementMapper = ElementMapper.newInstance("element_", this,
"defaultElement");
public void doDocument(CurrentDocument document)
throws ListenerException, IOException {
writerStack.write("<html><body>");
document.parseContent();
writerStack.write("</body></html>");
}
public void characters(ParseReader reader)
throws ListenerException, IOException {
// copy all text to the output
reader.writeTo(writerStack);
}
public void doElement(CurrentElement element)
throws ListenerException, IOException {
// let the element mapper find the appropriate method
elementMapper.doElement(element);
}
// use this element method when no specific element method is defined for
// a particular element name
public void defaultElement (CurrentElement element)
throws ListenerException, IOException {
element.parseContent();
}
public void element_title (CurrentElement element)
throws ListenerException, IOException {
writerStack.write("<h1>");
element.parseContent(); // write the title text
writerStack.write("</h1>");
}
public void element_para (CurrentElement element)
throws ListenerException, IOException {
writerStack.write("<p>");
element.parseContent(); // write the paragraph text
writerStack.write("</p>");
}The XML Stream Parser API delivers all characters to a single characters method in the Listener (Figure 10), which can be inconvenient when text requires different treatment, depending on which element contains the text.
Unlike DOM, the XML Stream Parser API does not offer random access to the input document, which would be very useful for applications that must handle ID and IDREF attributes during output to a sequential stream.
The XML Stream parser API supports output redirection to solve the first problem by providing control over the destination of output from the characters method, and output resequencing to solve the second problem by providing a kind of pseudo-random access to a sequential output stream. Section “Resequencing the Output” will show that random access output is just as useful as random access input for resolving ID and IDREF attributes.
Management of processing state can be enhanced by providing a Unix-like redirection facility within Java. A stack of Java Writers (WriterStack) can be used as the output, supporting temporary redirection by pushing and popping other Writers. Figure 15 shows how a WriterStack can be used to collect the content of one element and save the content until the end of the document for output, without a test inside the characters method to see which element is being processed.
String appendix = null;
WriterStack writerStack = new WriterStack(new StreamWriter(System.out));
doDocument(CurrentDocument document) {
document.parseContent(); // parse the whole document
writerStack.write(appendix); // write data saved from element a
}
// redirect output while in element a
element_a(CurrentElement element) {
writerStack.push(new CharArrayWriter());
element.parseContent(); // give control back to the parser
appendix = ((CharArrayWriter) writer.pop()).toString();
}
// write to current destination
characters(ParseReader reader) {
reader.writeTo(writerStack);
}
In an XML document, all forward and backward references are declared by attaching ID and IDREF attributes to the appropriate elements. These references can be resolved without using random access to the input, because ID and IDREF attributes obey two rules:
These two rules make it possible to use a technique that has been used for almost 50 years to link separate code files into a single executable program:
An example of this technique in the early years of microcomputers was the CP/M linker [Link-80].
It does not matter whether the second or third operation occurs first, because there is no need for the symbol value until the fourth operation commences, at which time all markers will have been written. The first three operations can be done together in a single sequential process, like parsing a document with the XML Stream Parser API. The last operation can be done automatically after that sequential process is completed.
All four operations are handled in the XML Stream Parser API by a Writer that implements the Resequencer interface. This is a straight-forward extension of java.io.Writer that uses the linking technique above to provide a virtual random access to the output:
Since normal data can overflow to a file during temporary storage, and out-of-sequence data will usually be a very small proportion of the data, the Resequencer can handle documents of unlimited size, without requiring large amounts of memory.
Figure 16 shows a simple example of a document with ID and IDREF attributes. Figure 17 shows the required output from that document. Figure 18 shows how the output can be obtained using a DOM parser with XSLT. Figure 19 shows how to get the same output with the XML Stream Parser API. It is much easier to get this output from SAX indirectly via the XML Stream Parser API, than it is to get the output directly via the SAX API.
<?xml version='1.0' encoding='us-ascii'?>
<!DOCTYPE document [
<!ELEMENT document (para*)>
<!ELEMENT para (title?,(text|ref)*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT ref EMPTY>
<!ATTLIST para
id ID #IMPLIED>
<!ATTLIST ref
idref IDREF #REQUIRED>
]>
<document>
<para id="p1">
<title>The first title</title>
<ref idref="p2"/>
<text>The first para
</text>
</para>
<para id="p2">
<title>The last title</title>
<text>The last para
</text>
<ref idref="p1"/>
</para>
</document>
The first title
See "The last title"
The first para
The last title
The last para
See "The first title"
The quoted titles are obtained from forward or backward references via the ref elements.
XSLT stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="*">
<xsl:apply-templates select="*"/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="para">
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="title">
<xsl:apply-templates select="text()"/>
</xsl:template>
<xsl:template match="text">
<xsl:text>
</xsl:text>
<xsl:apply-templates select="text()"/>
</xsl:template>
<xsl:template match="ref[@idref]">
<xsl:param name="idvalue" select="@idref"/>
<xsl:text>
See "</xsl:text>
<xsl:apply-templates select="//para[@id=$idvalue]/title"/>
<xsl:text>"</xsl:text>
</xsl:template>
</xsl:stylesheet>
Output:
>java Idrefs stylesheet.xml ../input.xml
The first title
See "The last title"
The first para
The last title
The last para
See "The first title"
// create an element mapper to automatically find the appropriate
// method for each element
elementMapper = ElementMapper.newInstance("element_", this,
"defaultElement");
// create a Writer that supports both resequencing and redirection
writerStack = new ResequencingWriterStack<String>(
new ResequencingWriter<String>(
new PrintWriter(System.out)));
// get parser
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
xmlParser = new XMLParser(factory);
xmlParser.setListener(this);
// parse, translate, and write out the document
xmlParser.parse(input);
writerStack.close();
public void characters(ParseReader reader)
throws ListenerException, IOException {
// copy input to the output
reader.writeTo(writerStack);
}
/**<P>
* Element dispatcher to select and invoke the appropriate element method.
* @see ElementMapper
*/
public void doElement(CurrentElement element)
throws ListenerException, IOException {
// let the element mapper find the appropriate method
elementMapper.doElement(element);
}
public void defaultElement (CurrentElement element)
throws ListenerException, IOException {
element.parseContent();
}
public void element_title (CurrentElement element)
throws ListenerException, IOException {
// divert output to a CharArrayWriter so we can use the output twice
CharArrayWriter charArrayWriter = new CharArrayWriter();
writerStack.push(charArrayWriter);
element.parseContent();
writerStack.pop();
// write to the normal destination
charArrayWriter.writeTo(writerStack);
writerStack.write(lineSeparator);
// Using the ID value of the parent as a key, store the title in a Mark
// for later use
Attribute id = element.getParent().getAttribute(ATT_ID);
if (id != null) {
charArrayWriter.writeTo(writerStack.mark(id.getValue()));
}
}
public void element_ref (CurrentElement element)
throws ListenerException, IOException {
Attribute idref = element.getAttribute(ATT_IDREF);
// write a mark using the IDREF value as the key that identifies
// the mark
writerStack.write("See \"");
writerStack.writeMark(idref.getValue());
writerStack.write("\"");
writerStack.write(lineSeparator);
element.parseContent();
}This section describes an implementation of the “XML Stream Parser API” over the SAX parser.
The greatest difficulty in the implementation is the need to run two synchronous activities, but without using the ordinary invoke-and-return paradigm of Java methods. The XML Stream Parser API requires that an element method suspend itself and allow the underlying SAX parser to run from its last position, even though the element method has not returned. At a subsequent point, the SAX parser will be suspended, and the element method will continue to its own exit, again causing the SAX parser to continue from its most recent position.
This section will show how coroutines are used to solve the problem managing and coordinating two synchronous activities to transform the SAX API (which produces a linear series of start and end events) into the XML Stream Parser API (which produces nested invocations of methods that each represent a pair of start and end events).
Sam Wilmott [Wilmott-2003] describes coroutines, which are not new [Conway-1963], shows how coroutines differ from the usual way of transferring control between two lines of execution, and explains how the coroutines are used to coordinate a pair of synchronous activities. Section “Using Coroutines to Implement the API on a SAX Parser” will explain why the XML Stream Parser API requires two synchronous lines of execution.
Transfer of control in a Java program occurs, as shown in figure 20, via the invoking of a method which
Main Subordinate
V
|
suspend | --- invoke (1) ------------+-> | begin
/ |
/ |
/ |
resume | <-- return (1) --------+---+-- | end
| / /
| / /
| / /
suspend | --- invoke (2) --->+ /
/
/
/
resume | <-- return (2) ----+
|
V
Note the asymmetry of control transfer.
In effect, one line of execution is the manager of both lines. Coroutines, on the other hand, give equal standing to each line of execution. Each line chooses when to yield control to the other, but does not specify the starting point in the other line. Instead, the other line of execution continues from the point where it previously yielded control, as shown in Figure 21
Coroutine 1 Coroutine 2
V
|
suspend | -- resume 2 -----------------> | continue
|
|
|
continue | <----------------- resume 1 -- | suspend
|
|
|
suspend | -- resume 2 -----------------> | continue
|
|
|
continue | <----------------- resume 1 -- | suspend
|
|
V
Note the symmetry of control transfer.
The XML Stream Parser API transforms each pair of matched start and end events from the SAX parser into a single method invocation(see Figure 23. This means that each of the invoked methods must be suspended while the SAX parser runs to the next SAX event. Since the method has not yet completed, and was invoked by an event from the SAX parser, the method cannot share a stack with the SAX parser. In Java (and other common languages like C++), the only way to run with separate stacks is to run in separate threads (or separate processes in C++). In order to transfer control by suspending a method, instead invoking a method and returning from the method, it is necessary to use coroutines.
The SAX parser is run in a Coroutine, which extends Thread by adding methods to support the transfers of control illustrated in Figure 22
Application Parser
V
|
suspend | -- start parser -------------->| begin
|
|
continue | <----- yield to application -- | suspend
|
|
suspend | -- yield to parser ----------> | continue
|
|
continue | <----- yield to application -- | suspend
|
|
suspend | -- yield to parser ----------> | continue
|
|
V
Note that transfer of control transfer is symmetric after the parser has been started.
Figure 23 shows how the XML Stream Parser API would coordinate an application and a SAX parser when parsing a simple document containing two nested elements with some PCDATA in the inner element.
Application SAX Parser API Method
Activity Activity Activity and Scope
a b char
V
|
suspend | -- start parser ----------------->| begin
|
|
continue | <-------- yield to application -- | suspend
|
|
|
suspend | -- yield to parser -------------> | continue
|
|
continue | <------------ startElement (a) -- | suspend |
| begin element method (a) |
| |
| |
suspend | -- yield to parser -------------> | continue |
suspend element method (a) | -
| -
continue | <------------ startElement (b) -- | suspend - |
| begin element method (b) - |
| - |
| - |
suspend | -- yield to parser -------------> | continue - |
suspend element method (b) | - -
| - -
continue | <------------ characters event -- | suspend - - |
| begin characters method - - |
| - - |
| - - |
suspend | -- return ----------------------> | continue - - |
| - -
| - -
continue | <-------------- endElement (b) -- | suspend - |
| resume element method (b) - |
| - |
| - |
suspend | -- return ----------------------> | continue - |
| -
| -
continue | <-------------- endElement (a) -- | suspend |
| resume element method (a) |
| |
| |
suspend | -- return ----------------------> | continue |
|
|
continue | <---------------------- resume -- | end
|
|
V
The scope of element method a encloses the scope of element method b, because element method b is invoked and returns while element method a is suspended. Similarly, scope of the characters method is inside the scope of element method b.
The author chose Java as the implementation language because of his familiarity with the language, and because of the availability of supporting software.
However, Java as a language does not offer any particular advantage over some other languages. Languages like perl and python would work as well, possibly even having an advantage with respect to operating on text.
The essential requirement for any language is that it allow an event-based parser and an application to run on separate stacks but synchronized with each other, as could be done with coroutines [Wilmott-2003].. Since Java does not support coroutines, it was necessary to use separate threads synchronized by exclusive locks. The same approach would work in python, using a python rlock. However, python supports generators. A generator produces output on request, but does not use the same stack as the requesting application. Consequently, a parser running in a generator would be more or less equivalent to a parser running in a coroutine.
Sam Wilmott took a different approach, creating a language [OmniMark] designed from the ground up as a markup stream processing language, in which an XML stream processing API would be superfluous.
Jeremy Carroll [Carroll-2001] also advocates implementing another layer over an event-based parser like SAX but found, in his tests, that coroutine overhead uses 15 to 25 percent of the processing time, due to Java thread-swapping overhead. For this reason, he recommends using the lower level to drive a state machine in the higher level (he call this "inverting" the higher level parser) so that the lower level can run the higher level as a subroutine of the lower level.
The author has not followed up on this line of inquiry, but notes that his own preliminary tests show the Stream Parser API running at a speed more like that of DOM than SAX. However, the same tests show the Stream Parser API taking less than twice as long as SAX.
The choice is no longer between the low resource requirements of SAX and the greater ease-of-use and less complex design requirements of DOM. The “XML Stream Parser API” provides a middle ground for those applications that need the ease-of-use inherent in DOM parsers using XSLT and the ability, inherent in SAX, to handle documents of unlimited size.
The XML Stream Parser API makes SAX parsers easier to use, in much the same way that XSLT makes DOM parsers easier to use. In both cases, application designers can focus on what they want to do with the XML elements in a document, instead of focusing much of their effort on how to make a parser do what they want.
Resequenced output provides a pseudo-random access to the output stream. Consequently, it is easy to handle ID and IDREF attributes, and easy to divorce the sequence of the output and the sequence of the input document. Resequenced output is almost as flexible as the random access to input offered by DOM.
With large documents, the XML Stream Parser API will use much less memory than DOM parsers, even when using resequenced output.
When not using resequenced output, the XML Stream Parser API permits immediate output, prior to the end of the input, while still providing an ease-of-use that was previously available only from XSLT and DOM.
Output redirection within Java increases ease-of-use by making it possible to write element methods that are more self-contained and modular, giving stream parsers some of the advantages that were previously only available from XSLT and DOM.
Thanks to Sam Wilmott for suggestions and criticism.
Thanks to anonymous referees for some very helpful advice.
[Carroll-2001] Carroll, Jeremy J., CoParsing of RDF & XML, HP Laboratories Bristol, 2001