Acceleration Techniques for XML Processors

Keywords: parsing, acceleration, offloading

Biswadeep Nag
Staff Engineer
Sun Microsystems, Inc.
Performance Engineering
Menlo Park
California
United States of America

Biography

Biswadeep Nag has a Ph.D. in Computer Science from the University of Wisconsin-Madison. He presents regularly at conferences on databases, web services, JavaOne and OracleWorld. He presented two papers and a tutorial at XMLConference 2003. Dr. Nag is currently working at Sun Microsystems on areas related to database performance, XML processing and web services performance. Inside Sun he is a frequent presenter on database and XML performance issues.


Abstract


This paper presents the architecture of an XML Offload Engine (XOE) that can accelerate the processing of XML documents and messages. The XOE plugs into the I/O bus of the host system and is designed to support almost all current and future generations of XML processing technologies. Careful attention has been paid to the design of the communication between the XOE and the host so that most of the existing software stack and user applications can remain unchanged. By attacking the largest common source of XML processing cost, this XOE aims to be effective for the broadest class of XML processing applications.


Table of Contents


1. Introduction
2. The XML Processing Stack
3. XML Acceleration Architecture
4. Design of an Offload Engine
5. The Tokenized XML Format
6. Conclusion

1. Introduction

The processing of XML documents has entered the mainstream of software application development. System and software configuration files, UI specifications and many office documents (such as spreadsheets, presentations, etc.) are being specified as XML. This is in addition to the exchange of SOAP-XML messages in web-service oriented applications that are used for enterprise application integration and B2B transactions.

The problem though is that the emphasis on human readability of XML documents and the use of an ASCII or UTF representation of data has made XML processing relatively CPU intensive. XML processing typically involves the use of an off-the-shelf parser program that has to perform lexical analysis, tokenization and character integrity checks on the XML stream. In addition it has to convert UTF characters to Unicode, verify well-formednes of the document and optionally validate against an XML Schema or DTD. Finally there is the stage that may construct a DOM tree, perform transformations, bind to Java or C# objects and call in to the application program. All this consumes a fair amount of CPU cycles such that most of the time taken in a service-oriented application is spent in parsing XML.

The general idea we espouse in this paper is that general purpose computing processors may not be well-suited to the specialized task of processing XML. It might be far more efficient to offload much of the XML processing to an appliance or I/O coprocessor that handles most of the XML crunching as it comes off the network or is read from a file, much in the manner of a TOE. In addition to the use of special-purpose hardware that can shave off many CPU instructions, there is the added benefit of proximity to the network interface, greater cache locality because of the repeated execution of a limited amount of code, and freeing up cycles on the host CPU that can concentrate on running the user application.

2. The XML Processing Stack

There is a large diversity of applications that are interested in using XML, and so there is an equally large number of XML processing technologies. The W3C has been instrumental in standardizing a large number of XML processing APIs such as SAX, DOM, XSLT, StaX and so on. In the Java ecosystem, the JSR process has played a major role in enveloping the W3C APIs, into industry-standard Java APIs that can be uniformly used in all Java applications. There are several implementations of these APIs and they in turn can use most of the available parsers that support the W3C APIs. In the Microsoft world, there is a similar software stack that implements some of the W3C APIs and has equivalent implementations that support similar functionality in other areas such as binding XML elements to C#. For the purpose of this paper we will illustrate our ideas with examples from the Java stack, but the reader should be able to readily appreciate how the basic concept would apply to any XML processing stack.

app-stack.png

Figure 1: Java Software Stack for XML Processing

Figure 1 shows the standard Java stack for processing XML. Note that there are many implementations possible for each of the Java Standard API, XML Parsing API and XML Parser layers. Applications could even bypass the Java Standard APIs and directly code to the Parsing APIs, but that is not recommended for portability reasons. Using the JAX standards allows the choice of the XML parser to be decided at deployment time. We will now briefly describe the Java Standard APIs for processing XML.

JAXP
The Java API for XML Processing is a generic XML parsing interface that allows the application to be independent of the underlying XML parser implementation. It currently encapsulates the SAX (Simple API for XML), DOM (Document Object Model) and XSLT interfaces and will be extended to include the StaX pull parser model.
JAXB
The Java Architecture for XML Binding is a very programmer-friendly technology that can convert an XML document to an in-memory tree of Java objects whose types corresponds to the elements and attributes specified in the document. JAXB requires the use of an XML Schema that allows it to map elements to complex and simple types. Prior to the binding of XML elements to Java objects, JAXB invokes an XML parser to tokenize the XML document. In the JAXB Reference Implementation, the Xerces parser is used to generate SAX events that can then be used to construct the JAXB tree.
JAX-RPC
The Java API for XML-based RPC is the fundamental way of creating web services using Java. JAX-RPC allows a Java class to make what appear to be method calls on a remote Java application. Under the covers, JAX-RPC serializes the method arguments into an XML document, encapsulates it in a SOAP envelope and sends to the web-service endpoint which may or may not be implemented in Java. The reverse happens for the return results. In a way, JAX-RPC implements the XML to Java binding of JAXB in addition to the messaging. The current reference implementation uses a custom parser but plans are to move to a Xerces based StaX implementation.

3. XML Acceleration Architecture

Our proposed XML acceleration architecture leverages the fact that most of the Java XML processing technologies can make use of a common XML parsing infrastructure as long as they can access it using their desired XML Parsing API such as SAX, DOM, StaX etc. Typically they would not even care about what exact XML parser is used underneath. However in our proposal, it would be possible to create accelerated versions of most currently available XML parsers, so much of the software stack remains unchanged.

parser.png

Figure 2: Schematic of XML Parser

Figure 2 shows a blowup of a typical XML parser architecture. Though each parser implementation will differ in their exact class diagram, they should each implement the functionality shown. In Figure 2, the XML tokens flow from the bottom to the top. Let us now discuss each of the modules in more detail.

Decoder
This module reads in UTF-8 or UTF-16 characters from the input stream and converts them to the character format of the host programming language. In the case of Java, this format is 16-bit Unicode. In addition, this module makes sure that the inputs characters are part of a valid XML character set.
Scanner
This classifies the characters in the XML input stream to locate (namespace qualified) element and attribute names, element content, attribute values and the like. In essence this implements the lexical analyzer corresponding to the XML grammar that searches for characters such as <, >, = , : etc. to tokenize the XML document.
Parser
This is the module that checks for well-formedness of an XML document. Well-formedness requires that the begin- and end-element tags are properly nested and matched. This requires the implementation of a stack that contains all the begin-element names that have been seen so far but that have not been matched by a corresponding end-element tag.
Validator
Schema or DTD validation is an optional part of XML processing. It is often required that an XML document conform to a particular XML schema, particularly in cases involving transactions between untrusted parties. However in the case of JAXB or JAX-RPC, the schema is compiled into higher Java classes that perform the validation outside the parser.
API Implementor
This layer provides the SAX, DOM or StaX interface that is required by the calling application. In most cases, (and in the particular case of the Xerces parser), the same underlying parsing engine can be used to do the tokenization, well-formedness checking and validation while the API implementor can be customized to either generate SAX events, build a DOM tree or supply nodes on demand as a pull-parser.

There will of course be other assisting modules in a functional parser such as an entity resolver, a grammar cache pool and so on that have not been depicted in Figure 2. Though such modules are important we consider them peripheral to the issue of XML acceleration. The above architecture has been loosely based on the open-source Xerces Java parser, though we believe that any usable XML parser, regardless of its structure has to implement very similar functionality.

4. Design of an Offload Engine

Offload engines have become very popular in accelerating repetitive and computationally expensive tasks. Examples include the TCP/IP Offload Engine found in many network interface cards, and SSL or cryptographic algorithm accelerators. We believe the exchange of XML messages is going to become equally commonplace and will also burn up a large number of CPU cycles. The use of an XML offload engine is designed to take care of two issues:

  1. Freeing up some cycles from the host CPU that were previously used for XML parsing.
  2. Use special purpose hardware to speed up certain parsing tasks.

There have been several different proposals for special purpose XML accelerators from Datapower, Tarari and Convergys. These have typically concentrated on narrow problems such as XPath and XSLT processing. It is difficult to justify the high cost of development, and procurement of these acclerators given the limited scope of deployment. Instead, we chose to focus on the most common subset of the XML processing problem: that of generic XML parsing. Figure 3 shows our proposal for which parts of the XML processing stack are targeted for offloading.

offload.png

Figure 3: XML Offloading Architecture

It is clear from Figures 1 & 3 that is possible to build all the common XML applications in use today on top of a common parsing infrastructure. By drawing the boundary of the XML Offload Engine (XOE) between the parser and the validator module, we are ensuring that all XML applications can take advantage of the XOE. At the same time, parts of the stack that are optional (such as validation) and those that will differ from one application to the other (e.g. XML Parsing APIs) can be implemented in software on the host CPU. The goal is to also keep most of the software running on the host be unchanged. For example, the implementation of the JAXB and JAX-RPC APIs will not have to change if the underlying software provides the SAX & StaX APIs. In fact existing XML parsers, such as Xerces and XPP3 can take advantage of the XOE and eliminate much of their code path.

Similarly, the expectation is that the design of the XOE, which is a combination of special-purpose hardware and firmware, should be independent of the current generation of XML APIs and even independent of particular parser implementations. It should concentrate on solving problems that are of general applicability in all XML parsing situations. The goal is to keep the optional and variable elements of XML processing still running on the host CPU in software. This led us to the XOE design illustrated in Figure 4.

xoe.png

Figure 4: XML Offload Engine

Figure 4 shows our XOE prototype in the form of hardware card that can be plugged into a PCI or PCI-X slot in a host system. The specialized hardware in the XOE is used to implement the UTF to Unicode decoder and also the scanner that tokenizes the XML stream to segment it into element and attribute names as well as element contents and attribute values. The hardware can do the decoding in multi-byte chunks and at the same time do much of the regular expression matching involved in tokenization in parallel. The parser/well-formedness checker that runs in the XOE firmware makes use of two data structures:

  1. A symbol table that stores all the (namespace qualified) element and attribute names. This is used to make sure that there is only one copy of each name. The symbol table is usually implemented as a hash table where an incoming name is compared with the existing names and a pointer is returned to the unique copy of the name.
  2. A stack that stores all unmatched element names seen so far. The stack actually stores pointers to the strings in the symbol table. When an end element tag is encountered, its pointer that is returned from the symbol table is compared with the pointer from the top of the stack. A match indicates that well-formedness is preserved.

The choice of what XOE features are implemented in hardware versus what would are implemented in firmware is a tradeoff that weighs the benefits against the real-estate costs required for the hardware solutions. It also make sense to combine the XOE with the smart network interfaces available today that implement a TOE (TCP Offload Engine) and a crypto accelerator (for SSL sessions). Though this would be a natural fit for processing XML that is coming over a wire, it could also be used to process an XML document stored in the local filesystem by transmitting the document over the I/O bus to the XOE. Instead of creating an offline XML appliance, we chose to design an inline network card that can also perform in coprocessor mode. The crucial aspect of the XOE design is to make sure that the additional latencies and CPU overhead associated with communicating with the XOE do not offset the benefit of offloading. This led us to carefully consider the data exchange format between the XOE and host CPU that is described in the next section.

5. The Tokenized XML Format

There are three major design goals of the communication infrastructure between the host CPU and the XOE.

  1. It should be low latency. This rules out back-and-forth communication between the host and the XOE and requires that data is transferred only in streaming mode.
  2. It should not consume more bandwidth than the original XML document moving across the I/O bus. We shall show that this is achievable by using a more compact tokenized XML format that we will describe in this section.
  3. Processing the semi-parsed tokenized XML should consume significantly less CPU cycles in the host as compared to the original XML. Our measurements show that this is indeed the case with at least 50% CPU savings being realized by typical XML parsing applications.

The tokenized XML format represents the output of the XOE that is consumed by the software running on the host and tailored for the XML application being accelerated. SAX events, DOM trees and StaX nodes can all be generated from it. While the tokenized XML encodes the complete XML infoset, the ability to make certain assumptions allows a more efficient encoding than the original document. The format is easier to explain if we use an example. Suppose the following XML fragment is processed by the XOE.

<customer id="1234">
  <name>
    <firstname>John</firstname>
    <lastname>Bull</lastname>
  </name>
</customer>
<customer id="4321">
  <name>
    <firstname>Jane</firstname>
    <lastname>Doe</lastname>
  </name>
</customer>
        

This will be converted into a message that has two parts. The first part is a symbol table (that has also been internally used in the XOE) containing all the element and attribute names found in the document.

Symbol Table
KeyValue
1customer
2id
3name
4firstname
5lastname

Table 1

The second part of the message will contain all the parsed XML tokens in document order. This will make references to the symbol table of the first part. It will also make use of a common code to differentiate between element, attribute and namespace names etc.

Code Table
CodeValue
1Begin Element Name
2Attribute Name
3Namespace Name
4End Element
5Element Content
6Attribute Value

Table 2

Now comes the actual tokenized XML corresponding to the above document fragment.

Tokenized XML
CodeLengthValue
1-1
2-2
641234
1-3
1-4
54John
4--
1-5
54Bull
4--
4--
4--
1-1
2-2
641234
1-3
1-4
54Jane
4--
1-5
54Doe
4--
4--
4--

Table 3

There are a number of observations to be made here:

Let us now see how this tokenized XML format satisfied the requirements outlined at the beginning of this section.

Latency
The tokenized XML can be sent in streaming mode or in large chunks (with appropriate buffering) to avoid multiple transactions across the I/O bus
Bandwidth
The tokenized XML is obviously smaller in size as compared to the original document. Compression is achieved by eliminating redundancy by using a symbol table.
CPU savings
It is cheaper to process the tokenized XML because of the following reasons:
  1. The XML markup is in binary format. For the strings representing element content and attribute values the length is specified. This eliminates much of the string processing cost.
  2. The lexical checking (including verification of valid XML characters) of the XML document has already been done. This also includes the conversion from UTF to unicode.
  3. There is no further need for well-formedness checking.

6. Conclusion

In this paper we presented an architecture for accelerating XML document processing by using an XML Offload Engine (XOE). The XOE is a combination of hardware and firmware that plugs into the I/O bus of a system and does some of the fundamental XML parsing that is common to all XML processing and web-service applications. As such it as wide applicability, has the ability to support all current and future XML processing APIs and at the same time is not tied to any particular technology. It is optimized for latency, bandwidth and for reducing CPU consumption on the host system. It is a natural fit with network protocol and crypto acceleration technologies. We expect such general purpose XML Offload Engines to become widely available in the near future.

XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.