Keywords: Schema, XSLT, XPath, Database, Streaming, E-Business, mapping, XML Database, Enterprise Content Management, E-Commerce, XQuery
As XML formatted content and data becomes pervasive on intranets and the Internet the requirement to minimize individual process times becomes great. XPath has been evolving into a rich expression language to query and extract data in a precise way. While it has been designed to be used by a host language such as XSLT and XQuery, an XPath processor can be used quite usefully standalone or as part of an application framework.
This paper discusses the design of such an XPath-powered framework which supports registration of processes to individual XPaths. These XPaths are compiled with a plug-in analysis engine and satisfied through stream-based document parsing for optimal scalability and performance. A publish-subscribe model allows data consumers to register XPath expressions and function callback handlers to the framework. Through real-world examples, we will discuss how this streaming XPath XML data processing framework can efficiently extract XML data for content management applications populating electronic warehouses or serve as high-performance data router for Web services and application integration applications. We will also present how this framework can be used as a high-performance data acquiring processor for XSLT and XQuery engines.
1. Introduction
2. Data Retrieval Strategies
2.1 DOM Parsing
2.2 SAX and StAX Parsing
3. Design of Extractor: Streaming and Multitasking XML Data Retrieval
3.1 Initialization
3.1.1 XPaths and Namespaces
3.1.2 Execution Options
3.2 XPath Compilation
3.3 XPath Matching
3.4 The Outputs
4. Application Use Cases
4.1 Content Management
4.2 Web Services
4.3 XQuery and XSLT Engine
5. Conclusion
Acknowledgements
Bibliography
As XML becomes the industry standard for exchanging business data, a continuing performance challenge exists in retrieving selective data from large XML documents. As XPath is the XML navigational language, it lies at the foundation of standards-based data retrieving solutions. XPath supported solutions such as XSLT have required creation and traversal of the input XML’s Document Object Model (DOM). In practice, DOMs have required up to 10-times the memory of the original document with traversal APIs that are difficult to optimize. Other XML processing methods, such as SAX and StAX are event-based and require less memory, however, they lack the desired XPath support for retrieving XML data.
In this paper, we will first examine typical requirements for data retrieval from XML documents and then provide an in-depth analysis of existing XML data retrieval strategies and why they are not optimal. We will then discuss the design of a stream-based framework that provides XPath-based XML data retrieval. We will analyze how a lightweight data extraction engine can efficiently match XPaths in a streaming XML process and how a publish-subscribe framework efficiently disseminates XML data to the receiving end. Finally, we describe how this engine can be easily integrated into content management applications in an actual use case and present how it can be used as a high-performance data router for Web services or integrated within XSLT and XQuery engines as a data acquiring processor.
XPath lies at the foundation of standards-based XML content navigation. It allows retrieving XML data not only based on its content but also on the XML document structure. An XML parser determines how to retrieve XML data using XPath.
An XML parser reads the XML document, provides programmatic access to its XML data, and consequently determines how this data can be accessed or retrieved. Nowadays, the most common XML parsers include:
Each XML parsing method has benefits and drawbacks for retrieving XML data. It should also be noted that we have not touched on JAXP or JAXB. The Java Architecture for XML Parsing (JAXP) is simply an encapsulation of the DOM and SAX parsing methods and thus has their same issues in this context. The Java Architecture for Data Binding (JAXB) is really designed for end-to-end XML processing based solely on XML Schema and thus cannot be broadly applied for documents that are based on DTDs or have extensive mixed content as is typical for large document processing.
By representing an XML document in-memory as an object tree, DOM parsers preserve and allow dynamic access to the XML document structure and content. However, this object-based XML processing does not perform or scale well when processing large XML documents because of its high memory costs.
Additionally, DOM-based XML data retrieval has redundant DOM tree traversals when processing multiple XPath expressions. For example, suppose you have two XPaths, /A/B/C and /A/B/C/D, from which you wish to retrieve data in XML document DOM as shown in Figure-1.

Figure 1: DOM XPath Data Retrieval
When evaluating /A/B/C, the DOM parser will select the A element, then iterate its children selecting the B elements under A. After all the B elements are selected, then all the C elements under B are selected and evaluated. Next to evaluate /A/B/C/D, the same process is performed learning nothing from the previous XPath. From Figure-1, you can find that A, B and C elements are processed twice. This is not efficient especially when you have hundreds of XPath expressions as in a large document and can result in scalability issues once the application gets deployed under real-world document loads.
Because of their limited memory use, the stream-based SAX and StAX XML parsers are ideal options for effectively retrieving data from large XML documents. SAX is even better than StaX for multiple XPath evaluations, because you can easily leverage the broadcasted SAX events to multitask within the XML parsing process. However, both of them do not maintain the hierarchical structure of XML documents and thus lack the XPath support.
As more and more real world applications require efficient data retrieval from large XML documents, we have to find a solution that provides a lightweight and efficient strategy to:
To meet the requirements to efficiently retrieve XML data from large XML documents, a project was established to build an XML data extraction utility, called Extractor (XML Extractor).
The Extractor uses SAX parser to parse XML document and extract XML data into datasets using XPath. The reason for choosing SAX parser is that memory usage under SAX processing can be managed independent of document size. Its multitasking support also allows an application to perform other tasks while processing large documents.
Figure-1 illustrates that the Extractor also uses the publish-subscribe model. Under this processing model, all the XPath expressions and their related content handlers are registered to Extractor as subscribers of the data retrieval service. The Extractor processes XML data in SAX and publishes the extracted dataset to the receiving ends through the content handlers.

Figure 2: The Publish/Subscribe Processing Model of Extractor
The reason for choosing the publish/subscribe processing model is that it offers a powerful mechanism for processing SAX events and allows efficient data dissemination with multiple receiving ends.

Figure 3: The Function Blocks with the Extractor
Figure-3 shows the details of the function blocks within the Extractor. First, the Extractor is initialized to subscribe to its services, where an application can register a set of XPath expressions and their corresponding content handlers. The Extractor then can compile and build indexes for all the registered XPath expressions to optimize the XPath matching process.
Key to this compilation process is the maintenance of the XPath context throughout the SAX process, thus a finite state machine is used. When the Extractor receives XML data in SAX, it maintains this state machine for tracking the current XPath semantics and matches the current XPath against the registered XPaths in its index. If there are any matches, the Extractor will publish the extracted data by sending out events to the registered content handlers. To summarize, four major processes are included:
In the following sections, we will discuss each step in more detail.
The Extractor initialization requires registrations of XPath expressions and their respective content handlers to receive the extracted XML. As an XML schema or DTD describes the structure of input XML documents, the Extractor also provides a provision to accept them within this initialization, as it will be able to use the information to optimize retrieval paths.
Each XPath should conform to the XPath standard with support of XML namespaces. As documents can have one un-prefixed namespace known as the default and many prefixed ones the registration process must me able to support associating these to XPaths. Therefore a name-value pair registration format is chosen of the format <namespace-uri>, <XPath expression>. For example, to extract the shipping address from purchase order documents, an XPath expression is registered as:
xmlns:my="http://www.foo.com/xml", /my:PO/Shipping_Address
XPaths also should be able to contain predicates. For example, you could have an XPath to extract all the line items in a PO where the name of the person in the billing address is John Smith:
/PO/LineItems/Item[../../Billing_Address/To/text() =”John Smith”]
As there is a streaming requirement for the document, relative XPaths could produce unexpected results as they depend on a context that does not usually exist in this model. Therefore, all XPath expressions need to be made absolute.
In order to uniquely identify each registered XPath, the Extractor requires each XPath expression registered with a unique ID assigned.
Second, the XContentHandler interface or two built-in content handlers, the XMLSequenceBuilder and XMLSAXSerializer, are available to simplify the receiving of the extracted data from the Extractor. To receive the retrieved XML data, each XPath is associated with a content handler instance which implements one of these interfaces. We provide more details when discussing the Extractor Output.
Within this initialization process the Extractor also needs to provide options to specify the execution behavior. For example, the Extractor needs an option to limit the processing to XPaths 1.0, 2.0 or both to anticipate further functions and data bindings.
Also, as the Extractor is designed to serve as a lightweight data extraction engine, it is necessary that all registered XPaths be streamable. However there may be the case where some level document buffering is acceptable thus an option to control the strictness of the streamable requirement is useful.. Therefore, a flag, isAll=true/false, is included to tell the Extractor to only processes streamable XPath expressions and that all the non-streamable XPaths will not produce results.
For SAX XML processing, all streamable XPath are the XPath expressions that do not include any forward references (children or decedents). When these references are included in the XPath, the Extractor cannot decide the match of the XPath when it receives the data. In other words, it can’t decide whether to dispatch XML data events to the data receivers or not. To support these the Extractor has to preserve XML data in memory.
In the isAll=false mode, the Extractor scales the best, because it will only buffer data associated with the XPath expression endpoint internally when tracking and evaluating XPath expressions. This is useful for applications that require scalable data extraction with a limited range of limited XPath patterns.
On the other hand, because an XQuery or XSLT engine requires processing all kinds of XPath expressions, to enable Extractor to integrate with XQuery and XSQL engine, the flag is set to isAll=true, that tells the Extractor to buffer XML data in order to resolve the XPath expressions requiring forward references. The performance and the scalability may suffer because of buffering the data. However, because the Extractor did not build a DOM in memory, it is still more efficient than the DOM approach.
The XPath compilation process needs to translate and build lookup indexes for all the registered XPath expressions. This process will first split an XPath into its location path and predicate component parts, as these will be processed differently. The location paths will be further compiled to reflect the fact that SAX events will occur in document order and normalized to eliminate redundant matching of common path steps.
Each XPath predicate will be compiled into a predicate dependency table that will then be linked to its associated compiled location path as seen in Figure 4.

Figure 4: Compilation for Each XPath
This table contains its own set of predicate location XPath expressions related with a value or reference to another location XPath expression. These paths may contain either forward (children) or backward (parent) references. For backward references, all the data will be preserved in hash table for later reference. For forward references, depending on the isAll option, the Extractor will have to hold the data in memory until the XPath expression can be evaluated and the SAX events sent out. We will discuss this further in the next XPath Tracking section.
Secondly, an overall index tree will be built based on the location XPath expressions including those in the predicate table. There are two types of index trees for Extractor:
In order to track XPaths, an automaton is built to maintain the hierarchical relationships between XML elements and attributes. Each state in this automaton reflects the current XPath context with SAX events triggering the state transitions. Due to the nature of XPath semantics, the state machine can be implemented as a stack, which keeps the following information:
Based on the information in the stack, a set of XPath expressions are treated as the current XPath. All the current XPaths will be matched against the XPath index built during the XPath compilation.
Without DTD or XML schema, the dependency index is updated when the state machine transitions so that it can provide a list of the registered XPaths that might match the current XPaths. Then a string-based XPath match will be performed between the current element and the XPaths on this list.
With DTD or XML schema, matching of the XPath is then primarily based on the synchronous data model traversal when parsing the XML document as shown in Figure 5. During this traversal, the annotations on the data model are retrieved and evaluated to determine XPath match. This annotation look-up process is much more efficient than the string-match approach taken when DTDs or XML schemas are not available.

Figure 5: Synchronous Data Model Traversal
By using the existing XML schema validation engine, the implementation of such look-up process for the Extractor also takes less time and effort.
We have discussed that isAll=true affects the execution behavior of the Extractor when XPath expressions contain forward (child or decedent) data references. Because the Extractor cannot decide the match of XPath when it receives the data, in order to evaluate such XPaths (when isAll=true), the Extractor has to hold the XML data in memory until it can successfully evaluate the XPath. To minimize the memory use, we create a table similar to the predicate table to record all the matching SAX events in order to determine when to evaluate the predicate. This same process is used to fill in the predicate table in case it includes forward references to SAX events which have not yet occurred. Once the table is complete the value can be sent out.
The Extractor needs to provide flexible outputs to make it easy-to-use. Thus with this model, its outputs can be a set of events, XML Sequence objects or XML files
First, when the Extractor signifies the match of XPath using events, these can easily be pipelined in a stream-based XML processing applications resulting in managed memory use. The events include all the SAX events reporting the XML data with two additional events signifying the XPath match:
Thus an application can register its own handlers. In this case, it needs to implement the XContentHandler interface used by the Extractor.
Second, in order to enable the Extractor to easily integrate with XQuery 1.0 or XSLT 2.0 engines, the XMLSequenceBuilder is provided as a built-in handler, which processes the events sent out by the XPath extraction engine and represents the result set as an XML Sequence object. The XML Sequence is an object defined in XPath 2.0 data model representing the result set of XPath evaluations. In the Extractor implementation, an XMLSequence interface is defined as follows:
public interface XMLSequence { public boolean next(); public XMLItem getCurrentItem(); }
The XMLSequence object contains a list of XMLItems, which includes both the XML data and its datatype:
public interface XMLItem { public OXMLSequenceType getItemType(); public String getLexicalValue(); public boolean getBoolean(); ... public XMLNode getNode(); }
Another build-in handler is the XMLSAXSerializer, simplifies the use of the Extractor when a set of files for the extracted XML data set is required.
Designed on an open XML processing infrastructure, the Extractor can be easily integrated in various applications to efficiently retrieve and disseminate XML data. In this section, we will discuss some real world use cases.
Figure-6 shows a content management system that extracts the XML data and metadata using the Extractor and inserts them into the relational database tables in an Oracle database. The XML data extraction is based on the XPaths stored in an XPath table associated with a DTD and uses the DTD’s sysID and docID or an XML schema’s location URL to retrieve the appropriate set.

Figure 6: Extractor Content Management Use Case
When initializing the Extractor, the content management application retrieves the DTD’s sysID and docID or the XML schema location URL from the XML document and uses them to query the XPath table. After it gets a list of XPaths, the application then registers the XPaths to the Extractor and specifies instances of content handlers to receive the retrieved data. The XML schema or DTD is also provided to the Extractor to optimize the XPath compilation and tracking process.
When XML document is parsed in SAX, the Extractor retrieves the XML data for each XPath and disseminates the data to the corresponding the content handlers. The content handlers then insert the extracted data into either metadata or data tables in the database.
The Extractor in this application delivers high performance XML data retrieval because all the XPath expressions are now processed within one XML document traversal and all the XML processing occurs in a SAX stream that permits memory resource management.
The Extractor can also integrate into Web service proxy servers or clients to extract data from SOAP XML messages. Figure-7 shows a Web service application using the Extractor as a data router to disseminate data to the proper users.

Figure 7: Extractor Web Service Use Case
In this Web service application, all the client applications subscribe to the data services by registering XPath expressions to a service broker. The service broker then sets the XPath expressions to an Extractor. When the service broker invokes a Web service using SOAP and gets back a Web service response in SOAP, the SOAP messages are then parsed and processed by the Extractor, where data is extracted from the SOAP messages and disseminated to each client based on their XPath-based service subscriptions.
In this case, the Extractor not only provides high-performance XML data extraction, it also allows Web service clients to share Web service responses. This can greatly reduce the round-trip traffic of Web service invocations over the Internet and thus improve the performance and scalability of the whole system.
As we have discussed, XQuery and XSLT rely heavily on XPath queries for their execution. The current DOM-based XPath engines suffer in their ability to process large XML documents. The design of the Extractor allows it to provide the XPath services required by XQuery or XSLT through registering XPaths to the Extractor and receiving results in XML Sequence objects defined in XPath 2.0 as shown in Figure-8.

Figure 8: Extractor XSLT/XQuery Use Case
The design of the Extractor delivers a high-performance XML data retrieval engine for powering the XML infrastructure. The XPath data retrieval syntax simplifies its use. The SAX stream-based XML processing framework permits managing the memory consumption regardless of document size. The publish/subscribe processing model seamlessly works with the event-based SAX XML processing and maximizes efficiency of the data dissemination across multiple clients. Moreover, the open architecture enables the Extractor to easily plug into content management applications, Web services and XQuery or XSLT engines thereby enjoying the performance and scalability benefits.
XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.