Abstract
As XML becomes the industry standard for exchanging business data, a continuing performance challenge exists in retrieving selective data from large XML documents. The XML navigational language-XPath-lies at the foundation of standards-based data retrieving solutions. However, XPath supported solutions such as XSLT have up to now required creation and traversal of the input XML's Document Object Model (DOM). In practice, DOMs have required up to 10 times the memory of the original document with traversal APIs that are difficult to optimize. While other XML processing methods such as SAX and StAX are event-based and require less memory, they lack the desired XPath support for retrieving XML data.
This paper examines typical requirements for data retrieval from XML documents and then provides an in-depth analysis of the existing XML data retrieval strategies and why they are not optimal. It then discusses the design of a stream-based framework that provides an XPath-based XML data retrieval. It will analyze how a lightweight data extraction engine can efficiently match XPaths in a streaming XML process and how the publish-subscribe framework efficiently disseminates XML data to the receiving end.
Through real-world examples, it will discuss how this streaming XML data retrieval technology can efficiently disseminate XML data for content management applications. It will also present how this framework can be used as a high-performance data router for Web services or integrating within XSLT and XQuery engines as a data acquiring processor.
Keywords
Table of Contents
As XML becomes the industry standard for exchanging business data, a continuing performance challenge exists in retrieving selective data from large XML documents. The XML navigational language-XPath-lies at the foundation of standards-based data retrieving solutions. However, XPath supported solutions such as XSLT have up to now required creation and traversal of the input XML's Document Object Model (DOM). In practice, DOMs have required up to 10 times the memory of the original document with traversal APIs that are difficult to optimize. While other XML processing methods such as SAX and StAX are event-based and require less memory, they lack the desired XPath support for retrieving XML data.
In this paper, we will first examine typical requirements for data retrieval from XML documents and then provide an in-depth analysis of the existing XML data retrieval strategies and why they are not optimal. We will then discuss the design of a stream-based framework that provides XPath-based XML data retrieval. We will analyze how a lightweight data extraction engine can efficiently match XPaths in a streaming XML process and how a publish-subscribe framework efficiently disseminates XML data to the receiving end. Finally, we describe how this engine can be easily integrated into content management applications in a real-world use case and present how it can be used as a high-performance data router for Web services or integrated within XSLT and XQuery engines as a data acquiring processor.
XPath lies at the foundation of standards-based XML content navigation. It allows retrieving XML data not only based on its content but also on the XML document structure. An XML parser determines how to retrieve data using XPath.
An XML parser reads the XML document, provides a programmatic access to its XML data, and consequently determines how this data can be accessed or retrieved. Nowadays, the most common XML parsers include:
Document Object Model (DOM) Parser: is an XML parser that parses XML documents based on the W3C DOM standards. It represents XML data as an object tree in memory and provides an object-oriented interface to access XML data
Simple API for XML (SAX) Parser: is a stream-based XML parser that parses XML documents based on the W3C Note SAX standard. It represents XML data as a set of events and pushes them to their registered content handlers through function callbacks. Because SAX parsers push out events in a broadcast fashion, it is efficient for multitask XML data processing.
Simple API for XML (StAX) Parser: is another stream-based XML parser that parses XML documents based on the JSR-173 standard. Like SAX, it represents XML data as a set of events, but you have to make event requests from StaX parser through function callbacks or as event objects. In other words, unlike a SAX parser that pushes out all the events, a StAX parser supports a pull model sending out events only if requested. StAX is more friendly for building business applications because you can process XML based on the application logic and easily skip the unneeded XML data.
Each XML parsing method has benefits and drawbacks for retrieving XML data. It should also be noted that we have not touched on JAXP or JAXB. The Java Architecture for XML Parsing (JAXP) is simply an encapsulation of the DOM and SAX parsing methods and thus has their same issues in this context. The Java Architecture for Data Binding (JAXB) is really designed for end-to-end XML processing based solely on XML Schema and thus cannot be broadly applied for documents that are based on DTDs or have extensive mixed content as is typical for large XML document processing.
By representing an XML document in-memory as an object tree, DOM parsers preserve and allow dynamic access to the XML document structure and content. However, this object-based XML processing does not perform or scale well when processing large XML documents because of its high memory costs.
Additionally, DOM-based XML data retrieval has redundant DOM tree traversals when processing multiple XPath expressions. For example, suppose you have two XPaths: /A/B/C and /A/B/C/D, from which you wish to retrieve data in an XML document DOM as shown in Figure-1.
When evaluating /A/B/C, the DOM parser will select the A element, then iterate its children selecting all the B elements under A. After all the B elements are selected, then all the C elements under B are selected and evaluated. Next to evaluate /A/B/C/D, the same process is performed learning nothing from the previous XPath. From Figure-1, you can find that A, B and C elements are processed twice. This is not efficient especially when you have hundreds of XPath expressions as in a large XML document and can result in scalability issues once the application gets deployed under real-world document loads.
Because of their limited memory use, the stream-based SAX and StAX XML parsers are ideal options for effectively retrieving data from large XML documents. SAX is even better than StAX for multiple XPath evaluations, because you can easily leverage the broadcasted SAX events to multitask within the XML parsing process. However, both of them do not maintain the hierarchical structure of XML documents and thus lack the XPath support.
As more and more real world applications require efficient data retrieval from large XML documents, we have to find a solution that provides a lightweight and efficient strategy to:
Retrieve XML data with managed memory resources in order to enable process large XML documents.
Avoid unnecessary XML traversals when processing multiple XPaths in order to efficiently handle hundreds of XPaths per document.
To meet the requirements for efficiently retrieving XML data from large XML documents, a project is established to build an XML data extraction utility, called the Extractor (XML Extractor).
The Extractor uses SAX parser to parse XML document and extract XML data into datasets using XPath. The reason for choosing SAX parser is that memory usage under SAX processing can be managed independent of document size. Its multitasking support also allows an application to perform other tasks while processing large documents.
Figure-2 illustrates that the Extractor also uses the publish-subscribe model. Under this processing model, all the XPath expressions and their related content handlers are registered to Extractor as subscribers of the data retrieval service. The Extractor processes XML data in SAX and publishes the extracted dataset to the receiving ends through the content handlers.
The reason for choosing the publish/subscribe processing model is that it offers a powerful mechanism for processing SAX events and allows efficient data dissemination with multiple receiving ends.
Figure-3 shows the details of the function blocks within the Extractor. First, the Extractor is initialized to subscribe to its services, where an application can register a set of XPath expressions and their corresponding content handlers. The Extractor then can compile and build indexes for all the registered XPath expressions to optimize the XPath matching process.
Key to this compilation process is the maintenance of the XPath context throughout the SAX process, thus a finite state machine is used. When the Extractor receives XML data in SAX, it maintains its state machine for tracking the current XPath semantics and matches the current XPaths against the registered XPaths in its index. If there are any matches, the Extractor will publish the extracted data by sending out events to the registered content handlers. To summarize, four major processes are included:
Initialization: registers XPaths and their content handlers
XPath Compilation: compiles XPaths and builds XPath indexes
XPath Tracking: maintains XPath state and matches XPaths with the indexed XPaths
Output: sends out events reporting start and end of XPath matches along with SAX events reporting the XML data
In the following sections, we will discuss each step in more detail.
The Extractor initialization requires registerations of XPath expressions and their respective content handlers to receive the extracted XML. As an XML schema or DTD describes the structure of input XML documents, the Extractor also provides a provision to accept them within this initialization, as it will be able to use the information to optimize retrieval paths.
Each XPath should conform to the XPath standard with support of XML namespaces. As documents can have one un-prefixed namespace known as the default and many prefixed ones, the registration process must be able to support associating these to XPaths. Therefore, a name-value pair registration format is chosen to specify all the namespaces in the format <namespace prefix>, <namespace-uri>. For example, to extract the shipping address from purchase order documents, you can register the namespaces for the XPath expression as:
("my", "http://www.foo.com/xml")The XPath using the namespace prefix is:
/my:PO/Shipping_Address
XPaths also can have predicates. For example, you can have an XPath to extract all the line items in a PO where the name of the person in the billing address is John Smith:
/PO/LineItems/Item[../../Billing_Address/To/text()="John Smith"]
As there is a streaming requirement for the document, relative XPaths could produce unexpected results as they depend on a context that does not usually exist in this model. Therefore, all XPath expression needs to be made absolute with no relative context.
In order to uniquely identify each registered XPath, the Extractor assigns a unique ID for each registered XPath expression.
Second, the XContentHandler interface or two built-in content handlers, the XMLSequenceBuilder and XMLSAXSerializer, are available to simplify the receiving of the extracted data from the Extractor. To receive the retrieved XML data, each XPath is associated with a content handler instance which implements one of these interfaces We provide more details when discussing the Extractor Output.
Within this initialization process, the Extractor also needs to provide options to specify the execution behavior. For example, the Extractor needs an option to limit the processing to XPath 1.0, 2.0 or both to anticipate further functions and data bindings.
For SAX XML processing, all streamable XPath are the XPath expressions that do not include any forward references (children or decedents). When these references are included in the XPath, the Extractor cannot decide the match of the XPath when it receives the data. In other words, it can't decide whether to dispatch XML data events to the data receivers or not. To support these, Extractor has to preserve XML data in memory.
As the Extractor is designed to serve as a lightweight data extraction engine, it is necessary that all registered XPaths be streamable.However there may be the case where some level document buffering is acceptable thus an option to control the strictness of the streamable requirement is useful. Therefore, a flag, isAll=true/false, is included to tell the Extractor whether only processes streamable XPath expressions with all the non-streamable XPaths with no results, or bugger the document to provide result for non-streamable Xpaths.
In the default isAll=false mode, the Extractor will not buffer any data internally when tracking and evaluating XPath expressions. This is useful for the applications that require scalable data extraction with a limited range of XPath patterns
However, because an XQuery or XSLT engine requires processing all kinds of XPath expressions, to enable Extractor to integrate with XQuery and XSQL engine, the flag is set isAll=true that tells the Extractor to buffer XML data in order to resolve the XPath expressions requiring forward references. The performance and the scalability may suffer because of buffering the data. However, because the Extractor did not build a DOM in memory, it is still more efficient than the DOM approach.
The XPath compilation process needs to translate and build indexes for all the registered XPaths. This processwill first split an XPath into its location path and predicate component parts, as these will be processed differently. The location paths will be further compiled to reflect the fact that SAX events will occur in document order and normalized to eliminate redundant matching of common path steps. Each XPath predicate will be compiled into a predicate dependency table that will then be linked to its associated compiled location path as seen in Figure-4.
The predicate dependency table contains its own set of location XPath expressions related with a value or reference to another location XPath. These paths may contain either forward (children) or backward (parent) references. For backward references, all the data will be preserved in hash table for later reference. For forward references, depending on the isAll option, the Extractor will have to hold the data in memory until the XPath expression can be evaluated and the SAX events send out. In this case, the status of the dependency will be set as yet-to-be-matched so that when later all precidates value are available, the action for re-evaluating the location XPath will be triggered. We will discuss this further in the next XPath Tracking section.
Secondly, an overall index tree will be built based on the location XPath expressions including those in the predicate dependency tables. There are two types of index trees for Extractor:
XPath Dependency Tree: Without DTD or XML schema, the Extractor compiles and builds an XPath dependency tree that is optimized for the stream-based XPath extraction. The dependence tree is a tree build from the registered XPath experssions using a special node called fake node to represent // or / in the XPath as shown in Figure-5.
This dependence tree is annotated with the processing actions and the match of the registered XPath so that they can share computation and limit the transversal lookups for potential matches. For example, /A/B/C/D, /A/B/C/E, and /A/B/C can be collapsed.
Data Model Tree: With DTD or XML Schema, the Extractor builds an in-memory index tree utilizing the data model defined by DTD or XML schema. This data model is then annotated to indicate the local XPath expressions and actions when processing XML data. For example, this will speed up the output of XPath expressions such as //C, as the Extractor will now look up a set of possible location paths defined in the data model.
In order to track XPath, an automaton is built to maintain the hierarchical relationships between XML elements and attributes. Each state in this automaton reflects the current XPath context with the SAX events triggering the state transitions. Due to the nature of XPath semantics, the state machine can be implemented as a stack, which keeps the following information:
The inscope namespaces
The name of the current element.
The attribute of the current element
The position of the node relative to its siblings
The number of its child elements
The current stack level
Based on the information in the stack, a set of XPath expressions are treated as the current XPath. All the current XPaths will be matched against the XPath index built during the XPath compilation. Any predicates evaluated to be false will result in removing the XPath from the tracking list.
Without DTD or XML schema, the XPath dependency tree is traversed synchronously to when parsing the XML document as shown in Figure-6. It can provide list of possible node matches and the match of the registered XPaths.
With DTD or XML schema, matching of the XPath is then primarily based on the synchronous data model traversal when parsing the XML document as shown in Figure-7. During this traversal, the annotations on the data model are retrieved and evaluated to determine XPath match. This annotation look-up process is much more efficient than the string-match approach taken when DTDs or XML schemas are not available.
By using the existing XML schema validation engine, the implementation of such look-up process for the Extractor also take less time and effort.
We have discussed that isAll=true affects the execution behavior of the Extractor when XPath expressions contain forward (child or decedent) data references. Because the Extractor cannot decide the match of XPath when it receive the data, in order to evaluate such XPaths (when isAll=true), the Extractor has to hold the XML data in memory until it can successfully evaluate the XPath. To minimize the memory use, we create a table to record all the SAX events in order to send out the data after all the predicates are evaluated. This similar process is used to fill in the predicate table in case it includes forward references to SAX events which have not yet occurred. Once the table is complete the value can be sent out.
The Extractor needs to provide flexible outputs to make it easy-to-use. Thus with this model, its outputs can be a set of events, XML Sequence objects or XML files.
First, when the Extractor signifies the match of XPath using events, these events can easily be pipelined in stream-based XML processing applications resulting in managed memory use. The events include all the SAX events reporting the XML data with two additional events signifying the XPath match:
The startXPath event: reports the start of XPath match by referring to an XPath_id
The endXPath event: reports the end of XPath match by referring to an XPath_id
Thus an application can register its own handlers. In this case, it needs to implement the XContentHandler interface used by the Extractor.
Second, in order to enable the Extractor to easily integrate with XQuery 1.0 or XSLT 2.0 engine, the XMLSequenceBuilder is provided as a built-in handler, which processes the events sent out by the XPath extraction engine and represents the result set as an XML Sequence object. The XML Sequence is an object defined in XPath 2.0 data model representing the result set of XPath evaluations. In the Extractor implementation, an XMLSequence interface is defined as follows:
public interface XMLSequence {
public boolean next();
public XMLItem getCurrentItem();
}
The XMLSequence object contains a list of XMLItems, which includes both the XML data and its datatype:
public interface XMLItem {
public OXMLSequenceType getItemType();
public String getLexicalValue();
public boolean getBoolean();
...
public XMLNode getNode();
}Another build-in handler is the XMLSAXSerializer, which simplifies the use of the Extractor by creating a set of files for the extracted XML data set.
Designed on an open XML processing infrastructure, the Extractor can be easily integrated in various applications to efficiently retrieve and disseminate XML data. In this section, we will discuss some real world use cases.
Figure-8 shows a content management system that extracts the XML data and metadata using the Extractor and inserts them into the relational database tables in the Oracle database. The XML data extraction is based on the XPaths stored in an XPath table associated with a DTD using the DTD's system ID and public ID, or with an XML schema using the XML Schema Location URL.
When initializing the Extractor, the content management application retrieves the DTD's sysID and docID or the XML schema location URL from the XML document and uses them to query the XPath table. After it gets a list of XPaths, the application then registers the XPaths to the Extractor and specifies instances of content handlers to receive the retrieved data. The XML schema or DTD is also provided to the Extractor to optimize the XPath compilation and tracking process.
When XML document is parsed in SAX, the Extractor retrieves the XML data for each XPath and disseminates the data to the corresponding the content handlers. The content handlers then insert the extracted data into either metadata or data tables in the database.
The Extractor in this application delivers high performance XML data retrieval because all the XPath expressions are now processed within one XML document traversal and all the XML processing occurs in a SAX stream that permits memory resource management.
The Extractor can also integrate into Web service proxy servers or clients to extract data from SOAP XML messages. Figure-9 shows a Web service application using the Extractor as a data router to disseminate data to the proper users.
In this Web service application, all the client applications subscribe to the data services by registering XPath expressions to a service broker. The service broker then sets the XPath expressions to an Extractor. When the service broker invokes Web services using SOAP and gets back Web service responses in SOAP, the SOAP messages are then parsed and processed by the Extractor, where data is extracted from the SOAP messages and disseminated to each client based on their XPath-based service subscriptions.
In this case, the Extractor not only provide high-performance XML data extraction, it also allows Web service clients to share Web service responses. This can greatly reduce the round-trip traffic of Web service invocations over the Internet and thus improve the performance and scalability of the whole system.
As we have discussed, XQuery and XSLT rely heavily on the XPath queries for their execution. The current DOM-based XPath engines suffer in their ability to process large XML documents. The design of the Extractor allows it to provide the XPath services required by the XQuery or XSQL engine through registering XPaths to the Extractor and receiving results in XML Sequence objects defined in XPath 2.0 as shown in Figure-10.
The design of the Extractor delivers a high-performance XML data retrieval engine for powering the XML infrastructure. The XPath data retrieval syntax simplifies its use. The SAX stream-based XML processing framework permits managing the memory consumption regardless of the document size. The publish/subscribe processing model seamlessly works with the event-based SAX XML processing and maximizes efficiency of the data dissemination across multiple clients. Moreover, the open architecture enables the Extractor to easily plug into content management applications, Web services and XQuery or XSLT engines thereby enjoying the performance and scalability benefits.
Thanks Karun. K, Stanley Guan,Anguel Novoselesky, Tim Yu and Ben Chang for their valuable contributions and comments on this paper.
[XQuery10] W3C, XQuery 1.0: An XML Query Language, November 2003
[XSLT20] W3C, XSL Transformations (XSLT) Version 2.0, November 2003
[XPath10] W3C, XML Path Language (XPath) 2.0, November 2003
[DOM] W3C, XML Document Object Model, 2.0, 3.0, 2003
[XMLN10] W3C, XML Namespace 1.0, January 1999
![]() ![]() |
Design & Development by deepX Ltd. |