Anatomy of a Native XML Database

Keywords: Database, XML, XQuery, XPath, DOM, Transaction, Application architecture

George Feinberg
Sleepycat Software
Falmouth
Maine
United States of America
gmf@sleepycat.com

Biography

George Feinberg is responsible for the architecture and implementation of the Berkeley DB XML native XML database, at Sleepycat Software.


Abstract


Most people in the XML community are aware of the term, "Native XML Database." Fewer are aware of the design details and implementation trade-offs made in construction of a native XML database.

This paper focuses on issues surrounding storage in a native XML database. The format of stored XML, as well as the granularity of stored documents, has a large effect on database design and scalability, as well as how a system may be used by an application. Indexing of stored information is another topic that is at the core of XML database performance.

While database design in these areas is not explicitly dictated by XML-related specifications, it is strongly influenced by such standards. Where appropriate, the influence of standards is discussed. A native XML database involves more than just these topics. Other relevant design issues will be covered in the context of the primary subjects.


Table of Contents


1. Introduction
     1.1 Terminology
2. Storage Formats and Granularity
     2.1 Intact Document Storage
     2.2 Fine Granularity Storage
          2.2.1 Round Tripping
          2.2.2 Data Model
          2.2.3 Granularity
          2.2.4 Physical Storage
               2.2.4.1 Node IDs
               2.2.4.2 Text and Physical Storage
          2.2.5 Partial Document Modification
          2.2.6 Granularity and Serialization
3. Indexing
     3.1 Index Types
          3.1.1 Structural Indexes
          3.1.2 Value Indexes
          3.1.3 Miscellaneous Index Types
     3.2 Index Scope and Target
     3.3 Index Maintenance
          3.3.1 Index Maintenance Example
     3.4 Indexes and Query Processing
4. Summary
Acknowledgements
Bibliography

1. Introduction

Imagine you have decided to build a native XML database, from scratch. Now, imagine some of the decisions you will have to make in terms of requirements, features, design trade-offs, and implementation. Some of the more user visible choices include these topics:

Such choices are important, particularly transactional support, as it significantly affects the programming model. While these are important, and have a large impact on a system's usability and feature set, this paper addresses some less visible design decisions that can dramatically affect the end product. These are how XML is stored and indexed in a native XML database. Special consideration is given to XML standards and how they affect system design in these areas.

The simplest way to think of database access is in this pattern:

  1. store some things
  2. index things
  3. query over stored things
  4. handle query results

In a relational database, pieces of a relational table are being stored, queries are SQL, and results are, again, tabular. This abstraction, and standardization is very powerful, and is what makes one relational database look very similar to another, from a user's perspective. What is less user-visible, and less considered, is precisely how "things" are stored, how they are indexed, and how a query can leverage the combination of storage format, indexes, and query language, to quickly answer a question.

The same concepts exist in a native XML database. In this case, the "things" are XML documents, the query may be an XPath [XPath] expression, or draft specification of XQuery 1.0 [XQuery]. The results may be XML documents, DOM, SAX, or a proprietary form.

The heart of a native XML database, or any database, is how information is stored, indexed, and queried. These design details may not be obvious from the outside, but are critical to the function, performance, and scalability of a database system. A native XML database exposes a logical model of storing and retrieving XML documents; however, its internal storage model is not necessarily equivalent to the document. The storage format, in many ways, drives the features that can be supported by the database.

Indexing is a crucial of any database. Without the ability to intelligently index stored information, a database system is little better than a file system for information retrieval. Indexing XML is interesting because there are a number of options on what, and how to index, some of which are dictated by the storage format.

Query processing, which builds on both storage format and indexes, is covered relatively briefly in this paper. It is a large topic that cannot be satisfactorily addressed in a small space. Implementation and performance of query processing is affected not only by storage format and indexes, but by the query language(s) available. Some systems use XPath 1.0, or XSLT 1.0. At this time, they are the major query-related W3C recommendations. Although the 2.0 versions of these recommendations, and that of XQuery 1.0, are not yet complete, drafts of these specifications have been implemented in a number of products.

1.1 Terminology

For the purposes of this paper, it is assumed that native XML databases store XML documents as a set, in a container or collection, which will be called a "collection." It is further assumed that a document within a container can be referenced by a unique name or document identifier (id).

The term, "round-tripping," is used. In general, this refers to getting the same XML out of a system (e.g. database) as you put in. There are varying degrees of round-tripping. The simplest refers to getting XML with the same XML Infoset [Infoset] from the system. It can also refer to getting the same characters, byte-for-byte, including things like identical attribute order and XML declaration. In this paper, the term, 100% round-tripping is used to imply the latter case.

In the interest of brevity, there is no discussion on "what is a native XML database?" R. P. Bourret has a very complete discussion of XML and databases, including references to native XML databases, on his web site [Bourret].

2. Storage Formats and Granularity

Most, if not all, native XML databases are oriented toward storing XML, which means documents. Nearly all input operations will be creating a new document, removing a document, or modifying a document in place. In database terms, granularity can be described in several different ways:

A distinction is made between access granularity and addressability. For example, access may be provided via DOM to a system with an addressable granularity of XML document, by parsing the document. In this sense, access granularity is user-visible, while addressability is an internal concept.

This section discusses storage format and granularity choices for XML databases, along with their respective advantages, disadvantages, and other design trade-offs.

2.1 Intact Document Storage

There are two major choices in terms of how to store a document -- intact and not intact. The first is easiest to talk about; in terms of granularity, it provides an addressable granularity of XML document. While it may seem simplistic to store documents intact, quite a number of features can be provided by such a system.

Systems that store XML documents intact will generally parse the XML in order to ensure it is well-formed, or for validation, but otherwise store the documents unchanged. This is useful for applications that require 100% round-tripping capability (in terms of byte comparisons), which is difficult to attain any other way. Further, for relatively small documents that tend to be retrieved, and processed whole, such a system is ideal. The major issue for intact document storage is how to address target documents in a collection of documents. There are 2 primary mechanisms:

  1. by some unique identifier, such as name or document id
  2. by query (XPath, XQuery)

The first results in exactly one document. A query may result in N documents in a result set.

For a large collection, it must be possible to target a small set of result documents in a query. For intact document storage, this implies an indexing mechanism. If a document is parsed upon insertion into a collection, it can be indexed, as well, based on the system's indexing specifications. Indexes in this type of system will use document granularity addressing. It is desirable to avoid parsing documents in order to resolve a query. Additional parsing can be avoided if these conditions are met:

A clear disadvantage of intact document storage is that for certain applications and queries, it can take a long time, and a large amount of memory, to process a request, this is mostly due to the need to parse documents to satisfy a query.

There is a type of system worth mentioning that stores documents intact that can use a finer granularity addressing scheme than document. If it is known that a document is read-only, and will never be partially modified, it is safe to reference offsets (byte, or other) within it, for indexing purposes.

In summary, intact document storage has the following properties:

It is useful for applications with some or all of these characteristics:

2.2 Fine Granularity Storage

Many native XML databases store documents at a granularity finer than document. The granularity properties of such systems are:

There are a number of reasons to store documents in pieces, including:

Along with the reasons for storing a document in pieces come a number of issues, including:

The bulk of this section is devoted to discussion of options when designing and implementing fine-grained document storage.

2.2.1 Round Tripping

Round tripping is the term used for getting back the same document inserted. 100% round tripping, as used here, means that a document retrieved is byte-for-byte the same as the document inserted into the database. This is most easily achievable using intact document storage. Virtually any decomposition of a document for storage results in loss, or change of information, such as reordering of attributes, or a change in the XML declaration. This is because there is not a 1:1 mapping from XML Infoset to bytes in a document. That is, there are bytes within an XML document that are not considered relevant to the infoset, and therefore, may not even be passed through by a parser.

A fine-grained storage system needs to choose the degree of round tripping it will attempt to support. For example, will it track entity references, which are generally expanded during parsing? Will it track ignorable white space, such as that before and after the document element? Such decisions are unimportant in terms of querying, and retrieval of partial documents, but for a given application, can be very important in terms of document serialization. Some systems export configuration options to determine handling of some of these issues.

While not specific to round tripping, namespace awareness in a fine-grained storage system is relevant to the topic. Namespace-aware systems need to store expanded qualified names, which means associating a namespace URI with a local name. The issue is, once a document is parsed, namespace prefixes are somewhat irrelevant, semantically, and a system is free to choose its own prefixes for serialization purposes. Tracking the original prefix mappings, while more complex, is necessary for more complete round tripping.

2.2.2 Data Model

Intact document storage has the vastly simplifying advantage of being unconcerned with the data model of the XML documents it stores. Fine-grained document storage must make a choice of data model stored. This decision can be closely tied to query processing and query language support. For example, XQuery's data model is typed, and type information can appear in XQuery expressions; however, XPath 1.0 expressions are not richly typed, so no additional type information is necessary.

A simple example of the data model issue is DOM vs XQuery. The DOM is relatively simple, where most every object is a Node, some nodes have names, some have values, and some have children and siblings. The DOM is essentially a tree, with little semantic information, and virtually all of its information is contained in the XML document itself. On the other hand, the XQuery data model is typed. XQuery does support simple, well-formed XML; however, it also supports type information, as obtained from a schema-validated document, where the schema information comes from outside the document.

It is possible to choose a storage data model equivalent to the XML Infoset, or DOM, but then the powerful type facilities of XPath 2.0 and XQuery 1.0 are not fully available. However, designing and implementing a typed data model adds complexity to a native XML database. A schema-validated document has type information available at the time it is parsed and validated. A system where parsing/validation and querying occur at the same time has no problem obtaining type information to satisfy the query. However, in a fine-grained storage system, the parse/validate and query events are not related. This means that at the time of the query, type information must be found, if it is to be used for the query. Here are several choices on how a system can implement types:

2.2.3 Granularity

Closely tied to data model is the topic of granularity of addressability in fine-grained storage of XML. At one extreme is the choice of DOM objects as the addressable unit. This means that each DOM node, be it a document, element, or attribute value, is an addressable, and perhaps separately stored, object. While simple, this approach is quite expensive in terms of memory, disk space, and CPU.

There are other, coarser-grained solutions. One is using the element as an addressable unit, and closely associating its attributes and child text nodes. Another is to address elements and text nodes, and associate attributes with elements. There are no clear advantages to either approach. The former may be better for locality of reference, if an element it its attributes and text nodes are likely to be referenced together.

Yet another approach, called "mixed-granularity," combines some of the features of intact document storage by storing selected sub-trees of a document as one, addressable XML unit. Querying into such sub-trees would be expensive, and complex, as it would require parsing of the XML; however, the intent of such a model is that intact sub-trees would be used only for large pieces of text that are not relevant to the major queries used by the application. In such a system, it is possible to think of intact document storage as the "degenerate" case. Or, conversely, a fully-exploded, fine-grained document could be considered degenerate. At this writing, there are no known systems that use this style of storage; although it has been considered.

2.2.4 Physical Storage

Native XML databases that store documents as fine-grained nodes must assign addressable node identifiers (node ids) to addressable units. Node ids can be used to retrieve specific nodes during processing. When it comes to physical storage, size matters. Smaller nodes and node ids means better locality of reference and fewer disk accesses to read and write data.

A common choice of data structure for storing nodes is the B-tree, where node ids are allocated in document order, which is also iteration order on the B-tree. This means that once a node is located, serialization or child navigation can occur via iteration, rather than additional lookup operations.

2.2.4.1 Node IDs

With the appropriate sorting/comparison function, a node id that is a B-tree key can take on many physical forms. It can be as simple as an integer, or it can be a complex array or string. Node numbering is one of the more interesting, and important design choices in a native XML database. There are node numbering schemes that have the following properties:

There is a very good discussion, with some references, regarding numbering schemes in a paper on the eXist native XML database [eXist]. The eXist system has a carefully chosen numbering scheme that allows for DOM-style navigation to related nodes without direct references, using numbers, and numbering rules, alone. The disadvantage of the scheme is that it is difficult to insert new nodes without renumbering (although there is supposed to be ongoing work to fix this). Berkeley DB XML, release 2.0, uses a numbering scheme that allows some direct relationship comparisons, and attempts to minimize the need to materialize nodes for navigation. While it doesn't have the features of the eXist scheme, in that navigation requires "pointers," it avoids renumbering in the face of partial document modifications.

2.2.4.2 Text and Physical Storage

XML documents can have a lot of repetition in text. For example, there may be hundreds, or thousands of instances of the element, "name," in a given document. It is possible to factor this out of the physical storage of the document, and reference the name by a name id. This additional mapping appears to be a logical choice, mostly because of potential space savings along with possible reduction in disk access. There are some less obvious implications of using this indirection, which is why it is not always done:

Another aspect of text storage is encoding of the text. Two common choices for encoding inside a system are UTF-8 and UTF-16. Which of these is most space efficient is a function of the language used for the XML. The other goal is to avoid unnecessary transcoding of strings during normal operations, if possible. This is not much of an issue for systems implemented in pure Java, because it uses UTF-16 internally, and most systems will not change that. One more point on encoding is that stored UTF-8 is portable between big-endian and little-endian platforms, but UTF-16, which depends on byte order, is not, and must be translated. Systems that desire, or require format exchange among platforms may be safer with UTF-8 as their "native" encoding.

2.2.5 Partial Document Modification

Intact document storage requires that an entire document be replaced if any part of it is modified. This is potentially expensive, but simple. Fine-grained storage allows a small piece of a document to be modified separately, without touching its bulk. There is a significant potential performance and scalability benefit in such "surgical" changes; however, it can be quite difficult to do correctly, and efficiently. It is for this reason that some systems do not support partial modification of documents, or, if they do, it is only by means of a well-defined update interface, such as XUpdate [XUpdate], and not through a direct DOM manipulation.

A partial modification can render a document invalid, or worse, malformed; however, re-parsing for validation negates much of the benefit of partial modifications. Insertion or removal of an addressable object, such as an element, affects the system's node numbering scheme, as described above. Indexes are also affected. Consider changing the value of an indexed attribute, or indexed text node. Not only is the value changed, an index update must determine the location of the affected object within the document.

Because of the expense and potential complexity, a native XML database that supports partial update may choose to not re-validate or parse after a modification unless the applications explicitly requests it to be done. This is an area where most XML databases diverge from relational databases, where a change that is incompatible with the database schema is not tolerated. Native XML databases typically store either well-formed or validated XML, which means they do not need a schema to store or query XML.

2.2.6 Granularity and Serialization

Fine-grained document storage has an advantage of not requiring parsing of documents to satisfy queries, or for partial document retrieval. Where it does not shine is in serialization of the entire document, or large portions thereof. In this situation, it is required that an iterator traverse the addressable pieces of the document, serializing it. If this is a common operation, it is worth optimizing, or perhaps caching the serialized document for reuse (which creates a possible concurrency problem). A native XML database can optimize document serialization in several ways:

3. Indexing

Indexes make databases fast. Proper specification and use of indexes can increase query speeds by orders of magnitude. Indexes can be large. As with most things in Computer Science, a speed benefit has a space trade-off. Indexes can cause some operations to be slow. In the face of frequent data update, excessive index specification can outweigh its benefits.

There are some issues specific to indexes and native XML databases. The data models for querying XML imply that virtually all indexes deal with elements, attributes, and their respective text content, as well as possible data types represented by their value strings. However, there is no standard or convention on how to specify indexes, or even what is indexed, and how. Different XML databases have made different choices regarding indexes in these areas:

3.1 Index Types

3.1.1 Structural Indexes

Structural indexes are used for tracking structure and path information, such as "track existence of all element nodes with the path, '/a/b/c'," or "track all paths to the node, 'c'." Such indexes are useful for navigational portions of queries. Not all indexes give definitive answers. That is, often indexes are used to restrict the result set to a smaller set of possible results. For example, the index above that tracks all paths "/a/b/c" can be positive about its answer to the query, "/a/b/c." The index that tracks all paths to "c" cannot be definite, since it will also contain entries for paths such as "/e/f/c." A special case of a structural index is an existence index, that tracks whether or not a specific element or attribute exists at all in a collection or document.

3.1.2 Value Indexes

Value indexes are used to track all values for specific elements or attributes. A value index on the element, "color," would have an index entry for every separate instance of "color," and would be useful for a query such as "//color[.='green']". Value indexes may be typed, so that comparisons can be performed correctly. For example, the value of this element <nameid>1234</nameid> could be an integer, or it could be a string. The type of an index dictates how it is handled. The typed data model of XPath 2.0 and XQuery 1.0 brings a long list of potential data types from the XML Schema recommendation, such as xs:date, xs:time, and various numeric formats. Support for typed indexes allows applications to use them directly, rather than modify their content to map, for example, xs:datetime to integer, so that range-based comparisons can be used.

Another type of value index is a substring index. Such an index tracks all substrings for a specific element or attribute. It can be used to speed up substring queries, such as XQuery's contains() or starts-with().

3.1.3 Miscellaneous Index Types

Full-text indexing is a topic unto itself. Full-text indexes and searching as people expect on the web is its own discipline. There is a working draft for full-text extensions to XQuery [XQueryFT], it is not in general use. Some native XML database products implement what they call full-text indexing. Such support is minimally a word index over a document. Because there is no standard, use of a full-text index requires a propriety query language or extension as an interface. Even with those restrictions, full-text searching is quite useful for locating text within XML with little or no knowledge of where that text might be. For example, "find all of the documents that have the text, 'to be or not to be.'"

Orthogonal to index types is whether an index enforces uniqueness within a given scope. This is a useful feature to ensure that a document has only one "name" element, for example. If a system has a unique value index on the element, "name," and a document is inserted that contains two "name" elements, the system would reject that document. Unique indexes provide very definitive answers, since within its chosen scope (presumably document), there can be only one entry.

Another dimension of index type is how indexes are specified. "Voluntary" indexes are explicitly specified, by an interface to the system. These indexes allow for some experimentation to find the minimal, useful set of indexes. Some systems have "automatic" indexes, where a well-defined set of indexes is always created, except for those which are explicitly disabled, via configuration or interface. Any system may choose to have "required" indexes, that cannot be disabled. Required indexes are those which are necessary for proper functioning of the system.

3.2 Index Scope and Target

Most native XML databases store documents in a collection. The scope of a given index could be collection-wide, or it could be restricted to a single document. Related to scope is target, or the object referenced by an index entry. It can be a document, or an object within a document. A native XML database system can choose which scopes and targets it implements. Queries against a collection could return documents or sets of nodes within documents. In order to support efficient restriction of a query to a manageable set of documents, the system must support indexes at the collection scope. This does not mean that it is not also possible to have indexes at the document scope, which contain entries that only apply to a given document.

Another aspect of an index is its target -- does it reference a document or within a document, or both? An index is capable of pointing down to the addressable unit in the system, but such granularity is not always necessary, and can be expensive. Because navigational operations within a document stored with fine granularity are not as expensive as those for intact document storage (due to parsing), it can be sufficient to return the document element, for further navigation. While this is possible, it is the case that most database systems with fine-grained document storage will reference directly to nodes in indexes, rather than to the containing documents.

3.3 Index Maintenance

Index maintenance is the downside of using indexes. Many operations affect an index, especially one at collection scope:

Any time an index must be modified, not only is it expensive in terms of computation, it creates potential concurrency bottlenecks, where lookup operations in an index are being blocked by modifications in an unrelated operation.

Index scope has an important effect on maintenance. Document-scope indexes are easy to change, and remove, without affecting other documents in a collection. Collection-scope indexes can easily affect unrelated operations when, for example, a document is removed, and all index entries that use it are removed, or invalidated. There are techniques, such as invalidation of a given document id, vs entry removal, that can be used to make such operations faster. This technique allows the index to be lazily modified for entry cleanup, at a later time, perhaps via an explicit interface.

3.3.1 Index Maintenance Example

An example is worth considering. Let's say a system has collection scope indexes that target nodes within documents. This means that a given index potentially has entries for many different elements or attributes within each document. Consider removing a document, and how indexes may be invalidated or removed. Cheap invalidation was mentioned, above, but ultimately entries must be removed. It's necessary to find all index entries for the removed document, and remove them. It's possible that references to them could be saved with the document when it is first indexed; however, that operation itself can create space and concurrency issues. Another way is to "re-index" the document, as if it were being indexed for the first time, but with the intent of removing entries, rather than inserting. It's possible to think that index entries could be arranged in order of documents, to make removal fast; however, such organization does not lend itself to index lookup, which is the real point of using an index.

Another case to consider is partial document modification in the face of index updates. The value of an index entry is the target node ID. This means that the node ID must not change, or the index is corrupt. The implication is that node ID renumbering is undesirable. This is one, very good reason to support a node numbering scheme that does not result in the need to renumber all, or part of a document. Such renumbering means renumbering, or mapping node numbers, in all relevant indexes (there is usually more than one index present in a collection).

3.4 Indexes and Query Processing

Query processing and optimization is another topic which is its own discipline, and is merely touched upon here. Regardless of specific query language used, the flow of query processing is generally:

  1. parse the query expression into an execution tree
  2. optimize the tree based on operations
  3. optimize the tree based on indexes
  4. execute the tree

This is overly simplified, but can be used for a brief discussion on the relevance of indexes to query processing. In the case of XML, the query expression is likely XPath 1.0, XPath 2.0, or XQuery. The presence of indexes has a very large effect on query processing and optimization, which is reflected in the performance of indexed queries. Use of indexes also affects how an application might write a query. It is possible to construct complex queries that defeat the abilities of a query optimizer to find a useful index, even if one is present. Also, rewriting a query to "help" a query processor find relevant indexes is also possible.

A schema, if it exists for a document, can be used as an index, of sorts, by query mechanisms. For example, schema information can be used to quickly determine that a given element or attribute name, or position cannot exist in that document. For systems that support, or require a collection of documents to have the same schema, the answer applies to all documents in that collection.

Another optimization that can be used is to use the query itself as a form of implied schema. If a collection has an associated, required schema, and a query implies a different schema, it's easy to determine that there will be no results. An implied schema, determined when parsing a query, can be combined with normal indexes to more quickly process a query.

4. Summary

This discussion has pointed out that, while not visible on the outside, the methods chosen to store and index XML documents in a native XML database have a significant impact on the system. These core choices drive the performance, scalability, and features available in the system.

Acknowledgements

Thank you to John PC Snelson and Gareth Reakes, of Parthenon Computing, LTD, for their help in reviewing this text, as well as John Merrells, also of Parthenon, for getting me back into the world of XML.

Bibliography

[Infoset]
XML Information Set (Second Edition) J. Cowan, R. Tobin, editors, W3C Recommendation, 4 February 2004. Available at http://www.w3.org/TR/xml-infoset.
[XQueryModel]
XQuery 1.0 and XPath 2.0 Data Model M. Fernandez, A. Malhotra, J. Marsh, M. Nagy, N. Walsh, editors, W3C Working Draft 23 July 2004. Available at http://www.w3.org/TR/xpath-datamodel.
[XQuery]
XQuery 1.0: An XML Query Language S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon, editors, W3C Working Draft 23 July 2004. Available at http://www.w3.org/TR/xquery.
[XQueryFT]
XQuery 1.0 and XPath 2.0 Full-Text S. Amer-Yahia, C. Botev, S. Buxton, P. Case, J. Doerre, D. McBeath, M. Rys, J.Shanmugasundaram, editors, W3C Working Draft 09 July 2004. Available at http://www.w3.org/TR/xquery-full-text.
[XUpdate]
XML Update Language A. Laux, L. Martin, Working Draft 09 September 2000. Available at http://xmldb-org.sourceforge.net/xupdate.
[XPath]
XML Path Language (XPath) Version 1.0 J. Clark, S. DeRose, editors, W3C Recommendation 16 November 1999. Available at http://www.w3.org/TR/xpath.
[eXist]
eXist: An Open Source Native XML Database W. Meier, October, 2002. Available at http://exist-db.org/webdb.pdf.
[Bourret]
XML and Databases R. P. Bourret, July, 2004. Available at http://www.rpbourret.com/xml/XMLAndDatabases.htm.

XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.