XML 2003 logo

Enterprise Database Design for XML

Abstract

The abstract was not available at the time the proceedings were created. Please check an updated version of the paper abstracts at the conference proceedings web site.


Table of Contents

1. Introduction
2. Moving XML into the Back End
3. Relevant XML Standards
3.1. XML CORE and NameSpaces
3.2. XPath, XSLT, SQL/XML, and XQuery
3.3. DTDs and XML Schemas
3.4. DOM, SAX and JAXB
4. XML Applications
5. XMLTypes and Databases
5.1. XMLType CLOB
5.2. XMLType Views
5.3. XML Stored in Native XMLTypes
6. Conclusion
Biography

1. Introduction

Choosing the right XML storage model for your application can mean the difference between success and failure. Like many of the protocols, filesystems, and technologies of the World Wide Web, Extensible Markup Language (XML) has gone from humble beginnings to widespread implementation in a comparatively short period of time. Initially popularized in the Web publishing industry as a document sharing technology, XML has evolved into an industry-wide data communication and storage medium. Whereas Web-based documents once consisted of little more than text and images, these documents now based on XML; have become the medium of choice for delivering data pulled from databases in the back end to applications and documents in the middle tier and front end. To respond to the growing demand for a single source of truth, XML itself is beginning to move into the database, and is becoming integral to the structure of data storage.

2. Moving XML into the Back End

Still, the challenges of storing and manipulating XML in a database are many, and can be daunting for those of you out there who know it "has" to be done, but don't exactly know how. Databases are relational and XML is hierarchical, so until recently there has been no simple, elegant way to integrate the two. Traditionally, developers have had two choices: either use a parser to deconstruct the document data into relational data and store it as such in the database, or store the entire document as a text file, preserving its text-based structure. In both cases the direct access to this content limited to non-XML interfaces as the underlying storage model is unaware that the content is XML. What is therefore needed is a storage model that can expose XML functionality directly to eliminate the serialization and re-parsing required by the traditional models.

The important thing to remember is that moving XML into the back end with the database is not a one-size-fits-all process. Each storage option has its advantages and disadvantages. Knowing which XML storage model is best for the purposes of your application, rather than modeling your application after an XML storage model, is critical.

The objectives of this paper is to discuss the various types of models based upon the XML application space thus helping you determine which type to use. In this paper, I'll explain both document "shredding" (deconstructing an XML file into relational data and then operating on it with SQL) and Character Large Object (CLOB) storage. Then I'll introduce you to the native XMLType which represents a combination of the two. I'll cover both the advantages and disadvantages of each storage model and discuss the types of applications each is suited for. In the end, you'll get a good idea of not only which storage model is best for your application, but why.

3. Relevant XML Standards

Before discussing XML applications and their various storage model requirements, I need to introduce the set of XML standards that developers not only need but will expect to be available. It is not sufficient to simply store and retrieve XML as if it were simply a text document as that pushes all of the processing into the middle tier and throws away the power of the enterprise database. Let's look briefly at the XML family of standards in the context of their being exposed from a persistent store.

3.1. XML CORE and NameSpaces

Once a storage model becomes "XML aware" it must differentiate between content that is XML and simply marked-up documents such as HTML pages. In order to do this it obviously must understand the nature of XML and what differentiates it. The W3C XML CORE specification defines what constitutes an XML document and thus any database that professes to store XML must be able to enforce this specification. One particularly difficult requirement is the required support for Unicode as this can double the size of the storage model and any buffers.

Many consider the W3C XML Namespaces specification as a "necessary evil" to move XML into the business world. This support is necessary in order to specify and distinguish the XML meta data and is a necessary component for data type support as specified by XML Schema. Since namespaces are not simply prefixes to element and attribute names, but can also be inherited without prefixes, the result can be difficult to maintain without access to the data model of the document.

3.2. XPath, XSLT, SQL/XML, and XQuery

Databases store data to be able to retrieve it. While this is an obvious statement it bears re-stating as XML can be considered a container for data from which it is to be retrieved. The W3C XPath specification was developed to provide and navigational syntax that could query into the structure of an XML document. XPath is the basis for both XSLT and XML Query - two languages for querying an XML document or collection and returning or processing the results. For a database to properly support XML, it must expose functions that can take XPath selectors and predicates otherwise the processing will need to occur outside the storage model where it would be less efficient. Relational databases need to virtualize the XPath structure through either object support, sophisticated indexes or both.

Databases also store data in rows and columns as that has proven to be an efficient storage model for data-intensive applications. To serve as a data source for an XML application you need to be able to generate XML documents with the data from these tables. A new ISO standard, SQL/XML provides a set of functions that extends SQL (i.e. XMLElement(), XMLForest(), XMLAttribute(), etc.) to be able to create XML structure with a SQL query. Thus supporting this standard has become an important requirement when using relational data in an XML application.

3.3. DTDs and XML Schemas

The structure or data model for an XML document can be defined by Document Type Definitions(DTD) or XML schemas(XSD) respectively. These provide the capability to validate an input XML document to insure its integrity and completeness for further processing. While DTDs focused on defining the meta data and structure of an XML document, XSDs take this one step further and introduce data types into the model.

Several challenges exist for databases to have their schemas support DTDs and XSDs. Both models must be supported as they serve different applications, yet both focus on the input XML where the database needs a model that delivers performance on the retrieval side. Additionally, DTD support means support for entities which either get expanded and thus are lost, or are maintained but need to be linked. Finally, XSDs introduce not only over 40 simple data types that need to be mapped to the SQL ones but also complex types which mandated some level of object support.

3.4. DOM, SAX and JAXB

The W3C Document Object Model(DOM) specification, Simple API for XML (SAX) and Sun Java Architectural Binding for XML (JAXB) all expose the content of an XML document to applications through a standard set of APIs. While these have not been considered client or mid-tier interfaces, if they could operate on the storage model directly sigifcant efficiencies are avaliable especially on large documents or result sets.

4. XML Applications

To set up a context for the XML storage models in the next section I want to introduce three broad categories of XML applications - data-based, content-based, and instruction-based. Data-based applications are generally found in business applications Specifically, business-to-business and application-to-application are the prime users. Content-based applications are interested in authoring and publishing to and from XML documents to serve the growing variety of electronic devices and paradigms. Finally, instruction-based XML is being used to configure and manage applications and components replacing more fragile file formats that depended solely on name-value pairs or delimiters.

5. XMLTypes and Databases

XML is hierarchical and enterprise databases are relational. While object databases have been used to store XML they have not been demonstrated to support all of the XML applications' requirements at the enterprise level of performance and scalability. Thus much work has been done in the last few years by the major relational database companies to bring their relational power into the hierarchical space. Let's take a look at the various techniques that have been developed.

5.1. XMLType CLOB

In some sense, using an XMLType CLOB is the simplest way to store an XML file. It treats the XML document as exactly that: a document. The file is preserved as a complete text document in storage (with whitespace, comments, and so on intact), by storing it simply as a single, character-based entry in the database. Consequently, files of any size and depth can be stored as long as they are well-formed XML. However, although you may have defined data types in the document for validation against a schema, the data is not typed in the sense that it can be manipulated or retrieved using SQL queries.

Because of this clear limitation, the document must be searched using a text search engine in contrast to SQL queries, which can leverage the functionality of query rewrites, functional indexes, and do on. Efficiently updating the document is limited as it involves pulling the entire file, making the change, and replacing it. If, however, the primary purpose of your XML documents is to encapsulate content in a structure for transformation (as might be the focus in Web publishing, content management, document archiving, and so on), and most changes to content are made at a document level, then the XMLType CLOB is your best choice for XML data storage. In these cases, you aren't likely to need a SQL context for your data, and you get a guarantee of byte-by-byte fidelity.

5.2. XMLType Views

An alternative to the CLOB XML Type is to create a virtual XML document on top of a set of relational tables as an "XML view." This approach permits a user to insert, update, and delete data in the XML file just as though it were SQL data. Because you are defining a virtual XML document on top of the data store, you aren't limited to just one representation of the data as with CLOB; rather, you can have multiple XML "documents."

Storing data in relational tables also means that you can update individual elements without pulling the entire document. In general, with XMLType Views, you get all the advantages and efficiency that come with SQL operations, because the relational database engine is optimized for these kinds of retrievals. Finally, you can use the SQL data type operations on the XML data types, instead of treating them simply as text. (For example, a date can be treated as a true date from a SQL standpoint, rather than simply a string of characters.).

This approach has some disadvantages, however. Defining XMLType views where the structure is deep (that is, deeper than 8 to 10 tiers) can degrade performance significantly. Inserting and updating views require instead-of-triggers and are more difficult to maintain as you will need to include application code in the trigger. The great virtue of the CLOB approach is the complete preservation of structure and byte-for byte fidelity. But with the XMLType View you lose the guarantee of strict document order, as many items (such as comments and processing instructions) will have disappeared while shredding the document data into tables.

None of this matters, of course, if you are shredding your data for use by data-centric applications that don't "care" about document structure. If your goal is to move data in and out of the database and keep metadata intact as the only context and have all the advantages of DML operations, XMLType Views is ideal: You start from the database schema and generate the corresponding XML schema. Therefore, there is no concept of document order; any required section ordering can be explicitly defined via ROWIDs or other application-specific methods. Functions are provided in the database and XDK to create XML schemas automatically from XMLType Views.

This method is especially useful if you are working with several XML Schemas of differing tag names and structure and do not want them to define your underlying database schema (as when extending an existing legacy database application to now support XML while preserving its legacy functionality). In a typical example, you can repurpose the same data store for a variety of customers who are dictating formats and templates to you. You simply define an XML view for each customer and when you retrieve or insert the data to that view it will have the format appropriate to the corresponding customer.

5.3. XML Stored in Native XMLTypes

It is possible to have your cake and eat it too, at least to an extent: You can store your document as a Native XML Type (see below) in Oracle's XML DB repository, which will preserve byte-by-byte document fidelity and also shred it into SQL tables. This approach gives you complete validation while allowing you to do all the DML operations on the document that you get with XML Views. You still get fine-grained data management, and can create multiple views and documents based on the SQL data. When your XML schema is registered, you store your XML data in your database, by simply inserting an XML document file using SQL, PL/SQL, Java, FTP, HTTP, or WebDAV. Getting XML data out of your database can be as simple as executing a SQL query or reading a file using one of those Internet-standard protocols. This functionality is made possible through the built-in query re-write support eliminating the need for the instead-of-triggers.

Besides the ease of working with XML in either its document or data form, you get an enhancement to the W3C-standard Document Object Model (DOM) APIs when programmatically accessing it. When parsing XML from a file you can build an in-memory tree representation of the entire file in order to manipulate it. (This approach is shared other XML processors such as XSLT.) With the "virtual DOM" feature of the native XMLType described below, you build the tree on demand not only preserving resources when using DOM APIs and XSLT, but in cases of large documents or row sets, your application simply works instead of crashing.

There are many advantages to the XML DB repository, but it isn't right for every application. Overhead is involved in maintaining the relationship between the full document and its shredded data. However, the biggest problem comes with schema evolution. Because the document dictates the storing process (mapping which data to which tables), when you want to change the document schema, it is no longer a simple abstraction, but is intimately bound to the database schema whose structure it dictated. That means you can't do most nontrivial changes of either the database or the document schema, without having to export all the data and re-import it into the database.

That could be a real nightmare if you are working with multiple industry schemas and you have to dump your data every time there's a change. If there is no need for document fidelity, these changes can be abstracted from your database schema using the XMLType View storage model described previously.

6. Conclusion

I've compared and contrasted the three different DB storage options for XML data implemented to give you a better idea of which is most appropriate for your purposes. Choosing carefully according to your applications' needs rather than on the basis of what is lately being billed by other companies as the be-all, end-all solution to integrating XML documents and databases is very important. This choice can literally mean the difference between merely creating a successful prototype, and bringing it successfully to production.

Biography

Mark Scardina is Oracle's XML Evangelist for Server products and is the Group Product Manager for the CORE and XML Development Group tasked with providing the XML infrastructure components used throughout the Oracle product stack including the XML Developer's Kits. Mark chairs Oracle XML Standards committee and is an editor on the W3C XSL Working Group. He is a frequent speaker at industry trade shows and conferences and is co-author of The Oracle9i XML-Handbook.