Abstract
All major relational database platforms have recently increased support for storing and managing XML natively. There are a few reasons for this beyond the marketing factor. An obvious one is the need for data to be commingled with XML in a manner that supports the relationships between the two. If this is true, then one might conclude that organizations are finding an increased need to customize the relationship between structured content and relational data. As a content management consultant I have found that not only is this statement true in the publishing industry it is becoming quite common. From the need to streamline publishing to multiple sources in multiple levels of granularity all the way to easing the editorial pain normally encountered when managing content and associated metadata.
How do I know all of this? I have worked recently on a number of projects that have taken advantage of this type of implementation. The basis for this case study is to pass along some lessons learned (both on the positive and negative sides) in one particular instance in which a system was architected using Microsoft SQL Server to house large XML documents modeled in a relational structure.
Due to the unique nature of the processes defined to support the assignment of metadata and the different levels at which this task is accomplished within the XML document model, we were forced to architect a custom solution to meet the clients needs.
This case study will take you through the analysis process and requirements that drove us to this solution as well as the technical architecture and implementation of the finished product. We will address everything from the DTD to the assembly of the XML for eventual output from the system. These and many other topics in between will guarantee a real world point of view on the inclusion of XML support within the RDBMS world and why it is such an important feature to embrace.
Keywords
Table of Contents
Developing content storage solutions can be challenging for any organization. Moving between metadata and content markup can be difficult to maintain and without the proper attention from the start can quickly spin out of control. Over the past two years, the major relational database platform vendors have rushed to release new versions of their software that have inherent XML support. This is a big step for the content management administrator's ability to house content and metadata within a single storage solution.
Why do we first look to the relational database vendors and their ability to marry XML and metadata? Well for one thing, the ability to control the consistency of the values used when assigning metadata to content has traditionally been an easy fit within a relational structure. You typically have a set of values that coorespond to columns in tables within a database. What's the next logical step? Creating an XML repository within the RDBMS (relational database management system) or some document model knowledge wihtin the database engine so the management of the content within the data structure is easily comingled with the metadata.
Today, Oracle, IBM, and Microsoft all have a head start in this area. Oracle in particular has excellent XML support in their 9i product line. Specifically in their XMLDB repository and the XML parser that is accessible from within applications (PL/SQL or Java) running within the database engine. IBM's DB2 platform has a series of XML Extenders that provide similar functionality to Oracle's. Microsoft's XML support within SQL Server 2000 takes advantage of the native MSXML Core Services (formerlly the MSXML parser) to provide greater understanding of XML-based content by the database.
The common theme throughout the products just mentioned (and the technology industry in general) is the growing support for XML. Additionally, there seems to be a firm belief by all of the vendors involved that today's content is marked up in XML. Designing custom repository solutions is not always the best answer but having options like these allow for customization without having to recreate the wheel.
CQ Press (www.cqpress.com) is a Washington DC-based publisher and sister company to Congressional Quarterly (CQ). CQ Press has traditionally been the book publishing arm of CQ, focusing on the library, college textbook, and government and professional markets. However, over the past two years CQ Press has concentrated on developing new comprehensive electronic products that respond to librarians’ and researchers’ desire for web-based content. As part of this effort, CQ Press has moved some of its traditional print publications online, and marrying those sources with new original content created specifically for the web. The content originated in various parts of the organization and included a wide array of formats. In order to accomplish this goal of integrated all the varied content into a single environment, existing content needed for the new electronic products was converted to XML. This case study details the issues faced while trying to organize this content based on a standard DTD and designing a relational content repository around it.
Storage Requirements
The requirements for how this content was organized, stored, and ultimately interacted with was segmented into three main areas. The first area involved the document model. The DTD was already used in production and part of the existing web publishing process workflow. The second area dealt with the metadata management within the storage architecture. The storage architecture must support the metadata that managed the publishing process and accompanying product definition. The third and final area pertained to the content repository platform. The platform must be based on a relational model and use the CQ Press standard Microsoft SQL Server.
One of the initial challenges faced when addressing the content storage requirements for CQ Press is a very common one. The metadata contained within the relational database strucuture must be married with the XML documents at various levels within the document model heirarchy. By defintion, these two data structures are very different. What made this project a challenge compared with past projects was the use of the various nodes in the document structure in different web-based products on different levels. The content of a book is segmented into sections of varying sizes. CQ Press assigns metadata to each section, allowing it to be individually selected for publication to CQ Press web products. This makes the management of metadata a very difficult task and one that required a great deal of focus on the database design and specifically how and on what level XML documents were stored within the database.
The metadata that comprised the supporting architecture within the database was based on a number of descriptive and workflow entities. At the core of these database tables were the relationships between CQ Press electronic (or more specifically, web) products and the structure of the XML files where definition of the web document was created. In other words, the content repository must be built on a SQL Server 2000 database platform, marry the document model defined by the DTD and the metadata used to describe web document defintion based on product, and take advantage of as much internal XML knowledge that the data store can offer (in order to provide an easily maintained and supported code base).
The DocBook DTD
The CQ Press DTD was based on the widely used Docbook model. It is a very detailed document model that provides for just about any possible print publication markup scenario. The original goal for the DTD was to provide a document model for the conversion of book-based content (sometimes from Quark, sometimes by conversion house from print) to XML. Expanding on this point, one of the core requirements for the system was to keep the DTD intact as much as possible without compromising the extensibility of the end product. One of the main reasons for this requirement was a key project goal that there be no impact on any of the existing processes in place for distributing content to electronic products. The byproduct of this decision was the preclusion of any additional DTD fragments to manage the migration from document model to relational architecture. This meant that content transformation programs in place before the content management system was created were not to be altered as part of this development cycle.
The DTD itself is modeled as a tree that represents the basic publication structure for a book that was customized by CQ Press to support the tagging of the nodes at various levels with web product output instructions. The output instructions set the value of web document titles and keywords for indexing based on electronic product.
As you can see from the tree view of the DTD structure for the main nodes, the highest level of the document is the chapter. The section level introduces metadata that suppports such key data as the orginal print source for the content, author information, as well as allowing for web product defintion to occur at this level. These elements are more data centric (as opposed to document centric) and lend themselves nicely to a design that allows for the commingling of XML and relational data elements. The same basic flow is followed in the nodes below the section. Certain elements allow for the definition of web documents to occur and may be tagged accordingly. The end result is a DTD that was structured originally for print conversion to XML and now has multipurpose in the electronic publishing world.
Storage Architecture The database architecture that was defined for this solution comprised key entities that represent the structure of the document model with the ability to store the XML that makes up the core of the file. As mentioned previously, the assignment of metadata to elements within the web document skeleton at various levels was the key component in the architecture. By defining a list of elements within the DTD that represented "documentizable" nodes (or sections of the document that could be related to a product and published to the web) and adhering to those definitions, the table structure easily fell into place.
Again, the tables were based on the knowledge of the document outline upon import into the system. All documentizable elements were inspected via Microsoft's DOM parser and recorded in a table called "PubNodes." This table held the parent-child relationships represented within the hierarchy of the XML document. The PubNodes recorded in an XML file provided the gateway for editors to assign metadata to those nodes essentially making them web documents. These relationships that are managed within the table structure are then assembled during the output process into a valid XML document with ties to (potentially) many different web products.
The "PubNodes" table provides the structure for relating an XML node to a file, but what about the other pieces of metadata that describe that document in the context of the product? These entities were split between "predictable" and "unpredictable" data elements. CQ Press had some elements already defined in the DTD that would be beneficial to add or edit but were not key to the structure of the repository. It was decided that these elements could be stored in the database in a very generic fashion, before allowing for the addition of elements of this type in the future without having to extend the system to support them. The structure that housed these "unpredictable" data elements was made up by the "ProductDocumentMetadata" and "ProductMetadataRules" tables. These tables contained the rules that governed how and of what type this data would be stored as well as the actual values associated with them. There is a parent-child representation of this data and its cooresponding values in ProductDocumentMetadata. This structure directly mirrors the XML based hierarchy the data came from. The ProductMetadataRules table provides a tie between DTD elements yet to be defined and the data structure. There are some product-specific data elements that are sure to be added over time that have yet to be defined. This table structure provides the ability to add this to the storage architecture without the need to make code changes to handle it.
The following image details the main database entities that embody the content structure in the relational model.
XML Management within the Database The database itself handled a great deal of the XML management. The column definition for the XML files was of a text datatype. This allowed for very large XML files to be managed within the data structure. To this point, the XML files were stored in their entirety within the "Files" table in the database. There no need, from a technical standpoint or a business requirement, to break the files into smaller XML chunks. The files were parsed (outside the database) and "PubNodes" recorded in the appropriate database tables. All metadata describing those pubnodes was also recorded in the database and stripped from the file.
The process by which the metadata was stripped from the XML file and placed in the "ProductDocumentMetadata" table was the OPENXML function within SQL Server. This function exposes the Microsoft XML Core Services parser via the database engine for parsing XML within Stored Procedures. The OPENXML function takes a parameter of a varchar data type and uses the string representation of the XML to be parsed and accessed in T-SQL column and row-based defintion. This funciton became extremely useful during the parsing of the unpredictable data elements and retaining the hierarchy they represent.
This system now provides CQ Press flexable XML-based management of its content for editorial workflow and web publishing processes. The benefits of this architecture have already been realized by the ability for the CQ Press editorial staff to markup new content and easily feed existing products in a streamlined workflow. In addition, the use of Microsoft-based technologies has allowed the organization to remain consistent in their development approach and platform choices.
Content architecture is challenging when working with data that is at the beginning stages of the repurposing life cycle. This project has allowed CQ Press to really begin to think about the vast choices they now have in terms of content distribution. The design of this system will ultimately evolve based on new product choices and technological options that become available in the future. However, the baseline that has been created by the systematic approach taken when organizing CQ Press content will make this process manageable and greatly expand future choices.
![]() ![]() |
Design & Development by deepX Ltd. 2002 |