XML 2003

Lessons Learned from Implementing a New XML Content Management System

Abstract

This case study describes how CQ Press retained Retrieval Systems Corporation (RSC) to convert its three directory databases (containing hundreds of megabytes of data), as well as its entire data collection and publishing processes, from an SGML-based to an XML-based system.


Table of Contents

1. Introduction
2. The Challenges
3. The Solution
4. The Results
5. Lessons Learned
5.1. Document Decomposition
5.2. Document Sizes
5.3. Validation Issues
5.4. Character Entities
5.5. Searching XML Content
Biography

1. Introduction

Retrieval Systems Corporation's customer for this project is CQ Press, a leading publisher of books, directories, subscriptions, and Web products on American government and politics, current events, and world affairs. A division of Congressional Quarterly, Inc., CQ Press focuses on three areas:

  • Government and Professional Publishing – comprehensive, up-to-date staff directories of the federal, legislative, and judicial branches of the U.S. government.

  • College Publishing – textbooks and reference works focusing on American government and international relations.

  • Library Reference Publishing – references, directories, and electronic products for public and academic libraries.

The three government staff directories – the Congressional Staff Directory, Federal Staff Directory and Judicial Staff Directory – are invaluable resources for government members and employees, the media, professionals dealing with the government, and others. Widely regarded as the most current, comprehensive, and reliable government staff directories available, the directories are subject to ongoing updates and revisions. These changes can range from minor (staff comings and goings and judicial appointments) to wholesale (a new presidential administration can virtually re-write the federal directory).

In 1998 and 1999, CQ Press converted its data to SGML. Up to this point, CQ Press had stored and maintained its data in text files with print composition markup. This conversion to SGML enabled CQ Press to output its data into multiple formats for print as well as the Web.

2. The Challenges

By early 2001, several factors led CQ Press to consider upgrading the content and publishing system for the three directories. First, information gathered from the 2000 census caused many congressional districts to be redrawn, meaning that large amounts of database information needed to be changed, moved or separated out to create new SGML files. The impending 2002 election and change of presidential administration compounded the data-transfer challenges.

Second, the CQ Press team found it difficult to move and reorganize large amounts of information within the 750 SGML files that made up the 60-megabyte data store.

Third, as the demand for more frequent updates increased, the problems associated with redundant information within the system grew. For example, biographical information for a federal judge, the secretary of an executive branch agency, or a U.S. Marshal might be listed in several different data files. In the existing environment, there was no way to cross-reference and standardize the data. Compounding the problem, related pieces of information about one person were often scattered: a member of Congress who belongs to several House committees, for example, might be listed under each committee in separate places in the directory.

CQ Press saw a need for a “central view” of that member where all pertinent information and relationships could be easily stored, accessed, and printed. They set out to find a vendor that could help them revamp the system to achieve the following goals:

  • Associate and cross-reference information across multiple files and even across different directories, to eliminate redundancies and enhance the consistency and accuracy of the information.

  • Reconcile all of the data changes and restructuring caused by the 2000 census.

  • Improve the system’s ability to move large groups of information, even entire government agencies. (This capability would later prove to be crucial, when the vast government reorganization creating the new Department of Homeland Security occurred.)

  • Enhance the system’s ability to publish via the Web.

  • Integrate contact management functions and data, thereby achieving efficiencies in tracking and obtaining updates.

Criteria for the vendor to be selected included expertise in database design, data conversion, content management, SGML and XML, and Omnimark.

3. The Solution

Following an extensive review process, CQ Press selected Retrieval Systems Corporation (RSC, http://www.retrievalsystems.com), a data architecture and information management firm that specializes in content, document, and information management services. RSC’s expertise in SGML and XML, as well as the company’s extensive experience working with publishers, law firms, and U.S. government agencies, were all factors in the decision.

With a contract in place by August 2001, the RSC team began design and development. A major technical goal of the architectural design was to “normalize” the data to eliminate redundancy and to allow a single instance of data – the formal name of a senator, for example – to be linked for publishing in many different places. To accomplish this, RSC engineers married XML and relational database technologies.

Working from these goals, the more detailed requirements specified by CQ Press, and a detailed analysis of the SGML content models (each of the three directories uses its own SGML document type definition, or “DTD”), the team defined sets of discrete XML components and the relationships among them. This led to several different classes of XML information, including organizations, buildings and addresses, individuals, biographies, and linkage classes that associate individuals with the various positions in which they appear in the directories.

Figure 1. CQ Press Editorial Interface: navigating the XML structure of the data stored in SQL Server using a browser

These various classes of XML information were stored as XML components in several relational database tables. Storing each piece of data only once is a traditional relational concept. Managing XML descriptions of individuals, biographies, and buildings and addresses as database rows merges the two technologies, eliminating redundant data entry and the associated potential for error.
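As a rough illustration of storing XML components in relational tables with a linkage class, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and every table name, column name, and sample record is hypothetical rather than taken from the CQ Press schema.

```python
import sqlite3

# Hypothetical schema: each XML component is stored once, and a linkage
# table ties an individual to every position in which that person appears.
SCHEMA = """
CREATE TABLE individual   (id TEXT PRIMARY KEY, xml TEXT NOT NULL);
CREATE TABLE organization (id TEXT PRIMARY KEY, xml TEXT NOT NULL);
CREATE TABLE position (
    individual_id   TEXT NOT NULL REFERENCES individual(id),
    organization_id TEXT NOT NULL REFERENCES organization(id),
    title           TEXT NOT NULL
);
"""


def build_example():
    """Create an in-memory database holding one person with two positions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO individual VALUES (?, ?)",
        ("p1", "<person><name>Jane Doe</name></person>"),
    )
    conn.executemany(
        "INSERT INTO organization VALUES (?, ?)",
        [
            ("o1", "<org><name>Committee on Finance</name></org>"),
            ("o2", "<org><name>Committee on Rules</name></org>"),
        ],
    )
    conn.executemany(
        "INSERT INTO position VALUES (?, ?, ?)",
        [("p1", "o1", "Chair"), ("p1", "o2", "Member")],
    )
    return conn


def positions_for(conn, individual_id):
    """Return (organization_id, title) pairs: a 'central view' of one person."""
    rows = conn.execute(
        "SELECT organization_id, title FROM position "
        "WHERE individual_id = ? ORDER BY organization_id",
        (individual_id,),
    )
    return rows.fetchall()
```

Because the person's XML lives in a single row, a correction to the name is made once and is picked up wherever a position row links to it.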

Figure 2. A Person, as stored in the database: viewing an individual, with an option to edit

A key tenet adopted by the design team was to retain the output processing from the original CQ Press system. The extensive SGML-based formatting and print-preparation software was functioning well, and the task of replacing that component would have been onerous. The new system was designed to allow for the export of data back to the original SGML content models. This resulted in the development of a set of XML DTDs that reflected the componentization of the XML data, but retained the original SGML content models for those components. For components, such as linkage classes, which had no direct SGML analog, new XML DTDs were developed.

The new system required an interface to allow CQ Press staff to create, edit, and manage XML content quickly and easily. A Web browser interface was developed to provide overall editorial control. It allowed for various logical views of the database, direct manipulation of the relationships among database components, selection of XML components for editing, and scheduling of batch processes. Corel® XMetaL® was selected to serve as the system's XML editor because it is highly customizable and offers an intuitive, word-processor-like environment for working with XML content, a tremendous benefit for staff transitioning to the new system. RSC and CQ Press were able to link the Web browser interface and Corel XMetaL easily, tailoring the interface to handle editing and publishing tasks unique to CQ Press.

Figure 3. Corel's XMetaL editing environment: tags-on view of an individual in XMetaL

In addition to Corel XMetaL, the developers selected several other third-party tools. These include Stilo Corporation’s Omnimark (http://www.stilo.com), for converting the original SGML data into the required XML formats and for creating SGML data for print publishing from the database; Caucho Resin (a Java servlet container) and Antenna House’s XSL Formatter, for XSL-FO processing and PDF creation; Microsoft SQL Server, for the relational database; and various XML and XSLT tools from the Apache Software Foundation. The interface and scheduling components were written as Java servlets. XSLT was used heavily for driving the Web browser interface and for reformatting XML components for output.

Figure 4. CQ Press Editorial Interface: print/publish options, in the browser

The project also provided the opportunity to examine and improve CQ Press’ workflow processes, such as how information for the directories is gathered and updated. In general, CQ Press staff would fax or mail a standard update sheet (or proof sheet to be OK’d) to an information source (e.g., a senator’s office). The source would review the information, note any needed information or changes by hand, then fax or mail the information back. Staff then needed to log returned updates and track outstanding forms for follow-up.

This process was slow at best, and the new system needed to address several challenges:

  • The entire update process was “outside” the core directory information system, meaning that a separate system was needed to track information on contacts and the status of their update.

  • Thousands of update forms were mailed for each directory several times a year. Matching forms to contacts was extremely time-consuming and labor intensive.

  • After September 11, 2001, the threat of anthrax being delivered via conventional mail heightened the need for alternative methods of requesting and delivering updates.

After 9/11, RSC and CQ Press worked quickly to implement integrated e-mail functionality to handle updates. Cover letters, update forms, and proof sheets are generated from the system and converted directly from XML to print-ready PDF and RTF formats for sending and collection via fax or e-mail. Forms sent in RTF format allow contacts to make and transmit their edits electronically.

Figure 5. CQ Press Editorial Interface: contact management, including generating forms for updates

4. The Results

The new system went online in December 2002, with the first revised directories published in February 2003. Initial response from end-users of the reformatted directories and Web publications has been positive. And according to Jennifer Ryan, director of Electronic Product Development at CQ Press, the editorial staff is very pleased with the new system’s efficiency and content management features.

Ryan cites the following as the system’s primary initial benefits:

  • Streamlining Administrative Processes – Streamlining of the update process has freed CQ Press staff from hours of labor-intensive tasks. The new system’s cross-checking capabilities virtually eliminate multiple data entry as well.

  • Maintaining Accuracy – CQ Press’ high standards and reputation for accuracy will only be enhanced by the new system. “The new capabilities make it easier for us to maintain our exacting standards and deliver the most accurate, up-to-date government directories available,” says Ryan.

  • Repurposing of Information – Because of the system’s XML architecture and searchability features, CQ Press is able to find, sort and classify information in new ways that enhance the staff’s research capabilities and, eventually, may result in new, innovative product offerings.

“We're still discovering new benefits to the system that will improve our processes and, more important, allow us to better serve our directory users,” says Ryan. “We're confident that we chose the right technology partner in RSC and that our enhanced content management and publishing system will serve us well today and in the future.”

5. Lessons Learned

5.1. Document Decomposition

All of the CQ Press source content was represented by large files in SGML format with lots of replicated content. A name, for example, was repeated in every organization in which that person was employed. Since part of the goal was to eliminate this redundancy, we decomposed the SGML and converted it to XML with references replacing the repetitious element structures.

These repetitious structures (such as names and biographies) were then stored separately and merged to make a single version of the structure. Similarly, the organizational structure of the data was decomposed hierarchically, replacing the single, large SGML structure with a number of linked XML structures containing references to children and parent organizations.
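The decomposition step described above can be sketched roughly as follows. This is an illustration only: the element names (`directory`, `org`, `person`, `person-ref`), the `idref` attribute, and the sample data are all hypothetical, and the original conversion was done with Omnimark rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: the same person repeated under two organizations.
SOURCE = """<directory>
  <org name="Committee on Finance">
    <person><name>Jane Doe</name><bio>Born 1950.</bio></person>
  </org>
  <org name="Committee on Rules">
    <person><name>Jane Doe</name><bio>Born 1950.</bio></person>
  </org>
</directory>"""


def decompose(root):
    """Replace each <person> with a <person-ref> and return the unique people.

    Identical <person> elements are merged into one stored component.
    The result maps a generated component id to its single stored element.
    """
    components = {}        # component id -> the one stored <person> element
    ids_by_content = {}    # serialized content -> component id
    for org in list(root.iter("org")):
        for index, child in enumerate(list(org)):
            if child.tag != "person":
                continue
            # Detach trailing whitespace so it does not affect comparison.
            tail, child.tail = child.tail, None
            content = ET.tostring(child, encoding="unicode")
            component_id = ids_by_content.get(content)
            if component_id is None:
                component_id = f"p{len(components) + 1}"
                ids_by_content[content] = component_id
                components[component_id] = child
            ref = ET.Element("person-ref", {"idref": component_id})
            ref.tail = tail
            org[index] = ref   # swap the full element for the reference
    return components
```

Calling `decompose(ET.fromstring(SOURCE))` leaves two `person-ref` elements in the tree that point at one stored person. Note that this sketch merges only exact duplicates; variant spellings would need separate handling.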

We learned, however, that there were variations in the data in each source representation (variant name spellings, titles, etc.). We took a lesson from object-oriented programming and created classes to represent each instance of these structures as it appears, inheriting the parent content but allowing for instance differences. Recomposing the data for extraction for publishing was often slow: a full extraction could run for many hours or overnight.
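The idea of an instance that inherits parent content but allows local differences might look something like the sketch below. The class names, field names, and sample values are hypothetical, and the real system expressed this in XML components rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Parent content, stored once for each individual."""
    name: str
    title: str


@dataclass
class Appearance:
    """One appearance of a person in a directory.

    Values in `overrides` replace the parent content for this appearance
    only; anything not overridden is inherited from the shared Person.
    """
    person: Person
    overrides: dict = field(default_factory=dict)

    def resolve(self, field_name):
        """Return the instance-specific value, falling back to the parent."""
        if field_name in self.overrides:
            return self.overrides[field_name]
        return getattr(self.person, field_name)


# Usage: one listing uses a variant spelling but inherits the title.
senator = Person(name="Robert Smith", title="Senator")
committee_listing = Appearance(senator, overrides={"name": "Bob Smith"})
```

Here `committee_listing.resolve("name")` yields the variant spelling, while `committee_listing.resolve("title")` falls back to the shared record.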

A key lesson we learned was to track the component structures that had been changed so that we could limit extraction and improve performance.
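One simple way to track changed components, so that only they are re-extracted, is a "dirty set" kept alongside the store. The sketch below assumes an in-memory store with hypothetical names; it is not the mechanism the project used, which the paper does not describe in detail.

```python
class ComponentStore:
    """Holds XML components and remembers which changed since last extract."""

    def __init__(self):
        self._xml = {}        # component id -> current XML text
        self._dirty = set()   # ids changed since the last extraction

    def save(self, component_id, xml_text):
        """Store a component, marking it dirty only if its content changed."""
        if self._xml.get(component_id) != xml_text:
            self._xml[component_id] = xml_text
            self._dirty.add(component_id)

    def extract_changed(self):
        """Return the changed components and reset the change tracking."""
        changed = {cid: self._xml[cid] for cid in sorted(self._dirty)}
        self._dirty.clear()
        return changed
```

With this approach, the cost of an extraction run scales with the number of edits since the previous run rather than with the size of the whole data store.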

5.2. Document Sizes

Even after the decomposition, many of the XML structures were quite large. This posed an editorial performance burden. We found that we needed to decompose to a lower level than we initially anticipated.

5.3. Validation Issues

Having decomposed the SGML and converted it to XML, we ran into a problem with validation. The original SGML DTDs, even after converting them to XML DTDs, were oriented towards large, autonomous structures. Also, they had no definitions for the reference elements we created as part of decomposition.

We had to modify the DTDs to loosen these constraints. However, this also relaxed validation for the editors, so we built some editing rules into the XMetaL editor to compensate. We kept the original SGML DTDs to ensure that the recomposed SGML was still valid.

5.4. Character Entities

The way XML parsers handle character entities posed a problem. A parser replaces each entity reference with its replacement text as defined by the DTD. This caused two problems: editors could not see or edit the character entities in the XML data, and the extract process failed because the replacement text an entity needed for one output format differed from the text required by another. We needed to preserve the character entity references themselves.

We adopted a technique of translating the character entities into processing instructions and stored these in the database. Then, for editing or extraction, we transformed the processing instructions back into entities. This approach was generally successful because there were no character entities in any of the attributes.
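A minimal sketch of this round trip is shown below. The processing instruction target `entity` and the function names are assumptions made for illustration; the paper does not say what form the stored processing instructions took.

```python
import re
import xml.etree.ElementTree as ET

# Named entity references such as &eacute; (numeric references are left alone).
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")
# The hypothetical processing-instruction form used for storage.
ENTITY_PI = re.compile(r"<\?entity ([A-Za-z][A-Za-z0-9.]*)\?>")
# The five entities predefined by XML need no DTD, so they are kept as-is.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def entities_to_pis(text):
    """Rewrite &name; as <?entity name?> so a parser will not expand it."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        return f"<?entity {name}?>"
    return ENTITY.sub(replace, text)


def pis_to_entities(text):
    """Restore <?entity name?> to &name; for editing or extraction."""
    return ENTITY_PI.sub(lambda match: f"&{match.group(1)};", text)


def parses_without_dtd(text):
    """Report whether the text is parseable with no entity declarations."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Processing instructions are not allowed inside attribute values, which is why the absence of character entities in attributes mattered: a text-level substitution like this one would otherwise produce XML that is not well-formed.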

5.5. Searching XML Content

CQ Press uses Microsoft SQL Server as its standard database server. We decided to store the XML structures in SQL Server in "text" fields to facilitate easy recomposition and editing. Although later versions have XML support, the version of SQL Server we used for this development didn't provide XML content searching on text fields. We learned that some form of content searching was required to allow editors to locate the XML nodes for editing, so we developed tools to extract XML data into separate columns to support searching.
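The approach of copying searchable values out of the XML into ordinary columns can be sketched as follows. The sketch uses Python's built-in sqlite3 module in place of SQL Server, and the table, the column names, and the choice of `name` as the searchable field are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def create_store():
    """Create a table with the raw XML plus a plain column for searching."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE component ("
        " id TEXT PRIMARY KEY,"
        " xml TEXT NOT NULL,"       # the full XML component, stored as text
        " search_name TEXT)"        # value extracted from the XML at save time
    )
    return conn


def save_component(conn, component_id, xml_text):
    """Store the XML and refresh the extracted search column alongside it."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name", default="")
    conn.execute(
        "INSERT OR REPLACE INTO component (id, xml, search_name) "
        "VALUES (?, ?, ?)",
        (component_id, xml_text, name),
    )


def find_by_name(conn, fragment):
    """Locate components by searching the extracted column, not the XML."""
    rows = conn.execute(
        "SELECT id FROM component WHERE search_name LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]
```

The extracted column has to be refreshed whenever the XML is saved, which is why the extraction is done inside `save_component` in this sketch.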

Biography

With more than 15 years in the industry, Richards provides a strong portfolio of skills in marketing, strategic business planning and development, operations, and customer service management. Richards founded CD/LAW, a legal publishing business that was eventually acquired by Lawyer’s Cooperative Publishing, a subsidiary of Thomson International, now West Group. She then went on to sales positions with Reed Technology and Poet Software. She joins RSC from SoftQuad Software, which was recently acquired by Corel Corporation.