XML 2001 logo

Low-Cost, Flat-File XML for the Masses

Jason Willebeek-LeMair <jasonlemair@insightbb.com>

ABSTRACT

When you hear about XML publishing, you mostly hear about databases, workflow tools, and content management systems. These are typically costly systems aimed towards the information management needs of larger enterprises, where the sheer volume of information pumped through these systems provides a fairly rapid return on investment. This fosters the perception that you need one of these complex, expensive, enterprise solutions to take advantage of the modularity and flexibility of authoring in XML.

That is simply not true. You can realize the benefits of publishing from modularized XML, without the expense of an enterprise publishing system, by implementing the authoring environment on top of nothing more than your operating system's file system. Although this environment is not adequate for enterprise publishing needs, it is more than adequate for the needs of small writing teams, businesses with a limited number of related products, proof-of-concept demonstrations, and even home users.

The AIC documentation group at Cisco Systems has implemented such an authoring environment. We have been able to reuse and re-purpose modular, XML-based information without implementing a database back end. By examining how the AIC team implemented XML in a flat-file environment, you will see:

The AIC documentation group consists of seven writers who were originally dedicated to a single product, CSPM. However, by implementing an Extensible Markup Language (XML) publishing solution, the group has been able to increase their efficiency and now supports five related products without increasing the number of supporting writers. Additionally, what started out as a local effort has led to involvement with a corporate-wide XML publishing initiative.

This case study shows how the AIC documentation group implemented a file-system-based XML publishing solution and shows some of the benefits and drawbacks of doing so.

1. Background: Leading Up to XML

1.1. The Problem

To understand the challenges that faced the AIC documentation team, you must understand the problems that led to our XML solution:

1.1.1. Deliverables And Formats

The AIC documentation group supports several software products that use a variety of information delivery formats, including:

Many times, information is shared between formats. For example, field definitions used in an HTML Help system are also used in a WinHelp system to provide field-level, context-sensitive help.

1.1.2. Historical Solutions and Process Failures

Two different processes were used to create the various formats:

Both cases relied on FrameMaker documents to convert to PDF output using the standard Cisco publishing system, which was also responsible for generating yet another HTML format that is posted to the Cisco.com website.

The primary process failure in both cases had the same cause. Once we converted to a particular deliverable type, such as FrameMaker or raw HTML, it was easier to make changes in the multiple source files than to perform another round of conversions. This failure to maintain a single source resulted in "stale" information (information that was updated in one output format but not in another) across each type of output, frustrating writers and editors alike.

Another process failure was our inability to train writers and editors to use the "hybrid" HTML format that produced the HTML Help system. As a result, those who understood the latest conversion scripts had to do clean-up work at the end of each release to convert any new source to all of the required formats.

The following examples show some of the code used in the "hybrid" HTML markup. The first example demonstrates the use of the class attribute to "type" the HTML elements:

<table border="0" cellpadding="4">
<tr valign="top">
<td class="note" width="65"><b>Note:</b></td>
<td class="note">Note Text</td>
</tr>
</table>

This example shows the code needed to embed index entries in an HTML file for use with the Microsoft HTML Help compiler:

<object type="application/x-oleobject" classid="clsid:1e2a7bd0-dab9-11d0-b93a-00c04fc99f9e">
<param name="Keyword" value="Log on to Cisco Secure Policy Manager dialog box">
<param name="Keyword" value="Log on to Cisco Secure Policy Manager dialog box,field values">
<param name="Keyword" value="Cisco Secure Policy Manager,logging on to"></object>

The final example shows the misuse of the HTML "meta" element to embed help context IDs in an HTML file:

<tr valign=top>
<meta map="WinHelp" content="HIDC_OPEN_LOCAL_OPENBOX, HIDC_OPEN_BOXSERV_OPENBOX, HIDC_OPEN_REMOTE_OPENBOX, HIDC_OPEN_SERVER_OPENBOX, HIDC_OPEN_PORT_OPENBOX">

Creating these structures in a commercial HTML editor is not a straightforward process, and we were seeing a lot of errors, such as missing class attributes, in the HTML markup.

Finally, the heading levels used in the help system did not correspond to the heading levels used in the hardcopy. The information was simply arranged differently. At first, we worked around this problem by using a cascading style sheet to make all of the HTML heading levels appear the same, then we set the heading levels to those used in the books. This trick worked until the authors became used to reusing information from other areas. What was a first-level heading in one book was not a first-level heading in another. We started seeing conflicts.

1.1.3. Dependencies

The primary dependency we faced was that the help systems had to build with the products they accompanied, because the install scripts picked them up during the production phase. Therefore, we could not simply pull the help systems out of the build until we were finished and ready to reconvert; doing so broke milestone builds of the products that we supported and delayed product testing.

In addition, we saw on the horizon the need to support another proprietary help system that would not be able to use the HTML markup that we had developed in Case 1 or the HTML that resulted from the FrameMaker conversion in Case 2. Eventually, that source information would have to be ported to another help system format.

1.1.4. Tools and Training

To produce the various forms of documentation supported by the AIC group, the writers had to know:

Bringing a new writer up to speed on these tools, and the processes we followed to create the deliverables, was an arduous task, taking almost 6 months and very heavy editing cycles. We did not have an effective method of isolating the technical intricacies of the solution to just a few programmer/writers. Therefore, new staff could not include entry-level writers.

1.2. The Solution

We knew that there had to be a better way to reuse the information that we developed and support the multiple output formats currently in use and any future formats. XML had just arrived on the scene, and I was given the task to see if it would be feasible to move our information to XML and to automate much of the deliverable output process.

When we first started to investigate using XML to solve our information delivery issues, none of us had extensive experience in developing a publishing system based on a markup language, although several of us had used SGML extensively. All of our research pointed towards the need for the three main components of any XML-based publishing system: authoring, content management, and content delivery.

However, this was a local effort meant to alleviate the problems we encountered in our authoring process. At the time, a content management system was beyond the scope of our needs and budget; we did not need an enterprise-level solution for a group of seven writers.

So, we developed a system that used the native file system of our file server as the content management component of our XML solution. There were some trade-offs, of course, but doing so also taught us exactly what we needed in a content management system.

2. Defining the Requirements

There was a strong temptation to start developing the system immediately. XML was a relatively new technology, and our writing group thrives on being on the cutting edge of technology. However, reason prevailed, and we set forth to define the goals and requirements of the system first.

Our goals were simple:

Moreover, our overriding goal for the system was to create an authoring environment that was less complex than the one the authors were currently using. If we could not make the system easier to use, then we would not do it at all. We wanted to make the writers' jobs easier, not more difficult.

The requirements were slightly more difficult to define. Because we were not using a content management system, we focused on the authoring tool. We realized that the authoring tool would need to assume some of the functions of a content management system, while the other functions, such as generating unique IDs for the XML fragments and documents, would have to be assumed by our internal processes. Therefore, we defined these requirements:

From the list of several authoring tools that met these requirements, we chose XMetaL by SoftQuad. XMetaL provided an interface much like the HTML authoring tool the authors were accustomed to using and provided a rich programming environment so that we could develop custom scripts and programs to take the place of features that would be provided by an XML content management system.

3. Developing the Architecture

Once we determined our needs and that XML met those needs, it was time to develop the architecture. We were faced with the choice of using an existing XML DTD, such as DocBook, or creating our own. We also had to figure out how to manage our XML fragments in a file-system-based work environment.

3.1. Developing the DTD

Developing the DTD was fairly easy. We already had a well-defined and chunked information architecture that we used when authoring in HTML (see Figure 1). We decided that, because the authors were already familiar with this architecture, it would be best to codify it in an XML DTD instead of using an existing DTD.

Figure 1: The Original Information Architecture

We ended up with a three-tiered, topic-based DTD architecture (see Figure 2). We define a topic as a heading plus accompanying text and graphics. This architecture has been surprisingly flexible in meeting the needs of a diverse group of products.

Figure 2: The Tiered Architecture of the CiscoBook Family of DTDs.

The bottom tier of our information architecture, the Core DTD, contains the body-level elements, such as paragraphs, lists, index entries, tables, and so on. For this portion of the DTD, we borrowed heavily from HTML. The authors were already familiar with HTML markup, so we thought that making the change from authoring in HTML to XML would have less of an impact if they used familiar tags. However, we did simplify several aspects of the base markup. For example, instead of using a complex table structure for notes, we created a simple <note>Note Text</note> tag that is transformed to the correct format.
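To make the note simplification concrete, the following minimal sketch expands a simple note element into the table-based layout shown in the earlier "hybrid" HTML example. It is written in Python purely for illustration and stands in for whatever stylesheet performs the real transformation; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def note_to_html_table(note: ET.Element) -> ET.Element:
    """Expand a simple <note> element into the legacy table-based HTML layout."""
    table = ET.Element("table", {"border": "0", "cellpadding": "4"})
    row = ET.SubElement(table, "tr", {"valign": "top"})
    # First cell: the fixed-width "Note:" label
    label_cell = ET.SubElement(row, "td", {"class": "note", "width": "65"})
    ET.SubElement(label_cell, "b").text = "Note:"
    # Second cell: the author's note text
    text_cell = ET.SubElement(row, "td", {"class": "note"})
    text_cell.text = note.text
    # Carry over any inline child elements unchanged
    text_cell.extend(list(note))
    return table


source = ET.fromstring("<note>Note Text</note>")
html = ET.tostring(note_to_html_table(source), encoding="unicode")
print(html)
```

The author types one element; the class attributes and table scaffolding that were easy to get wrong by hand are generated mechanically.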

The middle tier of the DTD contains the topic-specific modules. This tier relies on the Core DTD to provide the basic body markup, while providing the topic-specific markup for the information type. For example, the procedure.dtd contains step and result elements, which are not found in any of the other topic types.

The middle tier was also defined as our lowest level of reusable content. We used XML external parsed entities to include XML fragments based on the middle tier in the documents defined by the top tier, the deliverable DTDs. However, this decision led to additional difficulties. There were lower-level structures that we wanted to reuse between information types, such as using the expected values of our field definitions in the steps of our procedures. It also prevented us from being able to edit a topic as a standalone document, since an external parsed entity cannot contain a document type declaration. This limited our ability to nest XML documents to one level of inclusion. We later changed this decision and are now using a different method of inclusion that allows us to edit and reuse smaller units of information.

Finally, the deliverable DTDs in the top tier contain the markup used to arrange the topics for a particular document instance. For example, the book DTD contains markup for arranging topics into chapters and chapters into books. The HTML Help DTD contains the markup used to arrange topics into a help system, as well as additional settings used by the help system.

Heading levels are derived at the top tier. All topics contain a single "title" element with no hierarchical information attached. The book DTD contains enumerated levels, one through four, for arranging the topics into a four-level deep hierarchy. (Four levels are all that the Cisco templates support.) The HTML Help DTD allows unlimited nesting of topics, to support the behavior of the HTML Help engine. In this way, heading levels are derived from the topic's placement within the deliverable XML document, eliminating the heading-level conflicts that occurred when the level was explicitly defined at the topic level.
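The idea of deriving a heading level from placement rather than storing it in the topic can be sketched in a few lines. The example below is an illustration in Python, not the production XSLT, and the "help" and "topic" element names and the "ref" attribute are assumptions rather than the actual CiscoBook markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical deliverable document: topics are nested, but carry no level.
DELIVERABLE = """
<help>
  <topic ref="overview">
    <topic ref="logging_on">
      <topic ref="logon_fields"/>
    </topic>
  </topic>
  <topic ref="glossary"/>
</help>
"""


def heading_levels(element, depth=0):
    """Yield (topic reference, heading level) pairs; the level is the nesting depth."""
    for child in element.findall("topic"):
        yield child.get("ref"), depth + 1
        yield from heading_levels(child, depth + 1)


levels = list(heading_levels(ET.fromstring(DELIVERABLE)))
for ref, level in levels:
    print(f"h{level}: {ref}")
```

Because the level is computed at output time, the same topic can be a first-level heading in one deliverable and a third-level heading in another without any change to the topic itself.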

This three-tiered structure has proven to be incredibly extensible and flexible. When a new topic type is needed, we can easily create a DTD module that contains only the unique markup for that topic type and insert that module in the middle tier, drawing from the Core DTD for the common markup. For example, during the project, we discovered that we completely neglected markup for a glossary. We were able to quickly identify the unique markup required for a glossary and add it to the DTD without disrupting any of the other topic types. More recently, we were required to support another output type, a proprietary help system. Again, the unique requirements for that help system were identified and added to the top tier without disrupting the current production processes.

3.2. Developing the Content Repository

Because we lacked a database, we faced several logistical challenges in developing our content repository.

We needed to make the XML fragments easy to find since we did not have built-in search and metadata storage capabilities. We worked around this problem by creating topical sub-folders and a naming standard that reflected the contents of the file. Also, within such a small group, it is easy to ask the person responsible for a feature if they had created a specific piece of information.

We needed a fairly static structure to our XML repository. The XSLT we use to generate our deliverables uses relative paths for items such as embedded stylesheets. We needed to make the structure of our repository consistent so that items, such as our stylesheet links, were not broken during the transform.

We ended up creating a fairly flat file structure (see Figure 3). New folders and files can be added as long as they maintain the same relative path to the DTD and XSLT files.

Figure 3: File Structure
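The constraint that every file must keep the same relative path to the shared DTD and XSLT files lends itself to an automated check. The following Python sketch is one possible way to do that, under assumed names: the folder layout, the file names, and the "../../dtd/core.dtd" path are all hypothetical and are not taken from Figure 3.

```python
import tempfile
from pathlib import Path


def broken_references(repository: Path, relative_targets: list[str]) -> list[tuple[Path, str]]:
    """List (fragment, target) pairs where a relative path fails to resolve from the fragment's folder."""
    problems = []
    for fragment in sorted(repository.rglob("*.xml")):
        for target in relative_targets:
            if not (fragment.parent / target).exists():
                problems.append((fragment.relative_to(repository), target))
    return problems


# Build a throwaway repository with a hypothetical layout to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "dtd").mkdir()
    (repo / "dtd" / "core.dtd").touch()
    (repo / "topics" / "procedures").mkdir(parents=True)
    (repo / "topics" / "procedures" / "logging_on.xml").touch()
    # This file sits one folder too high, so the shared relative path breaks.
    (repo / "topics" / "misplaced.xml").touch()
    problems = broken_references(repo, ["../../dtd/core.dtd"])

print(problems)
```

A check like this would flag a fragment saved at the wrong depth before a transform fails on it.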

3.3. Developing the Workflow

The final step was to develop the workflow that the writers would use when developing new or modifying existing content. We wanted to maintain a workflow similar to the one they were used to. Previously, they developed topics in HTML and then added them to a help project file. Unfortunately, because of the embedding scheme we chose for our topics, individual topics could not be edited as standalone documents. They did not contain a document type declaration, which was required if our authoring tool was to provide context-appropriate tags and authoring guidance.

To work around this problem, we developed a new top-level structure called a workbook. All a workbook did was allow the writers to reference multiple topics from a single document. Then, they could open those documents from within the workbook in the authoring tool. The authoring tool would use the document type declaration of the workbook for the embedded fragments.

To create new topics, the writers could copy a template to the appropriate directory in the file system, rename the file according to stringent guidelines, then include that file in a workbook. They would then add the new content to the workbook.

The workbook turned out to have additional benefits. The writers were able to create workbooks based on products, on technologies, or whatever organizational scheme they desired, including no scheme whatsoever. Then, when they were done writing, they would send the workbook to the editor. Later, after the editing and review process, the writers added those fragments to a help project document or to a chapter document.

One of the more difficult issues we faced was generating unique IDs for the XML fragments. Without unique IDs, cross-referencing becomes impossible. And without a content management system that can generate and maintain unique IDs across the entire source base, it is difficult to maintain unique IDs. We decided to use file names, minus the file extension, as the ID values for each XML fragment. While this approach does not totally eliminate the possibility of duplicate IDs, it does reduce the chances of duplication. However, to use this method, you need to follow a strict naming standard. We customized XMetaL to assign the file name as the ID for each XML fragment created, sparing the authors from the burden of having to assign and verify the IDs. Additionally, we developed a script that runs nightly and reports any duplicate IDs found in our XML source.
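A duplicate-ID report of this kind can be quite small when the ID is simply the file name minus its extension. The sketch below is a hedged illustration in Python, not the script the group actually runs; the function name, folder names, and file names are hypothetical, and it assumes that comparing file names is sufficient.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def duplicate_ids(repository: Path) -> dict[str, list[Path]]:
    """Map each ID (file name minus extension) to its files when more than one file claims it."""
    seen = defaultdict(list)
    for fragment in sorted(repository.rglob("*.xml")):
        seen[fragment.stem].append(fragment.relative_to(repository))
    return {frag_id: paths for frag_id, paths in seen.items() if len(paths) > 1}


# Exercise the check on a throwaway repository with one deliberate clash.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    for name in ("fields/logging_on.xml",
                 "procedures/logging_on.xml",
                 "procedures/adding_a_user.xml"):
        path = repo / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    duplicates = duplicate_ids(repo)

for fragment_id, paths in duplicates.items():
    print(fragment_id, [str(p) for p in paths])
```

The example also shows why the naming standard matters: two files in different topical folders can share a name, and only a repository-wide scan will catch the collision.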

4. Implementation

The final two steps in implementing the architecture were to convert the legacy information to the new XML markup and to integrate the system into our publication process.

4.1. Converting Legacy Content, Training, and Moving Forward

Because we were dealing with a fairly small set of legacy content (approximately 1500 HTML files), we decided to combine the conversion of the legacy HTML information with the authoring tool and DTD training.

Our source was already in HTML format. We used HTML Tidy to perform a batch conversion of our source files to XML. We had originally intended to use XSLT to transform the converted HTML files to our XML vocabulary. However, our XML is semantically richer than the HTML we had been using, even with class-tagged HTML elements. We did not have the time to create scripts to automate the entire conversion process, so some form of manual conversion was going to be necessary.

So, after several days of training on the authoring tool and workflow processes, we gathered all of the writers together and had them manually apply our XML to the converted HTML files. This conversion not only helped them to become familiar with the tool that they would be using, but it also reinforced the training that they just had on our DTD.

Within a week, we had the bulk of our source files converted to our XML markup, and the writers were familiar enough with the markup and tools to finish the job without supervision.

4.2. Integration into the Workflow

Our first step was to integrate the online help into the product build. We use Make and Korn shell scripts to start the help compilations. The product build scripts call the help build script. The help build then applies a series of XSLT transformations to the XML source, then calls the help compiler.

The first XSLT creates a help project file from a master XML file. The master XML file contains the help system settings, such as the default page that appears when you open the help system, as well as the files and file hierarchy for the help system. The second XSLT call creates the table of contents file for the help system. The third XSLT call creates the index file. The final XSLT call transforms each of the referenced XML files in the master XML document to HTML. The help build script then calls the appropriate help system compiler (HTML Help, Windows Help, or any other proprietary help compiler) and builds the help system.
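To illustrate the first of these steps, the sketch below derives an HTML Help project file from a master document. It is written in Python as a stand-in for the XSLT described above; the master document's element and attribute names ("help", "topic", "name", "default", "ref") are assumptions, and only a handful of project file options are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical master document: help system settings plus the topic hierarchy.
MASTER = """
<help name="help" default="overview">
  <topic ref="overview">
    <topic ref="logging_on"/>
  </topic>
  <topic ref="glossary"/>
</help>
"""


def help_project(master: ET.Element) -> str:
    """Render the settings and file list of a master help document as an HTML Help project file."""
    name = master.get("name")
    lines = [
        "[OPTIONS]",
        f"Compiled file={name}.chm",
        f"Contents file={name}.hhc",
        f"Index file={name}.hhk",
        f"Default topic={master.get('default')}.html",
        "",
        "[FILES]",
    ]
    # iter() walks nested topics in document order, flattening the hierarchy.
    lines += [f"{topic.get('ref')}.html" for topic in master.iter("topic")]
    return "\n".join(lines)


project = help_project(ET.fromstring(MASTER))
print(project)
```

The table of contents and index steps would walk the same master document, which is why populating that one file is the only build-related task left to the writers.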

The help build is performed each time the product is compiled. The only involvement the writers have is to populate the master XML file for the help project with the topics they are working on.

Hardcopy output is accomplished by arranging the XML topics in a chapter file, and then using XSLT to transform the XML chapters to FrameMaker MIF format. FrameMaker is still involved in the workflow because the corporate publishing process requires FrameMaker documents. The writers perform this transformation on an as-needed basis, then clean up the results in FrameMaker. We have developed a small application that performs this transform for the writer, making it easier for them to use.

5. Future Directions

The system is far from perfect. As the amount of information we are managing grows, we are quickly reaching the limits of practical usage of the system. However, we continue to make improvements to the system until we can adopt a full XML authoring system. Some of the future improvements will include:

So, despite the potential of reaching the useful limits of this system, our efforts have not been in vain. As the corporate XML authoring initiative develops, we are positioned to quickly transition to the new system. In fact, we have been selected as one of the pilot technical documentation groups for the system, and have been given the opportunity to provide information about what is required for each phase of the publishing system.

6. Lessons Learned

The AIC documentation group came out of this experience with some useful lessons.

Planning is critical. We had many false starts where things did not work the way we thought they would. Before you even begin to create a DTD, decide how you will embed your XML chunks in other chunks. Will you use the XML standard for external parsed entities? XLink? Transclusion? What level of reuse do you want to attain? What output is required? What are the dependencies and restrictions of the output formats? Map out how you plan to store and find your information chunks. Knowing the answers to these questions can help you avoid many false starts.

Keep the development team small. Forming committees and tiger teams is fine, and so is gathering feedback from all of the project stakeholders. But, if you want to make progress, keep those activities to a minimum. Provide them with something to comment on first, then solicit their feedback. Nothing can derail a project faster than "requirements" pulled out of thin air. As one of my co-workers pointed out, it is easy to bog down in "analysis paralysis".

Test the system on "live" information. When we first tested the system, we used information that was constructed to prove that the system worked. Naturally, it worked. When we converted our first project to our system, we discovered information types that we did not account for and gaping holes in our architecture.

7. Summary

So, were we able to meet our goals and requirements? The answer is an unqualified "yes". We have provided our authors with a single authoring environment. Although we were not able to completely eliminate one of the tools (FrameMaker) from the equation, we were able to minimize the work that needs to be done with that tool, and the need to learn new corporate templates. We are able to provide four types of online help and hardcopy documents from a base set of information modules. We are able to reuse information across documents and even across products. We have an "edit once, appear everywhere" workflow. And, because our authoring tool provides a context-aware selection of tags, we are able to provide guided authoring for our topic types. But most of all, we were able to eliminate much of the complexity and the workarounds used in the old publishing model, making the system easier to use. Feedback from the authors, of which I am one, has been overwhelmingly positive.

Additionally, using a flat-file XML system has allowed the AIC documentation group to multiply the number of products that they support fivefold without increasing the number of staff supporting those products. We were also able to increase the types of deliverables available to the customer.

However, the system was not without its shortcomings:

Because of these limitations, this method is not adequate for an enterprise publishing solution. However, it can be used to prototype an XML publishing solution or, as seen here, to manage the content for a small team working on a small set of related products.

However, there are also benefits to such a system:

8. Additional Information

For a more detailed analysis of how the AIC documentation group implemented a file-system-based XML authoring solution, visit http://members.home.com/jasonlemair/XML/. There you can find DTD files, XMetaL customizations, and working scripts and XSLT files for creating HTML Help systems.

Acknowledgements

I would like to thank the following people:

Glossary

AIC

ACS, IDS, and CSPM

CSPM

Cisco Secure Policy Manager

CSS

cascading style sheets

HTML

Hypertext Markup Language

PDF

Portable Document Format

XML

Extensible Markup Language

Biography

Jason Willebeek-LeMair
Technical Writer
Cisco Systems
Savoy
Illinois
U.S.A.
Email: jasonlemair@insightbb.com Web: http://jasonlemair.home.insightbb.com

Jason Willebeek-LeMair has over 8 years of technical communication experience. For the last 4 years, he has worked extensively with SGML and XML, both as an author and as an architect.

Jason is currently employed at Cisco Systems, Inc. as the lead technical writer for the Cisco Secure Policy Manager (CSPM) product. He was also the lead developer of the ACS, IDS, and CSPM (AIC) group's XML publishing system.