XML
Willebeek-LeMair, Jason, Technical Writer, Cisco Systems, Savoy, Illinois, U.S.A.
Email: jasonlemair@home.com
Web site: http://members.home.com/jasonlemair/XML/
Jason Willebeek-LeMair has over 8 years of technical communication experience. For the last 4 years, he has worked extensively with SGML and XML, both as an author and as an architect.
Jason is currently employed at Cisco Systems, Inc. as the lead technical writer for the CSPM product. He was also the lead developer of the AIC group's XML publishing system.
When you hear about XML publishing, you mostly hear about databases, workflow tools, and content management systems. These are typically costly systems aimed at the information management needs of larger enterprises, where the sheer volume of information pumped through these systems provides a fairly rapid return on investment. This fosters the perception that you need one of these complex, expensive, enterprise solutions to take advantage of the modularity and flexibility of authoring in XML.
That is simply not true. You can realize the benefits of publishing from modularized XML, without the expense of an enterprise publishing system, by implementing the authoring environment on top of nothing more than your operating system's file system. Although this environment is not adequate for enterprise publishing needs, it is more than adequate for the needs of small writing teams, businesses with a limited number of related products, proof-of-concept demonstrations, and even home users.
The AIC documentation group at Cisco Systems has implemented such an authoring environment. We have been able to reuse and re-purpose modular, XML-based information without implementing a database back end. By examining how the AIC team implemented XML in a flat-file environment, you will see:
the decisions you need to make before implementing a flat-file XML system
the trade-offs, drawbacks, and pitfalls of implementing a flat-file environment (as compared to a database publishing environment)
the benefits of XML that are still available, even without the database
a migration path to a more traditional publishing environment
The AIC documentation group consists of seven writers who were originally dedicated to a single product, CSPM. However, by implementing an XML publishing solution, the group has been able to increase its efficiency and now supports five related products without increasing the number of supporting writers. Additionally, what started out as a local effort has led to involvement with a corporate-wide XML publishing initiative.
This case study shows how the AIC documentation group implemented a file-system-based XML publishing solution and shows some of the benefits and drawbacks of doing so.
To fully understand the challenges that faced the AIC documentation team, you must understand the problems the group faced that led to our XML solution:
The AIC documentation group supports several software products that use a variety of information delivery formats, including:
Many times, information is shared between formats. For example, field definitions used in an HTML Help system are also used in a WinHelp system to provide field-level, context sensitive help.
Two different processes were used to create the various formats:
Case 1: Hybrid-HTML--The Hybrid-HTML solution abused the "forgiving nature" of HTML to embed information into HTML source files. The embedded information was typically ignored by browsers. The source files were then converted to the different output formats using custom C++ scripts to create source for the Microsoft help systems and FrameMaker documents.
Case 2: FrameMaker--FrameMaker source files were converted to HTML, which was manually cleaned up and wired with context calls for use as a generic HTML help system.
Both cases relied on FrameMaker documents to convert to PDF output using the standard Cisco publishing system, which was also responsible for generating yet another HTML format that is posted to the Cisco.com website.
The primary process failure encountered in both cases was caused by the same problem. Once we converted to a particular deliverable type, such as FrameMaker or raw HTML, it was easier to make changes in the multiple source files than to perform another round of conversions. This failure to maintain a single source resulted in "stale" information--information that was updated in one output format but not in another--across each type of output, frustrating writers and editors alike.
Another process failure was the inability to train writers and editors on the use of the "hybrid" HTML format used to produce the HTML Help system, which resulted in those who understood the latest conversion scripts having to do clean-up work at the end of a release to convert any new source over to all of the required formats.
The following examples show some of the code used in the "hybrid" HTML markup. The first example demonstrates the use of the class attribute to "type" the HTML elements:
<table border="0" cellpadding="4">
  <tr valign="top">
    <td class="note" width="65"><b>Note:</b></td>
    <td class="note">Note Text</td>
  </tr>
</table>
This example shows the code needed to embed index entries in an HTML file for use with the Microsoft HTML Help compiler:
<object type="application/x-oleobject" classid="clsid:1e2a7bd0-dab9-11d0-b93a-00c04fc99f9e">
  <param name="Keyword" value="Log on to Cisco Secure Policy Manager dialog box">
  <param name="Keyword" value="Log on to Cisco Secure Policy Manager dialog box,field values">
  <param name="Keyword" value="Cisco Secure Policy Manager,logging on to">
</object>
The final example shows the misuse of the HTML "meta" element to embed help context IDs in an HTML file:
<tr valign=top> <meta map="WinHelp" content="HIDC_OPEN_LOCAL_OPENBOX, HIDC_OPEN_BOXSERV_OPENBOX, HIDC_OPEN_REMOTE_OPENBOX, HIDC_OPEN_SERVER_OPENBOX, HIDC_OPEN_PORT_OPENBOX">
Creating these structures in a commercial HTML editor is not a straightforward process, and we were seeing a lot of errors, such as missing class attributes, in the HTML markup.
Finally, the heading levels used in the help system did not correspond to the heading levels used in the hardcopy. The information was simply arranged differently. At first, we worked around this problem by using a cascading style sheet to make all of the HTML heading levels appear the same, then we set the heading levels to those used in the books. This trick worked until the authors became used to reusing information from other areas. What was a first-level heading in one book was not a first-level heading in another. We started seeing conflicts.
The primary dependency we faced was that the help systems had to build with the products they accompanied, because the install scripts picked them up during the production phase. Therefore, we could not simply pull the help systems out of the build until we were finished and ready to reconvert; doing so broke milestone builds of the products that we supported and delayed product testing.
In addition, we saw on the horizon the need to support another proprietary help system that would not be able to use the HTML markup that we had developed in Case 1 or the HTML that resulted from the FrameMaker conversion in Case 2. Eventually, that source information would have to be ported to another help system format.
To produce the various forms of documentation supported by the AIC group, the writers had to know:
Bringing a new writer up to speed on these tools, and on the processes we followed to create the deliverables, was an arduous task that took almost 6 months and required very heavy editing cycles. We did not have an effective method of isolating the technical intricacies of the solution to just a few programmer/writers. Therefore, new staff could not include entry-level writers.
We knew that there had to be a better way to reuse the information that we developed and support the multiple output formats currently in use and any future formats. XML had just arrived on the scene, and I was given the task to see if it would be feasible to move our information to XML and to automate much of the deliverable output process.
When we first started to investigate using XML to solve our information delivery issues, none of us had extensive experience in developing a publishing system based on a markup language, although several of us had used SGML extensively. All of our research pointed towards the need for the three main components of any XML -based publishing system: authoring, content management, and content delivery.
However, this was a local effort meant to alleviate the problems we encountered in our authoring process. At the time, a content management system was beyond the scope of our needs and budget--we did not need an enterprise-level solution for a group of seven writers.
So, we developed a system that used the native file system of our file server as the content management component of our XML solution. There were some trade-offs, of course, but doing so also taught us exactly what we needed in a content management system.
There was a strong temptation to start developing the system immediately. XML was a relatively new technology, and our writing group thrives on being on the cutting edge of technology. However, reason prevailed, and we set forth to define the goals and requirements of the system first.
Provide a single, standard authoring environment for the technical writers working on the CSPM documentation. We wanted to eliminate the numerous tools the writers needed to know in order to produce the documentation.
Provide a mechanism to single-source various documentation deliverables from a base set of information modules.
Facilitate the reuse of information to enhance the consistency of the documentation, reduce the duplication of work that we were seeing across the documentation set, and eventually enable the sharing of information across product and business lines.
Provide an "edit once, appear everywhere" workflow to prevent duplicated information from becoming "stale".
Provide authoring guidance during the writing process. Through the use of DTD and templates, we needed to make our markup more intuitive and to make sure that required elements were not omitted.
Provide a structured set of tags that could be used during the authoring process and scripted more easily to the desired output. At the time, XSLT was not yet out or widely supported, so we were thinking in terms of custom scripting. We were also planning to make some of the markup, such as the index entries, much easier to incorporate into the source documents.
Moreover, our overriding goal for the system was to create an authoring environment that was less complex than the one the authors were currently using. If we could not make the system easier to use, then we would not do it at all. We wanted to make the writers' jobs easier, not more difficult.
The requirements were slightly more difficult to define. Because we were not using a content management system, we focused on the authoring tool. We realized that the authoring tool would need to assume some of the functions of a content management system, while the other functions, such as generating unique IDs for the XML fragments and documents, would have to be assumed by our internal processes. Therefore, we defined these requirements:
DTD or Schema support with "tag chooser" to help guide the writers while they were developing content.
A styled authoring environment to help the writers differentiate between the various elements they were developing.
The ability to save fragments out of an XML file. Because we were not using an XML content management system, which can typically decompose XML documents into pre-defined chunks, we needed the authoring tool to support this function.
From the list of several authoring tools that met these requirements, we chose XMetaL by SoftQuad. XMetaL provided an interface much like the HTML authoring tool the authors were accustomed to using and provided a rich programming environment so that we could develop custom scripts and programs to take the place of features that would be provided by an XML content management system.
Once we determined our needs and that XML met those needs, it was time to develop the architecture. We were faced with the choice of using an existing XML DTD, such as DocBook, or to create our own. We also had to figure out how to manage our XML fragments in a file system-based work environment.
Developing the DTD was fairly easy. We already had a well-defined and chunked information architecture that we used when authoring in HTML (see original_architecture). We decided that, because the authors were already familiar with this architecture, it would be best to codify it in an XML DTD instead of using an existing DTD.
The Original Information Architecture
We ended up with a three-tiered, topic-based DTD architecture (see architecture). We define a topic as a heading plus accompanying text and graphics. This architecture has been surprisingly flexible in meeting the needs of a diverse group of products.
The Tiered Architecture of the CiscoBook Family of DTDs
The bottom tier of our information architecture, the Core DTD, contains the body-level elements, such as paragraphs, lists, index entries, tables, and so on. For this portion of the DTD, we borrowed heavily from HTML. The authors were already familiar with HTML markup, so we thought that making the change from authoring in HTML to XML would have less of an impact if they used familiar tags. However, we did simplify several aspects of the base markup. For example, instead of using a complex table structure for notes, we created a simple <note>Note Text</note> tag that is transformed to the correct format.
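A minimal sketch of that kind of transformation follows. It expands a <note> element into the two-cell table layout shown earlier for hybrid-HTML notes. The production system used XSLT rather than Python, so this only illustrates the mapping; the function name note_to_table is hypothetical.

```python
import xml.etree.ElementTree as ET


def note_to_table(note):
    """Build the HTML table layout for one <note> element."""
    table = ET.Element("table", {"border": "0", "cellpadding": "4"})
    row = ET.SubElement(table, "tr", {"valign": "top"})

    # First cell: the fixed "Note:" label.
    label = ET.SubElement(row, "td", {"class": "note", "width": "65"})
    ET.SubElement(label, "b").text = "Note:"

    # Second cell: the note text plus any inline child markup.
    body = ET.SubElement(row, "td", {"class": "note"})
    body.text = note.text
    body.extend(list(note))
    return table


source = ET.fromstring("<note>Note Text</note>")
html = ET.tostring(note_to_table(source), encoding="unicode")
print(html)
```

The writer types one element, and the table scaffolding is generated the same way every time, which removes the missing-class-attribute errors that hand-built tables invited.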
The middle tier of the DTD contains the topic-specific modules. This tier relies on the Core DTD to provide the basic body markup, while providing the topic-specific markup for the information type. For example, the procedure.dtd contains step and result elements, which are not found in any of the other topic types.
The middle tier was also defined as our lowest level of reusable content. We used the XML definition of external parsed entity to include XML fragments based on the middle tier in the documents defined by the top tier, the deliverable DTDs. However, this decision led to additional difficulties. There were lower-level structures that we wanted to reuse between information types, such as using the expected values of our field definitions in the steps of our procedures. It also prevented us from being able to edit a topic as a standalone document, since the definition of external parsed entity precludes the use of a document type definition within an XML fragment. This limited our ability to nest XML documents to one level of inclusion. We later changed this decision and are now using a different method of inclusion that allows us to edit and reuse smaller units of information.
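The inclusion mechanism can be sketched with Python's expat bindings, which hand each external entity reference to a callback. The entity name, the file name logon.xml, and the element names are hypothetical, and the fragment is held in a dictionary rather than read from disk so that the sketch is self-contained. Note that the fragment carries no document type declaration of its own, which is the restriction described above.

```python
import xml.parsers.expat

# A deliverable document that pulls in one topic fragment by entity reference.
MASTER = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ENTITY logon SYSTEM "logon.xml">
]>
<book><chapter>&logon;</chapter></book>"""

# Stand-in for the file system: system identifier -> fragment text.
FRAGMENTS = {
    "logon.xml": "<procedure><title>Logging On</title></procedure>",
}


def expand(master, fragments):
    """Parse `master`, resolving external entities, and list element names."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)

    def external_ref(context, base, system_id, public_id):
        # A child parser reads the fragment in the context of the parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(fragments[system_id], True)
        return 1  # non-zero tells expat to continue

    parser.ExternalEntityRefHandler = external_ref
    parser.Parse(master, True)
    return names


elements = expand(MASTER, FRAGMENTS)
print(elements)
```

The elements of the fragment appear in the parsed stream as if they had been typed inside the chapter, which is why the fragment cannot be opened and validated on its own.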
Finally, the deliverable DTDs in the top tier contain the markup used to arrange the topics for a particular document instance. For example, the book DTD contains markup for arranging topics into chapters and chapters into books. The HTML Help DTD contains the markup used to arrange topics into a help system, as well as additional settings used by the help system.
Heading levels are derived at the top tier. All topics contain a single "title" element with no hierarchical information attached. The book DTD contains enumerated levels, one through four, for arranging the topics into a four-level deep hierarchy. (Four levels are all that are supported by Cisco templates.) The HTML Help DTD allows unlimited nesting of topics, to support the behavior of the HTML Help engine. In this way, heading levels are derived from the topic's placement within the deliverable XML document, eliminating the heading-level conflicts that occurred when the level was explicitly defined at the topic level.
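As a rough illustration of deriving levels from placement, the following Python sketch walks a nested help-system document and reports each topic's level as its nesting depth. The element and attribute names (helpsystem, topic, href) and the file names are hypothetical, not taken from the CiscoBook DTDs.

```python
import xml.etree.ElementTree as ET

# A deliverable document that only arranges topics; the topics themselves
# carry a title but no level information.
HELP_PROJECT = """
<helpsystem>
  <topic href="overview.xml">
    <topic href="logon.xml">
      <topic href="logon_fields.xml"/>
    </topic>
  </topic>
  <topic href="glossary.xml"/>
</helpsystem>
"""


def heading_levels(element, depth=1):
    """Yield (href, level) pairs; the level is the topic's nesting depth."""
    for topic in element.findall("topic"):
        yield topic.get("href"), depth
        yield from heading_levels(topic, depth + 1)


levels = list(heading_levels(ET.fromstring(HELP_PROJECT)))
for href, level in levels:
    print(f"h{level}: {href}")
```

The same topic file can sit at level 1 in one deliverable and level 3 in another without any change to the topic itself.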
This three-tiered structure has proven to be incredibly extensible and flexible. When a new topic type is needed, we can easily create a DTD module that contains only the unique markup for that topic type and insert that module in the middle tier, drawing from the Core DTD for the common markup. For example, during the project, we discovered that we completely neglected markup for a glossary. We were able to quickly identify the unique markup required for a glossary and add it to the DTD without disrupting any of the other topic types. More recently, we were required to support another output type, a proprietary help system. Again, the unique requirements for that help system were identified and added to the top tier without disrupting the current production processes.
Because we lacked a database, we faced several logistical challenges in developing our content repository.
We needed to make the XML fragments easy to find since we did not have built-in search and metadata storage capabilities. We worked around this problem by creating topical sub-folders and a naming standard that reflected the contents of the file. Also, within such a small group, it is easy to ask the person responsible for a feature if they had created a specific piece of information.
We needed a fairly static structure to our XML repository. The XSLT we use to generate our deliverables uses relative paths for items such as embedded stylesheets. We needed to make the structure of our repository consistent so that items, such as our stylesheet links, were not broken during the transform.
We ended up creating a fairly flat file structure (see file-system). New folders and files can be added as long as they maintain the same relative path to the DTD and XSLT files.
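The relative-path constraint lends itself to a simple check. The sketch below is an assumption about how such a check might look, not a script from the original system: it flags any XML file whose relative path to a shared dtd folder differs from the expected one. The folder names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def misplaced_files(repo_root, shared="dtd", expected="../dtd"):
    """Return XML files whose relative path to the shared folder is wrong."""
    root = Path(repo_root)
    misplaced = []
    for path in sorted(root.rglob("*.xml")):
        relative = os.path.relpath(root / shared, start=path.parent)
        if Path(relative).as_posix() != expected:
            misplaced.append(path.relative_to(root).as_posix())
    return misplaced


# Demonstration on a throwaway repository: one file sits one folder too deep.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "dtd").mkdir()
    for name in ("procedures/logon.xml", "fields/deep/port.xml"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("<topic/>")
    misplaced = misplaced_files(tmp)

print(misplaced)
```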
File Structure
The final step was to develop the workflow that the writers would use when developing new or modifying existing content. We wanted to maintain a workflow similar to the one they were used to. Previously, they developed topics in HTML and then added them to a help project file. Unfortunately, because of the embedding scheme we chose for our topics, individual topics could not be edited as standalone documents. They did not contain a document type declaration, which was required if our authoring tool was to provide context-appropriate tags and authoring guidance.
To work around this problem, we developed a new top-level structure called a workbook. All a workbook did was allow the writers to reference multiple topics from a single document. Then, they could open those documents from within the workbook in the authoring tool. The authoring tool would use the document type declaration of the workbook for the embedded fragments.
To create new topics, the writers could copy a template to the appropriate directory in the file system, rename the file according to stringent guidelines, then include that file in a workbook. They would then add the new content to the workbook.
The workbook turned out to have additional benefits. The writers were able to create workbooks based on products, on technologies, or whatever organizational scheme they desired, including no scheme whatsoever. Then, when they were done writing, they would send the workbook to the editor. Later, after the editing and review process, the writers added those fragments to a help project document or to a chapter document.
One of the more difficult issues we faced was generating unique IDs for the XML fragments. Without unique IDs, cross-referencing becomes impossible. And without a content management system that can generate and maintain unique IDs across the entire source base, it is difficult to maintain unique IDs. We decided to use file names, minus the file extension, as the ID values for each XML fragment. While this approach does not totally eliminate the possibility of duplicate IDs, it does reduce the chances of duplication. However, to use this method, you need to follow a strict naming standard. We customized XMetaL to assign the file name as the ID for each XML fragment created, sparing the authors from the burden of having to assign and verify the IDs. Additionally, we developed a script that runs nightly and reports any duplicate IDs found in our XML source.
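A duplicate-ID report of this kind can be sketched in a few lines of Python. This is not the script the group ran; it assumes, as described above, that a fragment's ID is its file name minus the extension, and the folder and file names in the demonstration are hypothetical.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def duplicate_ids(repo_root):
    """Map each ID (file name minus extension) used more than once to its files."""
    root = Path(repo_root)
    seen = defaultdict(list)
    for path in sorted(root.rglob("*.xml")):
        seen[path.stem].append(path.relative_to(root).as_posix())
    return {frag_id: files for frag_id, files in seen.items() if len(files) > 1}


# Demonstration on a throwaway repository with two files named logon.xml.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("procedures/logon.xml", "fields/logon.xml", "fields/port.xml"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("<topic/>")
    report = duplicate_ids(tmp)

for frag_id, files in report.items():
    print(f"duplicate ID {frag_id!r}: {', '.join(files)}")
```

Because the topical sub-folders allow two files with the same name to coexist, the scan has to cover the whole repository rather than one folder at a time.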
The final two steps in implementing the architecture were to convert the legacy information to the new XML markup and to integrate the system into our publication process.
Because we were dealing with a fairly small set of legacy content--approximately 1500 HTML files--we decided to combine the conversion of the legacy HTML information with the authoring tool and DTD training.
Our source was already in HTML format. We used HTMLTidy to perform a batch conversion of our source files to XML . We had originally intended to use XSLT to transform the converted HTML files to our XML vocabulary. However, our XML is semantically richer than the HTML we had been using, even with class-tagged HTML elements. We did not have the time to create scripts to automate the entire conversion process, so some form of manual conversion was going to be necessary.
So, after several days of training on the authoring tool and workflow processes, we gathered all of the writers together and had them manually apply our XML to the converted HTML files. This conversion not only helped them to become familiar with the tool that they would be using, but it also reinforced the training that they just had on our DTD.
Within a week, we had the bulk of our source files converted to our XML markup, and the writers were familiar enough with the markup and tools to finish the job without supervision.
Our first step was to integrate the online help into the product build. We use Make and Korn shell scripts to start the help compilations. The product build scripts call the help build script. The help build then applies a series of XSLT transformations to the XML source, then calls the help compiler.
The first XSLT creates a help project file from a master XML file. The master XML file contains the help system settings, such as the default page that appears when you open the help system, as well as the files and file hierarchy for the help system. The second XSLT call creates the table of contents file for the help system. The third XSLT call creates the index file. The final XSLT call transforms each of the referenced XML files in the master XML document to HTML . The help build script then calls the appropriate help system compiler ( HTML Help, Windows Help, or any other proprietary help compiler) and builds the help system.
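To make the first step concrete, the sketch below generates the text of an HTML Help project file from a small master document. It is written in Python instead of XSLT purely so that it is self-contained, and the master document's element and attribute names (helpsystem, topic, compiled, default, href) and the toc.hhc and index.hhk file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A master file: help system settings plus the topic hierarchy.
MASTER = """
<helpsystem compiled="cspm.chm" default="overview.xml">
  <topic href="overview.xml">
    <topic href="logon.xml"/>
  </topic>
</helpsystem>
"""


def html_name(href):
    """Name of the HTML file that a topic's XML source is transformed to."""
    return href.rsplit(".", 1)[0] + ".html"


def build_project(master_xml):
    """Return the text of an HTML Help project (.hhp) file."""
    root = ET.fromstring(master_xml)
    files = [html_name(topic.get("href")) for topic in root.iter("topic")]
    lines = [
        "[OPTIONS]",
        f"Compiled file={root.get('compiled')}",
        "Contents file=toc.hhc",
        "Index file=index.hhk",
        f"Default topic={html_name(root.get('default'))}",
        "",
        "[FILES]",
        *files,
    ]
    return "\n".join(lines) + "\n"


project = build_project(MASTER)
print(project)
```

The table of contents, index, and per-topic HTML steps read the same master document, so the writers maintain one list of topics and the build derives everything else from it.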
The help build is performed each time the product is compiled. The only involvement the writers have is to populate the master XML file for the help project with the topics they are working on.
Hardcopy output is accomplished by arranging the XML topics in a chapter file, and then using XSLT to transform the XML chapters to FrameMaker MIF format. FrameMaker is still involved in the workflow because the corporate publishing process requires FrameMaker documents. The writers perform this transformation on an as-needed basis, then clean up the results in FrameMaker. We have developed a small application that performs this transform for the writer, making it easier for them to use.
The system is far from perfect. As the amount of information we are managing grows, we are quickly reaching the limits of practical usage of the system. However, we continue to make improvements to the system until we can adopt a full XML authoring system. Some of the future improvements will include:
Although submitting FrameMaker documents continues to be a requirement in our publishing process, we are looking for ways to automate the process through FrameScript or FOP.
As mentioned before, having our topic-level XML files as fragments has limited some of the reuse potential inherent in the system. We are quickly transitioning our topic-level files to be used as standalone files. This approach will allow authors to edit single topics without having to embed the topic in a higher-level document, and it will allow us to define smaller levels of reusable information.
Despite our best efforts, we over-tagged our content. We are currently looking at combining some of the redundant tags and eliminating the unnecessary ones.
We are continuously looking at new methods for delivering information to our customers, such as JavaHelp or embedding XML -based information in the application interface.
Workflow and Content Management
In the middle of developing this system, we became aware of a corporate initiative to move marketing and technical communication to XML . As a result of our work on this system, we have become involved in the corporate solution, and will be one of the first technical publications groups to use the new authoring system.
So, despite the potential of reaching the useful limits of this system, our efforts have not been in vain. As the corporate XML authoring initiative develops, we are positioned to quickly transition to the new system. In fact, we have been selected as one of the pilot technical documentation groups for the system, and have been given the opportunity to provide information about what is required for each phase of the publishing system.
The AIC documentation group came out of this experience with some useful lessons.
Planning is critical. We had many false starts where things did not work the way we thought they would. Before you even begin to create a DTD, decide how you will embed your XML chunks in other chunks. Will you use the XML standard for external parsed entities? XLink? Transclusion? What level of reuse do you want to attain? What output is required? What are the dependencies and restrictions of the output formats? Map out how you plan to store and find your information chunks. Knowing the answers to these questions can help you avoid many false starts.
Keep the development team small. Forming committees and tiger teams is fine, and so is gathering feedback from all of the project stakeholders. But, if you want to make progress, keep those activities to a minimum. Provide them with something to comment on first, then solicit their feedback. Nothing can derail a project faster than "requirements" pulled out of thin air. As one of my co-workers pointed out, it is easy to bog down in "analysis paralysis".
Test the system on "live" information. When we first tested the system, we used information that was constructed to prove that the system worked. Naturally, it worked. When we converted our first project to our system, we discovered information types that we did not account for and gaping holes in our architecture.
So, were we able to meet our goals and requirements? The answer is an unqualified "yes". We have provided our authors with a single authoring environment. Although we were not able to completely eliminate one of the tools (FrameMaker) from the equation, we were able to minimize the work that needs to be done with that tool, and the need to learn new corporate templates. We are able to provide four types of online help and hardcopy documents from a base set of information modules. We are able to reuse information across documents and even across products. We have an "edit once, appear everywhere" workflow. And, because our authoring tool provides a context-aware selection of tags, we are able to provide guided authoring for our topic types. But most of all, we were able to eliminate much of the complexity and the workarounds used in the old publishing model, making the system easier to use. Feedback from the authors, of which I am one, has been overwhelmingly positive.
Additionally, using a flat-file XML system has allowed the AIC documentation group to multiply the number of products that it supports five-fold without increasing the number of staff supporting those products. We were also able to increase the types of deliverables available to the customer.
However, the system was not without its shortcomings:
We do not have a built-in method to see where entities are being used, or if they are being used at all, so we cannot see what documents will be affected by our changes.
We cannot manage link/link-end pairs. ID/IDREF attributes are checked only within a single document; however, using the document() function to include standalone XML documents by reference does not validate the included document against the parent document.
We do not have an automatic method for creating and enforcing unique ID values. The best we can do is scan our source for duplicate ID values.
We do not have the ability to link to previous versions of an XML fragment. Once a fragment is changed, it is changed everywhere it appears.
We do not store much metadata, such as product or software version, within our XML documents. We feel that information of this type is best stored at the database level. For the current system, this makes it difficult to determine which fragments are part of which product or products. However, our file and folder naming conventions partially mitigate this problem.
Because of these limitations, this method is not adequate for an enterprise publishing solution. However, it can be used to prototype an XML publishing solution or, as seen here, to manage the content for a small team working on a small set of related products.
However, there are also benefits to such a system:
You can quickly prototype an XML publishing system as a proof-of-concept example.
You can use this type of system for small groups of writers, where a full XML publishing system may not be warranted.
You can use it at home, where a content management system may just be beyond the family budget.
You can use this type of system as a starting point before migrating to a full XML publishing system. One of the benefits of this system is that it does not require a large investment to set up. It allows you to discover your requirements before you invest in a full system that may or may not meet your needs.
For a more detailed analysis of how the AIC documentation group implemented a file-system-based XML authoring solution, visit . There you can find DTD files, XMetaL customizations, and working scripts and XSLT files for creating HTML Help systems.
I would like to thank the following people:
Blaine McNutt, who backed this project and was a co-architect of the system
Mark Wilgus, who was a co-architect of the system, who was responsible for developing all of the early XSLTs, and who volunteered his project as the first victim
Corey Spitzer, who created custom XMetaL scripts and applications that made the writers' work easier, who created XSLTs for the later output formats, and who created reporting scripts that detect duplicate IDs and dead-end links
Michael Priestly of IBM, who confirmed that I was not insane in thinking that such a system would work
and most of all, the writers of the AIC documentation group, who suffered, with good cheer, having this system imposed on them