XML Europe 2004

XML to MS Word

A Standards-Based Open-Source Approach for Publishing to Microsoft Word

Abstract

The ubiquity of Microsoft Office products is a reality of business in the global enterprise. Even in organizations moving to a standards-based extensible information management system, MS Word has remained a dominant format for inter-enterprise information exchange, and in many circumstances is a required format. This creates a significant work-flow issue if the document provider is using XML as the primary document format in any of their organizational processes. This paper details the design and implementation of a system which transforms XML to Microsoft Word documents in a completely extensible fashion.

There are several commercial products which are specifically designed to address the XML to Word issue. Depending on the specific business requirements, these may present a sufficient ROI. The methodology described in this paper is for those who cannot afford such COTS solutions, wish to leverage existing in-house knowledge of XML and XSLT, or need the degree of customization that the approach laid out provides.

The fundamental problem in producing Word Documents from XML is the basic one of moving from a semantically rich document format to one that is solely format oriented. XSLT's design addresses this issue. The problem with trying to use XSLT directly is that it implies using RTF as an output target. Producing RTF output that contains all of the styling required, is valid, and is robust to schema changes is notoriously hard to do. The system design described in this paper takes the novel approach of letting Microsoft Word do the styling for us. The basic design has the following flow: XML Source Document ----> XSLT transformation to a structurally consistent XML format ----> DOM processing of intermediary form used to programmatically construct the Microsoft Word document.

The basic design described in this paper requires only an XSLT stylesheet engine, an XML parser, and a programmatic means of communicating with the Microsoft Word COM API. For the reference implementation we use Python as the basic implementation language, the 4Suite XML tools for XML parsing and XSLT transformation, and the Python win32 extensions for COM communication.

Readers will be left with a clear understanding of the complexities involved in an XML to Word solution and the technical knowledge needed to do it.


Table of Contents

1. Introduction
2. Requirements
3. Potential Solutions
3.1. Word ML
3.2. COTS
3.3. Custom
3.3.1. XSLT to RTF
3.3.2. XML to COM
4. Implementation
4.1. System Design
4.2. First Pass
4.3. Refactored Solution
5. Limitations
6. Summary
Bibliography
Biography

1. Introduction

Microsoft Word is an extremely common document format in most of today's enterprises. It is used to exchange all types of information, and a very large user base is comfortable working with it.

XML is fast becoming the de facto form of information interchange. It is entirely platform independent, and the tool-sets which exist to leverage value from it continue to grow. However, the inherent cognitive shift required for everyday users to effectively produce and use it has prevented it from outright replacing other non-standard formats, such as Word. Use of XML and its related standards for providing information management in back-end systems is on the rise.

This raises the following question: As back-end information systems continue to leverage the flexibility and functionality of XML, what sort of options are available for providing this information in a form that is usable by end users, especially as a word processing document?

The general approach of XML-centric systems has been to produce PDF documents via XSL Formatting Objects [XSL:FO]. This is a viable option if the requirements call for a platform-independent, read-only format. Although the full version of Adobe Acrobat [ACRO] supports editing and annotating of PDF documents, the user base is smaller, and its design has never been that of a full-fledged word processor. This precludes its use if requirements call for a well-supported word processing format.

Depending on the distribution of produced documents, and the design of the system to produce them, use of PDF also implies additional licensing costs, primarily in the form of an XSL formatting engine and a fully licensed version of Acrobat.

This paper details a case study of a system in which Microsoft Word was selected as our output target. The system was designed with an overarching focus on simplicity of solution and minimization of costs.

2. Requirements

The code was developed for a project with the following requirements:

  • Input documents would be XML conforming to a known DTD.

  • Output would be in Microsoft Word 2000 format.

  • End Users would be enabled to configure formatting details of the MS Word documents produced.

Circumstances contributing to these requirements included the following:

  • Producers of source document were a relatively small pool of XML experienced users.

  • Consumers of the information captured in the source documents had no facility with XML. In order to keep costs down, no plans were in place to train them.

  • All consumers had existing licenses for the use of Microsoft Office 2000, and no plans were in place to upgrade to a more current version.

  • Consumers required the ability to modify the rendered documents, and administrators required the ability to modify the styling with which the documents were produced.

  • Once the documents were published, there was no requirement that they be back-linked to their source documents.

  • Other than development costs, no additional funds were available for software not already licensed to the customer.

3. Potential Solutions

There are actually several options for producing Word documents from XML. Our selection from this list was constrained by the tight budget and the existing requirements. This section touches on these options and notes the factors that led to the solution we developed.

3.1. Word ML

One potential solution is to produce documents directly in WordprocessingML, an XML encoding of Microsoft Word documents. The solution would be purely an exercise in XSLT development, and is extremely viable. Although XML-encoded Word is still fairly young in its life cycle, we believe that this option would have allowed us to meet most of the requirements laid out.

We did not select this option primarily because Word ML is a new feature of Microsoft Word 2003, and no licenses for this version were available. Additional reasons for declining this approach include:

  • The level of styling control by non-experts was deemed to be insufficient. Through appropriate design, the styling configuration could have been factored out into fairly straightforward configuration files. The styling feedback, and development overhead to achieve this would have been out of scope.

  • The stylesheet development would have been an order of magnitude more difficult than the solution used.

  • The Microsoft Office Word XML Content Development Kit is slated to be released on March 30, 2004. This kit, which includes documentation on using the Word ML schema, is currently available only in a beta version, and would have presented additional risk to the project.

3.2. COTS

There are a number of commercial off-the-shelf (COTS) solutions which attempt to address the base requirement of producing Word documents from XML. Although using COTS software greatly reduces maintenance, none of the tools we evaluated were a close enough match to our base requirements to warrant selection. This was generally due to licensing costs, or to the lack of the type of styling configuration that we required. A good resource for the evaluation of COTS solutions is http://www.xmlsoftware.com/convert.html.

3.3. Custom

3.3.1. XSLT to RTF

A custom solution of producing RTF was deemed too brittle for our situation.

3.3.2. XML to COM

The solution we selected was to use the XML to drive the programmatic construction of an MS Word document through its COM API. The COM API exposed all of the functionality that we desired, and by combining this with Word Templates and named Word styles, the end users would have a high degree of customization available with very little cognitive overhead.
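As a rough illustration of this approach, the following Python sketch drives Word through COM using the Python win32 extensions. It is a minimal sketch under stated assumptions, not the code from the project: the function name write_paragraphs and its arguments are hypothetical, the template is assumed to define every style name passed in, and the code only runs on a Windows machine with Word and pywin32 installed.

```python
import os


def write_paragraphs(paragraphs, template_path, output_path):
    """Create a Word document from (text, style_name) pairs.

    Hypothetical helper for illustration only. Every style name is
    assumed to exist in the Word template.
    """
    # Imported here so the module can be loaded on machines without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # A new document based on the template inherits its named styles.
        doc = word.Documents.Add(os.path.abspath(template_path))
        for index, (text, style) in enumerate(paragraphs):
            if index:
                # Start a new paragraph before every block but the first.
                doc.Content.InsertParagraphAfter()
            doc.Content.InsertAfter(text)
            # Apply the named style to the paragraph just written.
            doc.Paragraphs.Last.Style = style
        doc.SaveAs(os.path.abspath(output_path))
        doc.Close()
    finally:
        word.Quit()
```

A call such as write_paragraphs([("The Article", "sectionStyle")], "styles.dot", "article.doc") would then produce a document whose look is governed entirely by the template; the file names here are placeholders.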

4. Implementation

We selected Python as our implementation language because its exposure of the COM API is very robust, and because it has a well-supported XML library and built-in support for unit testing.

4.1. System Design

The system design is very simple. It consists of three primary classes. Common library classes are not included in the diagram.

Command

This class provides the controller functionality of the system. It receives requests via the command line and manages construction and connection of the other two classes.

Transformation

This class represents a single transformation, understands the structure of the source XML, and encodes the logic of what DocWriter API calls that structure maps to.

DocWriter

This class wraps up the complexity of the Microsoft Word COM API, and exposes the methods for use by the Transformation class.

Figure 1. System Class Diagram
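The division of responsibilities among the three classes might be sketched as follows. This is an illustrative skeleton only: the method names (add_paragraph, run, execute) and the style name titleStyle are hypothetical, and the DocWriter here records its calls rather than talking to Word so that the sketch runs anywhere.

```python
from xml.dom import minidom


class DocWriter:
    """Stand-in for the class that wraps the Word COM API.

    It records calls instead of driving Word (an assumption made so the
    sketch has no Windows dependency).
    """

    def __init__(self):
        self.calls = []

    def add_paragraph(self, text, para_style):
        self.calls.append((text, para_style))


class Transformation:
    """Understands the source XML and maps it to DocWriter calls."""

    def __init__(self, writer):
        self.writer = writer

    def run(self, xml_text):
        doc = minidom.parseString(xml_text)
        # Visit every element in document order.
        for node in doc.getElementsByTagName("*"):
            if node.tagName in ("title", "para"):
                text = "".join(
                    child.data
                    for child in node.childNodes
                    if child.nodeType == child.TEXT_NODE
                )
                style = "titleStyle" if node.tagName == "title" else "Normal"
                self.writer.add_paragraph(text.strip(), style)


class Command:
    """Controller: takes command-line arguments and wires up the other two."""

    def __init__(self, argv):
        self.source_path = argv[1]

    def execute(self):
        writer = DocWriter()
        with open(self.source_path, encoding="utf-8") as handle:
            Transformation(writer).run(handle.read())
        return writer.calls
```

Under these assumptions, a script entry point would call Command(sys.argv).execute() after importing sys.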

For the purposes of this paper, we are transforming documents conforming to the Simplified Docbook DTD[DOCBOOK]. Our test cases use various documents that exercise different document components such as ordered lists, tables, etc. The following is a representative example of the types of documents we are formatting:

<article>
  <title>The Article</title>
  <section>
    <title>The First Section</title> 
    <section>
      <title>Character styles</title>
      <para>A paragraph with a <quote>title</quote>, and 
      <emphasis role="italic">emphasis</emphasis>, and a:</para>
      <note role="note">
        <para>note</para>
      </note>
    </section>
    <section>
      <title>Lists!</title>

      ...

	    <listitem>
	      <itemizedlist>
		<title>An three level embedded list</title>
		<listitem>
		  <para>This is a really, really, really, really, really, 
	  really, really, really, really, really, really, really, really, 
	  really, really, really, really, really, really, really, really, 
	  really, really, really, really, really, really, really long sentence. </para>
		</listitem>
	      </itemizedlist>


4.2. First Pass

The initial document flow of the system was a two-stage process. The input XML document was parsed, and a pulldom (a hybrid SAX/DOM process) was used to dynamically send construction messages to the DocWriter.
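The pulldom style of processing can be sketched with the xml.dom.pulldom module from the Python standard library. This is a minimal sketch rather than the project's code: the functions text_of and stream_blocks and the emit callback are hypothetical names, and only title and para elements are handled.

```python
from xml.dom import pulldom


def text_of(node):
    """Concatenate all text beneath a DOM node, including inline children."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def stream_blocks(xml_text, emit):
    """Send one emit(tag, text) message per title or para element."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName in ("title", "para"):
            # Switch from SAX-style streaming to a DOM subtree for this
            # element only, so inline children can be walked as nodes.
            events.expandNode(node)
            # Collapse runs of whitespace left over from source indentation.
            emit(node.tagName, " ".join(text_of(node).split()))


messages = []
stream_blocks(
    "<section><title>Lists!</title>"
    "<para>A paragraph with <emphasis>emphasis</emphasis>.</para></section>",
    lambda tag, text: messages.append((tag, text)),
)
```

The appeal of this hybrid is that the document is streamed as a whole while each block can still be inspected as a small tree before a construction message is sent.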

Figure 2. Initial Document Flow

For unit testing the Transformer, we used the Mock Object paradigm. This paradigm allowed us to create a "mock object" for the DocWriter that presents the same API as the real class. We could then query the mock object to verify that the correct calls to it had been made. The mock DocWriter was used extensively in the construction of our unit tests. As with all formatting concerns, unit testing the formatted results is extremely difficult. Given the simplicity of the DocWriter API, we decided to forgo unit testing of the actual formatting and focused instead on a manual testing process.
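The idea can be illustrated with unittest.mock from the current Python standard library, which is not necessarily the tooling the project used. The function transform_title and the method add_paragraph are hypothetical stand-ins for a slice of the Transformer logic and the DocWriter API.

```python
from unittest import mock


def transform_title(title_text, writer):
    """Hypothetical slice of Transformer logic: one title, one styled paragraph."""
    writer.add_paragraph(title_text, para_style="sectionStyle")


# The mock exposes only the methods named in spec, so a call to a method
# the real DocWriter lacks would raise AttributeError.
mock_writer = mock.Mock(spec=["add_paragraph"])

transform_title("The First Section", mock_writer)

# Query the mock to verify that exactly the expected call was made.
mock_writer.add_paragraph.assert_called_once_with(
    "The First Section", para_style="sectionStyle"
)
```

Because the assertions are made against recorded calls rather than against a rendered document, tests of this kind run without Word installed.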

Because the system is fairly simple, hooking up the communication infrastructure was relatively easy. To test the infrastructure we simply passed the text through without the formatting information and verified that the Word Document was being created.

With the addition of table handling, the Transformer's logic for mapping input source to its corresponding styles was getting fairly complex, and was not as robust as we would have liked.

4.3. Refactored Solution

We decided to add in one more step in the process and push the mapping logic into a separate XSLT process before we actually construct the Word document. This would produce an interim form of XML that closely matches the Word Document Object Model. That document model is reflected by the API of the DocWriter.

Figure 3. Refactored Document Flow

There are three top level COM objects that we interact with in the course of our Word document construction:

Paragraphs

In MS Word, virtually all block elements, including list items of all types, are paragraphs associated with a particular style.

Ranges of characters

Each range of characters has a style associated with it that provides all of the styling specifics.

Tables

A fairly straightforward table object model. It has methods for interacting with the cells, rows, and columns of the table.

Our refactored solution consisted of modeling these objects in our interim XML form. This XML form flattens all of the source XML and determines the correct styles for the different components. This approach greatly simplifies the Transformer. Now, instead of managing all of the mapping logic, the Transformer simply has to apply the stylesheet and use the interim form to generate calls to the DocWriter. Because we used a mock DocWriter in our unit tests, we were able to use the existing tests as is to verify that our refactored approach left the Transformer with the exact same behavior in constructing the Word documents.
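A sketch of how a flattened interim document could be turned into writer calls is given below. The element and attribute names (para, text, paraStyle, charStyle, indent, restartList) follow the interim form used in this paper; the RecordingWriter class and its start_paragraph and add_text methods are hypothetical, standing in for the DocWriter API, which is not listed here.

```python
from xml.dom import minidom

INTERIM = """<wordBuilder>
  <para paraStyle="sectionStyle">The First Section</para>
  <para paraStyle="Normal">A paragraph with <text charStyle="emphasisitalicStyle">emphasis</text>.</para>
  <para paraStyle="itemizedlistStyle" indent="3" restartList="1">An item</para>
</wordBuilder>"""


class RecordingWriter:
    """Hypothetical stand-in for DocWriter that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start_paragraph(self, style, indent, restart_list):
        self.calls.append(("start_paragraph", style, indent, restart_list))

    def add_text(self, text, char_style):
        self.calls.append(("add_text", text, char_style))


def build(interim_xml, writer):
    """Walk the flat interim form and issue one call per paragraph or text run."""
    root = minidom.parseString(interim_xml).documentElement
    for para in root.childNodes:
        if para.nodeType != para.ELEMENT_NODE or para.tagName != "para":
            continue
        writer.start_paragraph(
            para.getAttribute("paraStyle"),
            int(para.getAttribute("indent") or 0),
            para.getAttribute("restartList") == "1",
        )
        for child in para.childNodes:
            if child.nodeType == child.TEXT_NODE:
                if child.data.strip():
                    # Plain text takes the paragraph's default character style.
                    writer.add_text(child.data, None)
            elif child.nodeType == child.ELEMENT_NODE and child.tagName == "text":
                run = "".join(
                    t.data for t in child.childNodes if t.nodeType == t.TEXT_NODE
                )
                writer.add_text(run, child.getAttribute("charStyle"))


writer = RecordingWriter()
build(INTERIM, writer)
```

Since every styling decision has already been made by the stylesheet, a loop of this kind needs no knowledge of the source DTD.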

The following is a template from the stylesheet which we used to achieve this flattening for itemized lists:

<xsl:template match="itemizedlist/listitem/para">
  <para>
    <xsl:choose>
      <xsl:when test="generate-id(../para[1]) = 
                      generate-id(.)">
        <xsl:call-template name="getParaAttrs">
          <xsl:with-param name="keyName" select="'itemizedlist'"/>
        </xsl:call-template>
        <xsl:if test="generate-id(../../listitem[1]) = 
                      generate-id(..)">
          <xsl:attribute name="restartList">
            <xsl:value-of select='1'/>
          </xsl:attribute>
        </xsl:if>
      </xsl:when>
      <xsl:otherwise>
        <xsl:call-template name="getParaAttrs"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:apply-templates select="node()"/>
  </para>  
</xsl:template>

The getParaAttrs named template called above generates attributes that closely match the parameters of the DocWriter API, namely the paragraph style and the level of indenting to use. An example of the interim form is shown below:

<wordBuilder>
   <docTitle charStyle="Default">The Article</docTitle>
   <para paraStyle="sectionStyle">The First Section</para>
   <para paraStyle="subsectionStyle">Character styles</para>
   <para paraStyle="Normal">A paragraph with a title, and 
      <text charStyle="emphasisitalicStyle">emphasis</text>, and a:</para>
   <para paraStyle="Normal">
      <text charStyle="notewarningStyle">Warning: </text>note</para>
   <para paraStyle="subsectionStyle">Lists!</para>

   ...

   <para paraStyle="Normal" indent="2">
      <text charStyle="labelStyle">An three level embedded list</text>
   </para>
   <para paraStyle="itemizedlistStyle" indent="3" restartList="1">This is a really, really, really, really, 
	  really, really, really, really, really, really, really, really, 
	  really, really, really, really, really, really, really, really, 
	  really, really, really, really, really, really, really, really long sentence. </para>

In our system we use a Microsoft Word Template to define all of the styles we want. We used a fairly straightforward naming scheme which closely matches the semantics of the information item. For example, itemizedlistStyle is the paragraph style applied to the first paragraph of each list item in an itemized list.

The client control of formatting is achieved by editing the Word template itself. The formatting of the output can be changed by selecting the appropriately named style in the Word template and changing its characteristics.

One of the advantages of this approach is that adding new styles no longer requires code changes (other than possibly adding unit tests). Stylistic and content changes are achieved by editing the interim stylesheet and creating an appropriately named style in the Word template. The ability to effect fairly significant changes in this manner leaves the system open to a good deal of extension.

5. Limitations

The system performance was acceptable for our project, but would most likely be too slow for a large-scale environment. Virtually all of the processing time consumed during a document transformation is spent in the COM layer, which puts it off limits for performance tuning. Table handling is still fairly simplistic. Although we are using CALS [CALS] tables, we have implemented formatting only for the simplest of instances. This level of support was sufficient for our use cases, but richer table support would be an obvious refinement for more complex table input. Additionally, configuration of the Word template does require better-than-average knowledge of the Word application.

6. Summary

We were greatly pleased with our success. We were able to meet all of the requirements and were left with a robust, extensible system. Although there were a few bumps along the way (the trickier details of the COM API can at times be trying), development went smoothly. Our choice of implementation and judicious use of unit testing and mock objects allowed us to complete development, and respond to lessons along the way, with a minimum of risk to project success.

Bibliography

[ACRO] Adobe Acrobat™: http://www.adobe.com

[CALS] CALS Table Model Document Type Definition: http://www.oasis-open.org/specs/a502.htm

[DOCBOOK] Simplified Docbook DTD by Norman Walsh: http://www.docbook.org/xml/simple/index.html

[PYTHON] Python Programming Language: http://www.python.org

[PYWIN32] Python for Windows Extensions: http://pywin32.sourceforge.net/

[4SUITE] 4Suite™ XML Library: http://4suite.org/index.xhtml

[XSL:FO] XSL Formatting Objects: http://www.w3.org/TR/xsl/

Biography

Josh has a solid background in hard-core mathematics and over five years of experience addressing challenges in a highly versioned/linked problem domain. When facing any challenges, his goals are to surmount them using a solid extensible architecture and implement any solutions using agile methods, test-driven development, and whatever tools are appropriate. His tool-set includes UML, CORBA, Pattern Based Design, XML/SGML, Java, C++, and Python. Josh may be contacted at jreynolds@innodata-isogen.com

John D. Heintz is a Senior Consultant at Innodata Isogen. He has over nine years of experience in software development and formal modeling. For the last two years John has focussed on the versioning and configuration management of hyperdocument systems and their integration with other repositories. When John isn't wrestling with these abstract ideas he is a loving husband, a proud father of a two-year-old son, and a dog owner. John may be contacted at jheintz@innodata-isogen.com