Abstract
Microsoft Word is the most commonly used document preparation system today and is available on nearly every desktop computer. As such, it is typically deployed not only as a typewriter-like, easy-to-use document editor but also, in spite of all prejudice, as a capable document formatting engine that suffices for many applications. However, it lacks direct XML support in both usage scenarios.
A first major drawback of using Microsoft Word to process documents seems to be its WYSIWYG-based user interface paradigm, which does not force users to structure their documents logically while creating them. Simply type and format - that is the typical way Word documents are prepared, often leading to unstructured accumulations of text that may be printed but have no further value. However, this WYSIWYG paradigm may be the simple admission ticket for many typical Microsoft Word users into the world of XML documents. By strictly following some simple rules, they may prepare - with the help of their accustomed document editor - documents that can be translated automatically by tools into XML of reasonably good quality.
A second major drawback of using Microsoft Word seems to be that it is closed to the import of documents in any open format for structured documents, such as XML. Microsoft Word accepts mostly proprietary formats as input, and Microsoft has not opened it to the new ideas that have arisen in the application domain of XML and its accompanying standards.
This paper describes in general terms the basic architecture needed to embed Microsoft Word into an environment enabling the import, export and roundtrip of XML documents. In particular, it describes the two tools downCast and upCast, which implement such an XML architecture for Microsoft Word based on the additional standards XSL and CSS.
Before we look into how Microsoft Word can be embedded as an XML-enabled tool into an XML environment, let us briefly look at both the document as such and its processing, and introduce some terminology. Please be aware that, for the sake of clarity, we simplify facts and procedures that are not crucial to the goal of this paper.
Traditionally, we expect that a document is a thing describing something. We assume that we may hold a document in our hands, e.g. a newspaper or a papyrus; we know that we may find documents in a public library, filed in a cabinet, or even stored on a computer disk. There are many points of view from which a document may be viewed and classified.
In order to get a clear and precise terminology for the rest of this article, let us first define that, in a document preparation process, the term intended document denotes the information someone has in mind (metaphorically speaking: in his head) when he is creating a document. We further call the pure information that stands behind an intended document the abstract information; it consists of logical objects connected by a logical structure.
Analogous to the term intended document, the term perceived document shall denote the information someone has in mind after he has consumed (and thus contemplated and analyzed) a document. Again, as for intended documents, we call the pure information that stands behind a perceived document abstract information.
The term concrete document shall denote the physical presentation of a document, something one may (at least nearly) touch, and we call the information contained in a concrete document concrete information. Let us assume that an instance of a concrete document may be a printed (“ink on paper”) thing, e.g. an article consisting of glyphs arranged left-to-right, with line-breaking and page-breaking applied, printed on sheets of paper, or even a sequence of bar codes or Morse code applied to paper strips. Or it may be something digital. In this second case there is a range from final products such as pixmap, PostScript or PDF documents, where (in a simplified view) all calculations needed to display or print a document have already been carried out, to still editable non-final products such as Microsoft Word or XML documents, where we still have to apply at least some processing in order to be able to touch them.
In order to be able to refer to specific aspects of concrete documents that are of interest in this paper, we will in the following call electronic concrete documents such as XML documents logically marked up documents and, in contrast, electronic concrete documents that exist in a document format such as the one used by Microsoft Word graphically marked up documents.
Further we define that logically marked up documents consist of logical objects connected by a logical structure and that graphically marked up documents consist of layout objects (objects available in Microsoft Word that may be used in order to create documents, e.g. characters, paragraphs, lists, tables, …) connected in a layout structure.
Now that we have introduced the terminology needed to reference all the different stages of a document in a document preparation process and the concepts behind these stages, we may analyze the central steps in this process.
When someone writes a logically marked up document (e.g. an XML document) on a computer system, he may do so by using a plain text editor. This kind of editor offers a rather simple view of the created document: it displays the document as a flat stream of characters containing both the markup and the data of the document.
In order to display the written XML document in a more human-accessible way, we have to apply layout in the so-called formatting process, which transforms the logically marked up document into a graphically marked up document. The principal concept behind this formatting step is quite simple: we specify how single logical objects should be transformed into the formatted output, which consists, as defined, of layout objects. In order to assign visual semantics to the logical objects, these layout objects may offer parameters to individualize the applied formatting, and they may be combined with other layout objects. In simple cases (which we assume to represent a great part of all reasonable cases) the logical structure of a document will be structurally and functionally equivalent to its layout structure.
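The formatting step described here can be illustrated with a minimal sketch. It assumes a toy model in which each logical object kind maps to one layout object with a fixed set of parameters; the class names, the rule table and the parameter names are hypothetical and chosen only for illustration, not taken from any particular formatter.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalObject:
    kind: str                      # e.g. "section", "title", "para"
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class LayoutObject:
    kind: str                      # e.g. "paragraph"
    text: str
    params: dict                   # parameters individualizing the formatting


# Hypothetical formatting rules: logical kind -> (layout kind, parameters).
RULES = {
    "title": ("paragraph", {"font-size": 18, "bold": True}),
    "para": ("paragraph", {"font-size": 11, "bold": False}),
}


def format_document(node):
    """Walk the logical tree depth-first and emit a flat list of layout objects."""
    out = []
    if node.kind in RULES:
        layout_kind, params = RULES[node.kind]
        out.append(LayoutObject(layout_kind, node.text, dict(params)))
    for child in node.children:
        out.extend(format_document(child))
    return out


doc = LogicalObject("section", children=[
    LogicalObject("title", "Introduction"),
    LogicalObject("para", "Documents have structure."),
])
layout = format_document(doc)
```

In this simple case each logical object yields exactly one layout object, which is the situation in which the logical structure and the layout structure are equivalent.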
When someone uses a tool like Microsoft Word to write a document, he creates a graphically marked up document consisting directly of layout objects rather than a logically marked up document. In this case, there is no longer any need to map logical objects onto layout objects. Be aware that, as a consequence, the person who writes the document can create its logical structure only indirectly, by applying formatting that implies the desired logical function. Nevertheless, Microsoft Word offers mechanisms to apply named formatting decisions via a style sheet approach, which lets users change the formatting of equivalent parts of a document (e.g. all headings of level 1) by changing only the definition of a style rule.
It is worth noting here that the concept of layout objects is mentioned and introduced in many different publications, such as the DSSSL standard (“flow objects”) and the XSL standard (“formatting objects”). Probably the most interesting paper that introduced the concept of layout objects, preceding all other publications and definitely worth looking at, was written by M. Murata and K. Hayashi; it goes back to ideas first described and published by P. Pedersen (“hierarchies of formatters”) in conjunction with the ODA standard.
A human being who intends to consume a concrete document has to read it. He processes layout objects displayed on some medium, possibly combined with other layout objects, and assigns semantics to these objects (based on knowledge he acquired while learning to read, combined with other knowledge). In doing so, he synthesizes the perceived document and restores a logical structure that may or may not be consistent with the one expressed by the concrete document or intended by the document’s author, who had his intended document in mind.
In order to restore the logically marked up document from a graphically marked up document (no matter whether this logically marked up document existed or not) one has to perform a processing step that may be called “unformatting”. Details concerning this step are described in the next chapter.
The most commonly used document preparation system today, available on nearly every desktop computer, is Microsoft Word. The typical way in which Microsoft Word is deployed is typewriter-like usage: simply type and format. Because the WYSIWYG-based user interface paradigm does not force users to structure their documents logically during creation, documents created with Microsoft Word may easily have a poor logical structure (in the underlying data structure of the created documents) while still looking good (appearing to have a perfect logical structure) in print - but fortunately they do not necessarily have to.
Microsoft Word may easily be used to create documents that feature a reasonably good logical structure as required in many usage scenarios that occur today. Microsoft Word and its underlying document format RTF (Rich Text Format, a document format standard defined by Microsoft) support the logical markup of documents to a certain extent with the concept of styles for e.g. sections, paragraphs and characters that (we assume this in good faith) may easily be used even by unskilled users. Microsoft Word also offers built-in and dedicated support for advanced features like tables, lists and footnotes in order to create nicely structured documents.
Thus, the WYSIWYG paradigm of Microsoft Word may be the simple admission ticket for many typical Microsoft Word users into the world of XML. Strictly following some simple rules, they may prepare - with the help of their accustomed document editor - logically structured documents.
The only inconvenient fact remaining is that Microsoft Word offers no native support for XML.
In order to embed Microsoft Word into an environment enabling the generation of XML documents (or to import Word documents into any available XML editor), one needs some external tool that is able to convert these documents into XML. We call this direction of a document conversion upcasting (actually performing an unformatting), leading directly to the name of our product: upCast.
From a purely functional point of view, upCast is a standalone RTF to XML converter (usable independently of a Microsoft Word installation) that is implemented in Java. Our marketing department would add that, because of this, upCast can convert to XML not only Microsoft Word documents but all documents created in applications that offer RTF export functionality.
From the more technical point of view that we prefer, upCast is a quite complex system based on ideas developed at the TU München during the last eight years. Since 1999, upCast has been developed on a commercial basis by a small team of programmers that at that time was working on document formatters for structured documents.
The aims we set ourselves for the development of upCast were quite simple. First of all, we had to implement all the knowledge we had gained while working at the TU München in the area of document processing and formatting. At the same time, it was obvious that we had to get as much real-world experience as possible with our product in order to steer our development in the right direction (thus we offer a free unlimited version for private and noncommercial use). We decided that our implementation had to be purely standards-based wherever possible. Java was chosen as the implementation language because its platform independence and object orientation enabled us to use modern design principles for upCast’s architecture. Nonetheless, upCast today offers C/C++ and Visual Basic interfaces in order to embed upCast into any product that needs RTF to XML conversion functionality.
upCast was designed to fit into the widest variety of software architectures one may come across in the area of document management systems by offering a family of products based on the upCast kernel, which implements the core conversion functionality. Today, upCast not only fits into simple scenarios where a single independent workplace at a home office is used to create XML documents, but also plugs easily into workflow processes and company-wide server-based installations.
In order to convert an RTF document, the upCast system first imports the given document from one of several possible sources (e.g. files or network streams). While doing so, the logical structure that is directly specified in the document by means of RTF constructs (via the style sheet and some other constructs offered by Microsoft Word) is analyzed and used to construct an intermediate document tree reflecting the logical structure collected so far.
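As an illustration of how style information alone can yield a first document tree, the following minimal sketch nests sections according to heading styles. It is not upCast's actual algorithm: it assumes that paragraphs have already been read as (style name, text) pairs and that headings use style names of the form "heading N"; the function name and the tree representation are hypothetical.

```python
def build_tree(paragraphs):
    """Build a nested section tree from (style_name, text) pairs."""
    root = {"level": 0, "title": None, "children": []}
    stack = [root]                      # currently open sections, outermost first
    for style, text in paragraphs:
        if style.lower().startswith("heading "):
            level = int(style.split()[1])
            # Close all open sections at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            section = {"level": level, "title": text, "children": []}
            stack[-1]["children"].append(section)
            stack.append(section)
        else:
            # Any other style is treated as body text of the innermost section.
            stack[-1]["children"].append({"para": text})
    return root


paragraphs = [
    ("heading 1", "A"),
    ("Normal", "p1"),
    ("heading 2", "A.1"),
    ("Normal", "p2"),
    ("heading 1", "B"),
]
tree = build_tree(paragraphs)
```

Here section "A.1" ends up as a child of section "A", and "B" becomes a sibling of "A", purely from the style names attached to the paragraphs.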
In a very important second step that ensures the quality of the generated XML, upCast applies heuristics gathered during its initial development and later field usage to the intermediate document tree, in order to infer and create a logical structure that matches as closely as possible the logical structure intended by the author who wrote the source document. This step is absolutely necessary because of the way Microsoft Word, and RTF in particular, work.
The WYSIWYG approach chosen for the GUI of Microsoft Word and the fact that documents processed in Word are graphically marked up may lead to two different classes of problems. The first class is caused by possible artefacts in Word documents that authors cannot perceive while writing a document. A typical example is empty and thus invisible paragraphs: technically, they end a list and start a new one in Microsoft Word documents, even though the perceived document displayed on the screen may suggest to its reader that there is only a single continuous list. The second class of problems is caused by the freedom users have in Microsoft Word while graphically marking up documents. The following might serve as an example: for a human reader, an item in a list may, due to its formatting, consist of a sequence of paragraphs, because all the paragraphs of the item are indented by exactly the same amount as the first paragraph of the item. Microsoft Word, however, ends the item and the list after the first paragraph of the item, due to insufficient data structures offered directly by Microsoft Word.
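The two example problems can be made concrete with a small sketch. It is only an illustration under simplifying assumptions, not the heuristics actually used in upCast: paragraphs are assumed to be dictionaries with the hypothetical keys `text`, `is_list_item` and `indent`, nested lists are ignored, and continuation paragraphs are recognized solely by having the same indent as the list item text.

```python
def infer_lists(paragraphs):
    """Group paragraphs into plain paragraphs and lists of multi-paragraph items."""
    blocks = []        # results: {"para": str} or {"list": [[str, ...], ...], "indent": int}
    current = None     # the list currently being built, if any
    for p in paragraphs:
        if p["text"].strip() == "":
            # Heuristic 1: an empty, invisible paragraph does not end a list.
            continue
        if p["is_list_item"]:
            if current is None:
                current = {"list": [], "indent": p["indent"]}
                blocks.append(current)
            current["list"].append([p["text"]])
        elif current is not None and p["indent"] == current["indent"]:
            # Heuristic 2: a paragraph indented like the item text continues the item.
            current["list"][-1].append(p["text"])
        else:
            current = None
            blocks.append({"para": p["text"]})
    return blocks


paras = [
    {"text": "Intro", "is_list_item": False, "indent": 0},
    {"text": "First", "is_list_item": True, "indent": 36},
    {"text": "more on first", "is_list_item": False, "indent": 36},
    {"text": "", "is_list_item": False, "indent": 0},
    {"text": "Second", "is_list_item": True, "indent": 36},
    {"text": "Outro", "is_list_item": False, "indent": 0},
]
blocks = infer_lists(paras)
```

With this input the result is a single list of two items between the two plain paragraphs, the first item consisting of two paragraphs, which corresponds to what a human reader would perceive.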
Subsequently, after improving the logical structure in this second step, upCast may perform some post-processing steps. These transform the logical structure created so far based on knowledge that is present only in a certain application environment for a specific customer. The structure created in this step is an enriched logical structure that reflects all the information needs existing in that specific environment. The processing for this step may be implemented either in XSL or in Java, depending on the complexity of the respective requirements.
While upcasting graphically marked up documents, upCast relies heavily on the fact that, because of the way authors graphically mark up their documents (within our culture there are very well accepted standards on how to format certain document parts, which we acquire while learning to write), in most cases there is a very close 1:1 relationship between the layout structure of a given document and the logical structure of the same document; otherwise no one could read and understand a document somebody else has written. This is the reason why we developed for upCast a fixed, specific standard DTD that can reflect all possible logical structures for documents created in Microsoft Word. Mathematically speaking, all possible DTDs behind documents created with Microsoft Word are simply equivalence classes of upCast’s own DTD. There always exists a function to map documents marked up in upCast’s DTD into one of these other equivalent DTDs.
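The simplest form such a mapping function can take is a renaming of elements. The following sketch shows this with Python's standard `xml.etree.ElementTree` module; the element names on both sides are hypothetical and are not taken from upCast's DTD or from any real target DTD, and real mappings will usually need structural changes as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: source element name -> target element name.
RENAME = {"document": "article", "heading": "title", "par": "para"}


def map_dtd(root):
    """Rename every element that has an entry in RENAME; leave all others untouched."""
    for elem in root.iter():
        elem.tag = RENAME.get(elem.tag, elem.tag)
    return root


source = ET.fromstring(
    "<document><section><heading>Intro</heading><par>Text</par></section></document>"
)
target = map_dtd(source)
result = ET.tostring(target, encoding="unicode")
```

The content and nesting of the document are preserved; only the vocabulary changes.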
To complete the chapter about upCast, we give a short outline of upCast’s features. upCast fully recreates the logical document structure with automatic section nesting and support for parts. Paragraph and character styles are processed. upCast offers powerful table translation, including row and column spans, and supports both the HTML and the CALS table model. Nested tables are processed properly. Footnotes, hyperlinks and references are converted to elements, page headers and page footers are handled, and document properties (including user-defined properties) are processed. Fields and nested fields are supported. Embedded WMF images may be converted into a pixel format (jpeg, png, bmp or pict). All images may be scaled according to the output resolution of the target platform. upCast deals with any combination of nested lists, tables and any other combination of layout elements that might occur in Microsoft Word documents. upCast fully supports Unicode and UTF encoding of XML documents. On Microsoft Windows systems, upCast can directly convert Microsoft Word binary files if Microsoft Word is installed. upCast offers the direct export of Microsoft Word files to XHTML (content or layout centered), CSS, DocBook and upCast’s own DTD or Schema (content or layout centered). Via customization, upCast can easily support many DTDs used in real-world applications.
As XML slowly becomes an accepted standard format for document-centric applications in industry, and thus more and more XML documents are available (often as the only provided format), the question arises of how to print or further process these XML documents. Again, Microsoft Word is available on most computers and would be a good choice for doing so, but unfortunately Microsoft Word offers no native support for the import of XML.
Thus, in order to embed Microsoft Word into an environment enabling the print output or editing of XML documents, one needs an enabling tool and architecture that is able to convert XML documents into a format that Microsoft Word can read, process and export again. A suitable format for this kind of import is RTF, which supports all the features that are available within Microsoft Word.
Despite all prejudice, Microsoft Word is a capable document formatting engine that suffices for many applications that need to output documents (either in print or on screen), even ones with sophisticated layout. Hence, it is obviously a good idea to use Microsoft Word as a document formatting engine. The question still to be answered is how to get XML files into Microsoft Word.
The approach we have chosen is based on some simple requirements. First of all, it should (of course) be standards-based as far as possible, but at the same time as technically pragmatic and application-oriented as necessary in order to get good results in a reasonable way. Thus, we decided not to use the obvious XSL-FO solution (which would supply us with a general vocabulary describing a formatting task), but to rely on the DTD we had developed for upCast (in order to get a specific vocabulary, adapted to Microsoft Word, describing the desired formatting). If analyzed thoroughly, as done in the chapter about upCast, one will see that the upCast DTD reflects both the logical and the layout structure of documents written in Microsoft Word.
For the general conversion process from XML to Microsoft Word, downCast relies on XML documents that are valid according to the upCast DTD. For specifying layout properties, we have chosen a CSS-like style property approach (actually a CSS 2 subset, with added proprietary properties where necessary) that lets the user specify name-value pairs in order to determine the exact formatting of XML documents in Microsoft Word.
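To give an impression of what such name-value pairs look like, the following sketch parses a small CSS-like rule set into a dictionary. The property names used are ordinary CSS 2 properties; the element names `heading` and `par`, the function name and the parsing approach are assumptions made for illustration and do not describe downCast's actual input format or implementation.

```python
import re


def parse_rules(css):
    """Parse 'selector { name: value; ... }' rules into {selector: {name: value}}."""
    rules = {}
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        props = {}
        for decl in body.split(";"):
            if ":" in decl:
                name, value = decl.split(":", 1)
                props[name.strip()] = value.strip()
        rules[selector.strip()] = props
    return rules


# Hypothetical style rules for two hypothetical element names.
sheet = """
heading { font-size: 18pt; font-weight: bold; }
par { margin-left: 12pt; }
"""
rules = parse_rules(sheet)
```

Each element name is thus associated with the formatting parameters to be applied when the corresponding part of the XML document is written out for Microsoft Word.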
When trying to use Microsoft Word as a two-way or roundtrip XML editor (import and export of XML documents), the requirements for the import process, and thus for the tool used to convert XML documents to Microsoft Word, become even more challenging than described above. We have to ensure that the logical structure of the document imported into Word is congruent with the one written when exporting the document to XML - otherwise this task would be doomed to failure.
For roundtripping XML documents through Microsoft Word, we determined that it is essential not only to convert the logical structure present in the source XML document to a visually equivalent representation, but also to use the structuring mechanisms supported natively in RTF as completely as possible. Only then will we succeed in retrieving that logical structure without unnecessary loss of information when later exporting to XML again. Otherwise, the clues available in the document to be exported in the upcasting process would be comparable to those available in an OCR process, and we all know about the problems that are encountered there.
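The congruence requirement for a roundtrip can be checked mechanically. The sketch below compares only the element structure of two XML documents and ignores text and whitespace; it is a simplified illustration with hypothetical function names and sample documents, not a description of how upCast or downCast verify their output.

```python
import xml.etree.ElementTree as ET


def structure(elem):
    """Return the element tree as nested (tag, [children]) tuples, ignoring text."""
    return (elem.tag, [structure(child) for child in elem])


def same_structure(xml_a, xml_b):
    """True if both documents have the same element names in the same nesting."""
    return structure(ET.fromstring(xml_a)) == structure(ET.fromstring(xml_b))


before = "<doc><sec><title>A</title><p>one</p></sec></doc>"
after_ok = "<doc>\n <sec><title>A</title><p>one</p></sec>\n</doc>"
after_bad = "<doc><title>A</title><p>one</p></doc>"

roundtrip_ok = same_structure(before, after_ok)
roundtrip_bad = same_structure(before, after_bad)
```

In the second comparison the section element has been lost on the way, which is exactly the kind of information loss that the use of native RTF structuring mechanisms is meant to prevent.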
Summing up the roundtrip requirements, we may confirm the decision to use upCast’s DTD as the basis for the development of downCast. As we have seen, importing a document into Microsoft Word with a view to a possibly required export some time later definitely calls for functionality that cannot be supplied by a general-purpose standard like XSL-FO. In addition, the selection of CSS to control details of the desired layout is obvious because of its acceptance and user-friendliness.
We believe that the approach chosen for the architecture of upCast and downCast fits very well for many of today's applications, and also future ones, where Microsoft Word needs to be used in an XML-enabled environment. Some ten thousand installations of upCast worldwide demonstrate its usability, from small projects where only a few persons are working on XML documents up to projects where upCast is integrated in a central client/server software with hundreds of users accessing its functionality. Our latest product, downCast, is already in use with some pilot customers and closes the import gap that still prevents the use of Microsoft Word in a huge number of projects that need both import and export of XML documents.
The choice to implement upCast and downCast as stand-alone tools that may be used fully independently from Microsoft Word on any Java-enabled platform or directly integrated into Microsoft Word gives our customers the freedom to implement XML to Word to XML functionality in any architecture they need in order to solve their content management needs.
We hope that our tools upCast and downCast are useful in promoting the idea of XML in a current world of non-XML tools. We want to thank all the people who have helped us develop and improve our tools so far by giving us excellent feedback, constantly improving the quality of our products over the past years.
Design & Development by deepX Ltd. 2002